Help

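This page introduces the concepts used and the methods provided by the P-Type ATPase Toolbox (PATBox). The features available are classification of P-Type ATPase sequences into subtypes and browsing of an annotated version of UniProt.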

The classification methods both take a FASTA formatted file as input. The annotated version of UniProt is browsable through an easy-to-use interface which allows for filtering of the entries and easy extraction of the results.

FASTA format

The FASTA format is a widely used plain-text format for storing biological sequences. Below, an example of a FASTA file is shown.

>header of the first entry
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>header of the second entry
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA

Each line beginning with a > starts a new entry in the file. The text following the > is free-form and may contain, for example, the accession ID of the entry or some information about the sequence belonging to the entry. The lines that follow, up to the next >, hold the sequence itself.

A FASTA formatted file must be saved as plain text, for example using Notepad. Common file extensions are .fa and .fasta.
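
To make the structure of the format concrete, the following is a minimal Python sketch of how a FASTA file can be split into entries. It is an illustration only and not the parser used by the toolbox; the function name parse_fasta is made up for this example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    entries = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # A sequence may span several lines; collect them all.
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


example = """>header of the first entry
ACGTACGT
ACGT
>header of the second entry
TGCATGCA
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Sequence lines belonging to the same entry are joined, so a sequence may be wrapped over any number of lines.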

Methods for Classification

Once you have created or obtained a FASTA file containing the P-Type ATPase sequences whose subtype you want to determine, open it in a plain-text editor such as Notepad and copy and paste its contents into the text area on the classification page.

We provide access to two different classification methods. The main method, described in [1], is based on k-nearest neighbors and provides the best classification accuracy. The secondary method is based on a method described in [2]; it is faster but provides a lower classification accuracy. We therefore recommend the main method unless speed is essential.
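
For readers unfamiliar with k-nearest neighbors, the general idea can be sketched in a few lines of Python. This is a toy illustration and not the method of [1]: the features (overlapping 2-mer counts), the Euclidean distance, the value of k, and the labels "X" and "Y" are all assumptions chosen to keep the example small.

```python
from collections import Counter
import math


def kmer_profile(sequence, k=2):
    """Count the overlapping k-mers in a sequence (assumed feature choice)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def distance(p, q):
    """Euclidean distance between two k-mer count profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p[key] - q[key]) ** 2 for key in keys))


def knn_classify(query, labelled, k_neighbours=3, k=2):
    """Return the majority label among the k_neighbours closest sequences."""
    query_profile = kmer_profile(query, k)
    ranked = sorted(
        labelled,
        key=lambda item: distance(query_profile, kmer_profile(item[0], k)),
    )
    votes = Counter(label for _, label in ranked[:k_neighbours])
    return votes.most_common(1)[0][0]


# Made-up training sequences with hypothetical labels.
labelled = [
    ("AAAAAA", "X"), ("AAAAAT", "X"), ("AAATAA", "X"),
    ("GGGGGG", "Y"), ("GGGGGC", "Y"), ("GGCGGG", "Y"),
]

print(knn_classify("AAAATA", labelled))
print(knn_classify("GGGCGG", labelled))
```

A query sequence receives the label that is most common among the labelled sequences closest to it.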

When a classification job has been submitted, it will run as soon as resources are available, and the results are shown once it has finished. The query sequences are listed grouped by their predicted subtype.

A plot is generated showing the number of sequences predicted to belong to each subtype. Clicking a bar of a specific subtype in the plot will scroll to the results for that subtype.

The results can be downloaded as a comma-separated values (CSV) file, which can be opened with Excel or any other spreadsheet program for further analysis.
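
The downloaded file can also be processed with a script. The following Python sketch counts the sequences per subtype, mirroring the plot described above. The column names sequence_id and predicted_subtype and the rows are assumptions made for this example; check the header of the actual file and adjust the names accordingly.

```python
import csv
import io
from collections import Counter


def count_by_subtype(csv_file, subtype_column="predicted_subtype"):
    """Count rows per subtype; the column name is an assumed placeholder."""
    reader = csv.DictReader(csv_file)
    return Counter(row[subtype_column] for row in reader)


# Made-up stand-in for a downloaded results file.
toy_results = io.StringIO(
    "sequence_id,predicted_subtype\n"
    "seq1,2A\n"
    "seq2,1B\n"
    "seq3,2A\n"
)

counts = count_by_subtype(toy_results)
for subtype, n in sorted(counts.items()):
    print(subtype, n)
```

For a real file, replace the StringIO object with open("results.csv", newline=""), where results.csv stands for whatever name the download was saved under.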

Browsing the Database

The database made available by PATBox is a resource for exploratory research and discovery of P-Type ATPases. The database is constructed from the UniProtKB database through the following steps:

  1. scan through all sequences and only keep those containing the PROSITE motif D-K-T-G-T-[LIVM]-[TI] (PS00154) characteristic of P-Type ATPases,
  2. classify the sequences with the k-NN and SeqL methods,
  3. annotate each sequence with the predicted subtype,
  4. if a sequence has a known subtype (i.e. is in the curated dataset), annotate it with the known subtype.

The database thus provides easy access to probable P-Type ATPase sequences and their predicted subtypes.
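
Step 1 of the construction can be sketched in Python by translating the PROSITE pattern into a regular expression. This is a minimal illustration, not the code used to build the database; the function names and the toy sequences are made up, and steps 2 to 4 are not covered.

```python
import re

# PROSITE PS00154, D-K-T-G-T-[LIVM]-[TI], written as a regular expression.
P_TYPE_MOTIF = re.compile(r"DKTGT[LIVM][TI]")


def has_p_type_motif(sequence):
    """Return True if the protein sequence contains the motif anywhere."""
    return P_TYPE_MOTIF.search(sequence.upper()) is not None


def filter_candidates(entries):
    """Keep only the (header, sequence) pairs whose sequence has the motif."""
    return [(header, seq) for header, seq in entries if has_p_type_motif(seq)]


# Made-up toy sequences, not real UniProtKB entries.
entries = [
    ("toy_with_motif", "MAAICSDKTGTLTQNRM"),
    ("toy_without_motif", "MAAICSDKTGALTQNRM"),
]

for header, _ in filter_candidates(entries):
    print(header)
```

Only the first toy sequence is kept, because the second has an A where the motif requires a T.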

Contact

Questions, comments, feature requests, and bug reports are very welcome and should be directed to Dan Søndergaard at das@birc.au.dk.

References

  1. Søndergaard, D. & Pedersen, C.N.S. PATBox: A toolbox for classification and analysis of P-Type ATPases. 2015. (Pending publication).
  2. Ifrim, G. & Wiuf, C. Bounded coordinate-descent for biological sequence classification in high dimensional predictor space. 2011.