Sequence-based protein families are classified according to a profile derived from a multiple-sequence alignment. The profile may span a long domain (typically 100 residues or more) or may be confined to short sequence motifs. Classification methods based on profiles across long domains tend to be more reliable but less sensitive than those based on short sequence motifs.
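To make the idea of a profile concrete, the following is a minimal sketch of a position-specific scoring profile built from an alignment. It is not the method of any database discussed here (Pfam, for instance, uses hidden Markov models). The toy alignment, the uniform background frequency of 1/20, the pseudocount of 1, and the function names `build_profile` and `score_segment` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency


def build_profile(alignment, pseudocount=1.0):
    """Build one log-odds score table per alignment column.

    Gap characters (anything outside AMINO_ACIDS) are ignored when counting.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the per-column log-odds scores for an ungapped segment."""
    if len(segment) != len(profile):
        raise ValueError("segment length must match profile length")
    # Unknown residue codes contribute a neutral score of 0.
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, segment))


if __name__ == "__main__":
    toy_alignment = ["ACDY", "ACEY", "GCDY"]  # hypothetical family
    toy_profile = build_profile(toy_alignment)
    print(score_segment(toy_profile, "ACDY"))  # family-like segment
    print(score_segment(toy_profile, "WWWW"))  # unrelated segment
```

A segment that resembles the aligned family scores above zero, whereas an unrelated segment scores below zero. A longer profile accumulates evidence over more columns, which is one way to see why long-domain profiles tend to be more reliable.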
Several sequence-based methods focus more on profiles across long domains, including Pfam (Bateman et al., 1999), ProDom (Corpet et al., 1999), SBASE (Murvai et al., 1999), and Clusters of Orthologous Groups (COG; Tatusov et al., 1997). These methods differ in the techniques used to construct families. Pfam builds multiple-sequence alignments of many common protein domains using hidden Markov models. The ProDom protein domain database consists of homologous domains based on recursive PSI-BLAST searches (unit 2.5). SBASE is organized through BLAST neighbors and is grouped by standard protein names that designate various functional and structural domains of protein sequences. COG aims to find ancient conserved domains by delineating families of orthologs across a wide phylogenetic range.
The following example shows the Pfam entry for the GRIP domain (accession number PF01465). Pfam provides this annotation for the entry:
The GRIP (golgin-97, RanBP2alpha, Imh1p and p230/golgin-245) domain is found in many large coiled-coil proteins. It has been shown to be sufficient for targeting to the Golgi. The GRIP domain contains a completely conserved tyrosine residue.
References supporting this annotation are also given. In addition, Pfam gives the alignment between the family members:
KNEKIAYIKNVLLGFLEHKE----QRNQLLPVISMLLQLDSTDEKRLVMS Q06704
MLIDKEYTRNILFQFLEQRD----RRPEIVNLLSILLDLSEEQKQKLLSV O42657
EPTEFEYLRKVMFEYMMGR-----ETKTMAKVITTVLKFPDDQAQKILER O70365
STSEIDYLRNIFTQFLHSMGSPNAASKAILKAMGSVLKVPMAEMKIIDKK Q18013
The alignment shows accession numbers and the range of each sequence. One can identify some features of the family through this pattern (i.e., from particularly conserved residues at specific alignment positions).
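As a small illustration of reading features off such an alignment, the sketch below scans the four GRIP rows shown above and reports every column in which all sequences carry the same residue. The function name `conserved_columns` is hypothetical, and the strict "identical in every row" criterion is an assumption; real family analyses usually work with many more sequences and tolerate conservative substitutions.

```python
# Alignment rows copied from the GRIP example in the text (accessions omitted).
ALIGNMENT = [
    "KNEKIAYIKNVLLGFLEHKE----QRNQLLPVISMLLQLDSTDEKRLVMS",
    "MLIDKEYTRNILFQFLEQRD----RRPEIVNLLSILLDLSEEQKQKLLSV",
    "EPTEFEYLRKVMFEYMMGR-----ETKTMAKVITTVLKFPDDQAQKILER",
    "STSEIDYLRNIFTQFLHSMGSPNAASKAILKAMGSVLKVPMAEMKIIDKK",
]


def conserved_columns(alignment, gap="-"):
    """Return (1-based column, residue) pairs for fully conserved columns.

    A column counts as conserved only if every row has the same residue
    and that residue is not the gap character.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    hits = []
    for index, column in enumerate(zip(*alignment), start=1):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            hits.append((index, column[0]))
    return hits


if __name__ == "__main__":
    for position, residue in conserved_columns(ALIGNMENT):
        print(position, residue)
```

For these four rows, the invariant tyrosine mentioned in the annotation falls at alignment column 7. With so few sequences, some columns may appear conserved by chance, so conclusions should rest on the full family alignment.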
Some methods are based on "fingerprints" of small conserved motifs in sequences, as with PROSITE (Hofmann et al., 1999), PRINTS (Attwood et al., 1999), and BLOCKS (Henikoff et al., 1999). In protein sequence families, some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein or for the maintenance of its three-dimensional structure, and hence are suitable for fingerprinting. The fingerprints can be used to assign a newly sequenced protein to a specific family. Fingerprints are derived from gapped alignments in PROSITE and PRINTS, but are derived from ungapped alignments (corresponding to the highly conserved regions in proteins) in BLOCKS. A fingerprint in PRINTS may contain several motifs from PROSITE, and thus may be more flexible and powerful than a single PROSITE motif. PRINTS can therefore provide a useful adjunct to PROSITE. Note that some functionally unrelated proteins may be classified together because of chance matches in short motifs.
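Short-motif matching of the kind described above can be sketched by translating a PROSITE-style pattern into a regular expression. The converter below handles only a simplified subset of the pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` / `>` anchors). The function names and the pattern `Y-x(2)-[NK]-[IV]` are hypothetical; the pattern was written by hand from the four GRIP rows shown earlier and is not a real PROSITE entry.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Excluded residues: {AB} -> [^AB] (done before repeats reuse braces).
        element = element.replace("{", "[^").replace("}", "]")
        # Repeat counts: (n) or (n,m) -> {n} or {n,m}.
        element = element.replace("(", "{").replace(")", "}")
        # Wildcard residue and sequence-end anchors.
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 1-based start positions of non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [match.start() + 1 for match in regex.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical pattern around the conserved tyrosine of the GRIP rows.
    hypothetical_pattern = "Y-x(2)-[NK]-[IV]"
    # First GRIP row from the text, with gap characters removed.
    sequence = "KNEKIAYIKNVLLGFLEHKEQRNQLLPVISMLLQLDSTDEKRLVMS"
    print(prosite_to_regex(hypothetical_pattern))
    print(find_motif(hypothetical_pattern, sequence))
```

Because such a pattern constrains only a handful of positions, it can also match unrelated sequences by chance, which is the sensitivity-versus-reliability trade-off the text describes.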
Other sequence-based protein family databases integrate multiple sources. The ProClass database (Wu et al., 1999) is a nonredundant protein database organized according to family relationships as defined collectively by PROSITE patterns and PIR superfamilies. The MEGACLASS server (States et al., 1993) provides classifications by different methods, including Pfam, BLOCKS, PRINTS, ProDom, and SBASE. The MOTIF search engine at http://motif.genome.ad.jp/ includes PROSITE, BLOCKS, ProDom, and PRINTS.