PROSITE (Falquet et al., 2002) is a database of patterns and profiles. PROSITE patterns are built from alignments of related sequences taken from a variety of sources, e.g. from a well-characterized protein family or from the literature. The alignments are checked for conserved regions, which may have been shown experimentally to be involved in catalytic activity or in substrate binding. A core pattern is then written as a regular expression that specifies which amino acids may occur at each position. Where a position is conserved throughout the alignment, only one amino acid is specified. Where either of two amino acids may occur at a position, they are listed in square brackets: [AC] means that the position may be occupied by either alanine (A) or cysteine (C). An x indicates that any amino acid may occur at a position, and a number in parentheses repeats it, so x(3) stands for three consecutive positions that may each hold any amino acid. Curly brackets {} list the amino acids that may not occur at a given position. As an example, the pattern [AC]-x-V-x(4)-{ED} describes a region of sequence where position 1 is occupied by A or C, position 2 may be any amino acid, position 3 is a valine (V), positions 4 to 7 may be occupied by any amino acid, and position 8 may be any amino acid except glutamic acid (E) or aspartic acid (D). Once a core pattern has been identified, it is tested against the sequences in Swiss-Prot. If it matches the correct set of proteins, it is kept; if it misses some family members or picks up too many unrelated proteins, the pattern is refined and retested until it is optimized.
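
The pattern notation described above maps closely onto ordinary regular expressions, and a small sketch may make the correspondence concrete. The Python below is an illustration only, not PROSITE's own software: the function names prosite_to_regex and find_matches and the test sequences are hypothetical, only the syntax elements mentioned in the text are handled (single residues, [..], {..}, x and a fixed repeat count), and sequences are assumed to consist of uppercase one-letter amino acid codes.

```python
import re

# One pattern element: a residue, a [..] set, a {..} exclusion or x,
# optionally followed by a fixed repeat count such as (4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = match.groups()
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # residues that may not occur
        # Single letters and [..] sets are already valid regex syntax.
        if count is not None:
            core += "{" + count + "}"
        pieces.append(core)
    return "".join(pieces)


def find_matches(pattern, sequence):
    """Return (1-based start position, matched region) for each match."""
    regex = re.compile(prosite_to_regex(pattern))
    # finditer reports non-overlapping matches only.
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    pattern = "[AC]-x-V-x(4)-{ED}"
    print(prosite_to_regex(pattern))             # [AC].V.{4}[^ED]
    print(find_matches(pattern, "MKAGVLLLLKQ"))  # [(3, 'AGVLLLLK')]
    print(find_matches(pattern, "MKAGVLLLLEQ"))  # []
```

In the second test sequence the region fails only at position 8, where glutamic acid (E) is excluded by {ED}, so no match is reported.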

Patterns have many advantages, but they are of limited use across whole sequences, which is why PROSITE also builds profiles to complement them. A profile is built from a multiple sequence alignment, using a symbol comparison table to convert residue frequency distributions into weights. The result is a table of position-specific amino acid weights and gap costs (Gribskov, Luthy and Eisenberg, 1990). In other words, profiles are matrices describing how likely each amino acid is to be found at a given position in the sequence. The numbers in the table (scores) are used to calculate a similarity score between a profile and a sequence for a given alignment. For each set of sequences a threshold score is calculated so that only sequences scoring above it are considered true matches, i.e. related to the original set of sequences in the alignment. The profile is tested against the sequences in Swiss-Prot and refined until only the intended set of protein sequences scores above the threshold. Profiles produced by PROSITE begin as preliminary profiles; once they have been tested and approved, they are integrated into the database.
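
The idea of position-specific weights and a threshold can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the PROSITE profile format or scoring method: gap costs are left out, so the profile is only compared against ungapped windows of the sequence. The comparison table (+2 for identical residues, -1 otherwise), the toy alignment, the threshold of 4.0 and all function names are assumptions made up for the example; a real profile would use a substitution matrix and an empirically calibrated threshold.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_comparison(a, b):
    """Hypothetical symbol comparison table: +2 for identity, -1 otherwise."""
    return 2.0 if a == b else -1.0


def build_profile(alignment):
    """Turn an ungapped alignment into a table of per-position weights."""
    profile = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        freqs = {res: n / len(column) for res, n in Counter(column).items()}
        # Weight of residue a at this position: its comparison scores
        # against the observed residues, averaged by their frequencies.
        profile.append({
            a: sum(f * toy_comparison(a, b) for b, f in freqs.items())
            for a in ALPHABET
        })
    return profile


def best_window_score(profile, sequence):
    """Highest ungapped score of the profile over all sequence windows."""
    width = len(profile)
    scores = [
        sum(profile[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


def is_match(profile, sequence, threshold):
    """A sequence counts as a match only if it scores above the threshold."""
    return best_window_score(profile, sequence) > threshold


if __name__ == "__main__":
    profile = build_profile(["ACVK", "ACVR", "GCVK"])
    for seq in ("MMACVKMM", "MMGGGGMM"):
        print(seq, best_window_score(profile, seq), is_match(profile, seq, 4.0))
```

With this toy alignment, a sequence containing ACVK scores well above the assumed threshold while an unrelated sequence scores below it. Refining a profile, in these terms, means adjusting the weights and the threshold until only the intended sequences pass.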

