The knowledge of the function of proteins is the most fundamental prerequisite for understanding the biological processes in a cell or organism. Whereas the genomes of a growing number of species have been or are being completely sequenced (around 1000 today), information on the function of their proteins is still limited, often partial, inaccurate, or even plainly wrong. This section concentrates on the various bioinformatics approaches that aim at narrowing the gap between the knowledge on protein sequences and protein function. There are several recent reviews focusing on this topic.183,184
22.214.171.124 What is Protein Function?
What we perceive to be the function of a protein depends on the particular context. From the biochemist's point of view, the function of a protein refers to its binding partners, the respective modes of binding, and the reactions it may catalyze. From the biologist's point of view, the role of the protein in a complex biological process (cell cycle, apoptosis, etc.) or even its localization inside the cell (cytosolic, membrane standing, extracellular, etc.) is an attribute of its function. The different meanings that the notion 'protein function' can acquire in different contexts call for a suitable ontology, that is, a structured vocabulary for talking about protein function.185
Here we present some of the work done to establish a unifying ontology for protein function.186 The first approach is the classification of enzymes by the Enzyme Commission.187 This notational scheme classifies enzymes hierarchically by the reactions that they catalyze. The location in the hierarchy is represented by the so-called EC number, a four-level numerical code (similar to IP addresses on the internet) in which each successive number homes in on the catalyzed reaction to a growing level of detail. Invented more recently, the MIPS Functional Catalog contains a larger variety of protein function categories.188 Originally defined for handling functional annotations of the yeast genome, the MIPS Functional Catalog has been broadened to apply to a wide range of organisms. Like the EC numbers, the MIPS categories are organized in a hierarchical treelike structure. However, the MIPS catalog has six levels, whereas the EC uses only four levels. One of the most extensive efforts in establishing an ontology for protein function was undertaken by the international Gene Ontology Consortium, which maintains the Gene Ontology (GO).189 Similar to the approaches described above, the GO is organized hierarchically, but, here, there are three separate hierarchies representing three fundamentally different views on protein function: cellular component, molecular function, and biological process. Furthermore, the GO categories can belong to several supercategories; for instance, the category 'cell growth and/or cell maintenance' refines the two categories 'cellular process' and 'physiological processes.' This generalizes the treelike structure of the hierarchy to that of a directed acyclic graph.
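The difference between a strict tree (EC numbers) and a directed acyclic graph (GO) can be made concrete with a small sketch. The parent table below is a toy fragment built from the category names mentioned above; placing both parents under a 'biological process' root is an assumption made for illustration, and the helper names are hypothetical.

```python
# Toy ontology fragment: each term maps to its direct parent terms.
# Assumption: both parents sit under a 'biological process' root.
PARENTS = {
    "cell growth and/or cell maintenance": ["cellular process", "physiological processes"],
    "cellular process": ["biological process"],
    "physiological processes": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward (DAG walk)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def ec_levels(ec_number):
    """List the successively more specific levels of a four-level EC number."""
    parts = ec_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


print(ancestors("cell growth and/or cell maintenance"))
print(ec_levels("1.1.1.1"))  # ['1', '1.1', '1.1.1', '1.1.1.1']
```

In a tree, every term has exactly one path to the root, which is why an EC number can be read as a prefix chain; in the DAG, a term inherits from several parents, so an upward walk has to track visited terms.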
Such ontologies are valuable tools for categorizing proteins by their function. They are under continuing development, and, to a growing extent, functional annotations of genes and proteins are being linked to the ontologies in databases or in the literature. These annotations provide an essential basis for today's protein function prediction programs. But when dealing with phenomena such as functional relationships between proteins, multiple protein functions, and function variability of a protein under different physiological conditions, a protein function annotation guide is necessary, in addition to the largely context-free hierarchical ontology. Although the GO provides a basic annotation guide with its ontology, the need for a highly expressive, consistent, and standardized framework for annotating protein function still remains. As the first larger regulatory networks are being understood,190 initial efforts are being undertaken to develop the expressive tools for this purpose. An example of an application in a supracellular scenario has also been presented.191
The structure of a protein is mainly determined by the sequence of its coding gene. Since the structure of a protein determines its function, the function of a protein must be encoded largely by the sequence of its gene. This observation directly leads to the conclusion that two genes that are similar in sequence should code for proteins with similar function. In many cases, this conclusion is applicable to the prediction of protein function from sequence, especially to those parts of the protein that are most relevant to its function. On the other hand, there are pairs of proteins whose sequences are very similar but whose functions differ significantly. As a rule, sequence similarity originates from a common evolutionary origin. Evolutionarily related genes or proteins are also called homologous. Homologous genes can occur in different species or in the same species. As a gene is handed down from generation to generation, its sequence may change slightly, whereas the function of the gene remains the same. Such ancestor-descendant relationships between genes are denoted by the term 'orthology.' Thus, orthologous genes are found in different species that have a common ancestor in which the genes originate from the same ancestral gene. In general, gene function is maintained in orthologous genes. However, sequence similarity can also arise due to the duplication of a gene within the same organism. A gene duplication results in two identical copies of the gene within the same genome. One of the copies remains subject to the selective pressure to maintain the original function of the gene. The second copy can acquire a new function. Such genes that have a common evolutionary origin but are separated by a gene duplication are called paralogous.
It is hard to determine from sequence similarity alone whether two genes are orthologous or paralogous, and, in fact, the exact definitions of orthology and paralogy are still subject to debate.192 Since orthologous proteins are likely to have the same function, the detection of orthology is one of the mechanisms used to infer protein function from sequence. Transferring function between paralogous gene pairs leads to false predictions in this context. Clearly, a correct computational categorization of homology into orthology and paralogy requires the application of phylogenetics. There have been approaches to identify gene duplications in the ancestral history of the involved species, but they are mathematically complex and computationally demanding.193,194 More practical is the use of heuristics, such as the program Orthostrapper, which only analyzes single pairs of genes and does not aspire to map complete phylogenies.195
As an approximation to phylogenetic analysis of orthology, one can use whole-genome sequence data and reduce the concept of orthology to a pure sequence similarity criterion. Specifically, if two proteins in two different genomes retrieve each other as the top hit in a sequence database search (e.g., using BLAST) of the respective genome, then there is a good chance that they are orthologous. One can extend this analysis to larger groups of genes. A set of genes, each of them in a different species, which point to each other as the top hit in a BLAST search, forms a so-called cluster of orthologous groups (COG). A database of COGs has been developed.196,197 Since such an approach forgoes all phylogenetic analysis, it is likely to produce quite a number of false-positive hits. Hence, the resulting COG database has to be carefully curated manually. Another protein database that provides orthology information based on this model is SMART.198
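The mutual top-hit criterion can be sketched in a few lines. This is a minimal illustration under stated assumptions: the input dictionaries stand in for parsed similarity-search results (for example, BLAST bit scores), the gene identifiers are made up, and the function names are hypothetical rather than part of any existing tool.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict {(query, subject): score}, e.g. parsed bit scores.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) that retrieve each other as the top hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy search results between genome A and genome B (scores are invented).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 120, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 195, ("b2", "a2"): 118, ("b2", "a1"): 40}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```

Here a3 also points to b1, but b1 does not point back, so the pair is rejected. A recent duplication in one genome can still mislead the criterion, which is one reason manual curation remains necessary.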
The application of the concept of orthology is not limited to complete gene or protein sequences. Only a fraction of the sequence of a gene actually codes for those elements of a protein that are characteristic for a specific function. This affords an approach to function prediction by simply searching the protein for such short patterns of sequence that are indicative of a certain function, the so-called sequence motifs. For example, a well-known motif for an ATP-binding site is the P loop. Its characterizing sequence fragment has length 8, respecting the pattern [AG]-x(4)-G-K-[ST]. This motif pattern defines positional constraints for an amino acid sequence: the single upper-case letters denote unique residues according to the single-letter code for amino acids. The letters in square brackets designate the respective choices of one amino acid from two (or more, in general) available alternatives, and x(4) represents an insertion of four arbitrary adjacent amino acid residues at the respective position. Such motifs are typically retrieved from multiple sequence alignments (see Section 126.96.36.199) of protein sequences that are known to share a common function. From such an alignment, the highly conserved regions of the sequences are selected to define the characteristic positional constraints of the motif by consensus. The first database utilizing sequence motifs for the representation of functional sites of proteins was PROSITE.199,200 Whereas motif generation started exclusively manually, methods have been developed to derive them automatically.201,202 The weighting of sensitivity against specificity is a central issue in motif generation: a strict motif is likely to reject true positives - proteins that have the function represented by the motif. On the other hand, an imprecise motif may accept many false positives: protein sequences that contain the motif but do not have the respective function. 
In general, different motifs can be combined for a search, in order to ensure that most of the true positives comprise at least one of these motifs. The EMOTIF Database203 contains motifs generated with this approach.
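The pattern notation described above maps almost directly onto regular expressions. The following sketch translates only the three elements explained in the text (single residues, bracketed alternatives, and x or x(n) wildcards) and is not a full parser for the PROSITE syntax; the function names and the test sequence are illustrative assumptions.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Supports single residues, [..] alternatives, and x / x(n) wildcards only.
    """
    parts = []
    for element in pattern.split("-"):
        wildcard = re.fullmatch(r"x(?:\((\d+)\))?", element)
        if wildcard:
            count = wildcard.group(1)
            parts.append("." if count is None else ".{%s}" % count)
        else:
            # A single residue or a [..] class is already valid regex syntax.
            parts.append(element)
    return "".join(parts)


P_LOOP = prosite_to_regex("[AG]-x(4)-G-K-[ST]")


def find_motif(sequence, regex=P_LOOP):
    """Return (start, matched fragment) for every, possibly overlapping, hit."""
    lookahead = re.compile("(?=(%s))" % regex)
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


print(P_LOOP)                              # [AG].{4}GK[ST]
print(find_motif("MTEYKLVVVGAGGVGKSALT"))  # [(9, 'GAGGVGKS')]
```

Loosening the pattern, for example by widening a bracketed class, raises sensitivity at the cost of specificity, which is the trade-off discussed above.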
PSSMs or HMMs (see Sections 188.8.131.52, 184.108.40.206, 220.127.116.11, and 18.104.22.168.2) are other approaches for characterizing conserved regions in protein sequences. They are more suitable for longer sequence fragments that span segments up to complete protein domains, and they can be generated easily on the basis of multiple alignments.204 Several databases that make use of PSSMs and HMMs in this context exist. Several motif and domain databases have been integrated in the domain database InterPro.205
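To illustrate how a PSSM can be derived from a multiple alignment, the sketch below computes per-column log-odds scores. It makes several simplifying assumptions that are not taken from the text: the alignment is ungapped, the background distribution is uniform over the 20 amino acids, and a flat pseudocount is used. Production tools use sequence weighting and more refined background models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes an ungapped alignment and a uniform background (illustrative only).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_fragment(pssm, fragment):
    """Sum the column scores for a fragment of the same length as the PSSM."""
    if len(fragment) != len(pssm):
        raise ValueError("fragment length must match the number of PSSM columns")
    return sum(column[aa] for column, aa in zip(pssm, fragment))


# Toy alignment of three made-up fragments.
pssm = build_pssm(["GKST", "GKSS", "GKTT"])
print(score_fragment(pssm, "GKST") > score_fragment(pssm, "AAAA"))  # True
```

Unlike a motif pattern, which accepts or rejects a fragment, the PSSM yields a graded score, so the sensitivity/specificity balance is set by a score threshold rather than by the pattern itself.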
Motif and domain databases are complemented by supervised learning methods, primarily based on neural networks. They are able to predict many aspects of protein function; for example, cellular localization (through the analysis of signal peptides and the prediction of transmembrane helices) and posttranslational modification features (glycosylation and phosphorylation sites). The Center for Biological Sequence Analysis at the Technical University of Denmark, Lyngby, maintains a server that offers many such methods, and integrates them in the ProFun method. This method classifies proteins according to their predicted function (e.g., enzyme class or participation in a biological process such as amino acid biosynthesis or energy metabolism). The overall functional prediction is made on the basis of a large number of methods analyzing or predicting features of the protein that can be derived directly from the sequence, simple ones that can be computed directly (e.g., sequence length, charge and amino acid composition), and more complex ones that need to be predicted (e.g., secondary structure, signal peptides, and sites for posttranslational modifications). The predictors for some function classes reach high rates of accuracy, in some cases up to 90%.206
Protein-protein interactions make up an important part of all intracellular interactions. In the late 1990s, assays for measuring protein interaction data were developed by scaling up the yeast two-hybrid (Y2H) method207 to cover substantial parts of a genome.208 In principle, these procedures generate a binary matrix with the dimensions (number of proteins) x (number of proteins) defining, for each pair of proteins, whether they interact or not. Another procedure, named tandem-affinity purification, can latch onto a protein and pull out with it a whole attached complex of proteins. The result is a set of protein complexes instead of a binary matrix.209 Bioinformatics can utilize these data for the analysis of cell-wide protein networks.210,211
Both procedures generate interaction data that are contaminated by a significant number of false negatives and false positives. One explanation for this is that laboratory procedures inadequately imitate the natural cellular conditions, which biases protein binding. For instance, laboratory procedures are notorious for being unable to differentiate between binding events in different physiological states or to distinguish between transient binding partners and those that bind to each other over longer time periods. Still, the resulting data are used as a basis for the construction and bioinformatics analysis of protein interaction networks.212-215 These methods mainly follow the unsupervised learning paradigm (i.e., they cluster the data). The resulting clusters represent putative groups of functionally related proteins.216 This 'guilt by association' approach can also point to interesting drug targets.217 The global topology of the constructed interaction networks is a target of fundamental studies.218,219 In the face of the noise present in protein interaction data, the combined analysis of protein interaction data and other protein function data appears most promising (see Section 22.214.171.124). The amount of publicly available protein interaction data in repositories is growing significantly.220
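As a rough illustration of the clustering idea, the sketch below groups proteins by the crudest possible criterion, connected components of the interaction graph. This is a deliberately simple stand-in, not the clustering used in the cited studies, and the protein identifiers and function names are hypothetical.

```python
from collections import defaultdict


def build_graph(pairs):
    """Build an undirected adjacency map from a list of interacting pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the groups of proteins linked by any chain of interactions."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy binary interaction data (identifiers are invented).
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
clusters = connected_components(build_graph(pairs))
print(clusters)
```

With noisy data, a single false-positive interaction is enough to merge two unrelated components, which is why practical methods prefer density-based clusters and why combining interaction data with other evidence is attractive.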
The presence of a growing number of completely sequenced genomes has afforded entirely new ways of approaching the quest for elucidating protein function. The new power afforded by complete genomes is that one can reason not only about the presence of a protein but also about its absence. Analysis of patterns of groups of genes occurring in different organisms by so-called genomic context methods is a powerful tool for obtaining information on protein function. In order to discern functional associations in proteins on the basis of genome sequences, approximately 30 completely sequenced genomes are required. Today, this number has been reached for prokaryotes, but not for higher organisms such as mammals. The database STRING provides predictions of the functional associations between proteins that have been derived with genomic context methods, among others.221
Genomic context methods usually reveal functional associations between proteins that are more general than physical binding. Additional functional associations include taking part in the same biological process (cell cycle, apoptosis, etc.) or cooperating in producing a certain phenotype (genetic association). Several genomic context methods are briefly discussed below.
When two genes in two species occur in close proximity in both genomes, they are likely to be functionally associated. This evidence becomes stronger if the number of involved species increases.222 The respective order of the genes along the genome may also be used for the analysis.223 There are approaches that generalize this method by revealing general conservation patterns that present cycles of association that alternate between homology and neighborhood.
126.96.36.199.2 Domain fusion
This method utilizes the observation that two proteins are likely to be binding partners if their genes are fused in another species.226,227
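The fusion criterion can be sketched as a set comparison on domain compositions. The sketch assumes that each protein has already been assigned a set of domain family identifiers by some upstream step; the identifiers, gene names, and function name are hypothetical.

```python
from itertools import combinations


def fusion_links(separate, fused):
    """Find protein pairs whose domains co-occur in one protein elsewhere.

    separate: {protein: set of domain ids} in the species of interest.
    fused:    {protein: set of domain ids} in another species.
    Returns (protein_a, protein_b, fused_protein) triples.
    """
    links = []
    for a, b in combinations(sorted(separate), 2):
        domains_a, domains_b = separate[a], separate[b]
        # Skip unannotated proteins and pairs that merely share a domain.
        if not domains_a or not domains_b or domains_a & domains_b:
            continue
        for name in sorted(fused):
            if domains_a | domains_b <= fused[name]:
                links.append((a, b, name))
    return links


separate = {"geneA": {"D1"}, "geneB": {"D2"}, "geneC": {"D3"}}
fused = {"fusedAB": {"D1", "D2"}}
print(fusion_links(separate, fused))  # [('geneA', 'geneB', 'fusedAB')]
```

Requiring the two domain sets to be disjoint avoids reporting ordinary homologs as fusion partners; very common domains would still produce spurious links and would need filtering in practice.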
The concurrent presence or absence of two proteins in a species may provide evidence for their functional association.228 The corresponding comparison of proteins from different species was originally based on a binary classification of protein pairs as orthologous or not orthologous. This approach has been improved through the incorporation of gradual levels of evolutionary distance. STRING and PLEX are servers that provide analysis of protein function based on phylogenetic profiles.
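The original binary form of a phylogenetic profile is a presence/absence vector over genomes, and two profiles can be compared position by position. The sketch below uses the fraction of genomes in which both proteins agree; this scoring choice, the threshold, and all names and data are illustrative assumptions rather than the measure used by the servers mentioned above.

```python
from itertools import combinations


def agreement(profile_a, profile_b):
    """Fraction of genomes in which both proteins are present or both absent."""
    if not profile_a or len(profile_a) != len(profile_b):
        raise ValueError("profiles must be non-empty and cover the same genomes")
    matches = sum(1 for x, y in zip(profile_a, profile_b) if x == y)
    return matches / len(profile_a)


def co_occurring_pairs(profiles, threshold=0.9):
    """Return (protein_a, protein_b, score) for pairs at or above the threshold."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        score = agreement(profiles[a], profiles[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Toy profiles over six genomes: 1 = ortholog present, 0 = absent.
profiles = {
    "protA": [1, 1, 0, 1, 0, 0],
    "protB": [1, 1, 0, 1, 0, 0],
    "protC": [0, 1, 1, 0, 1, 1],
}
print(co_occurring_pairs(profiles))  # [('protA', 'protB', 1.0)]
```

A protein that is present in every genome agrees trivially with every other ubiquitous protein, so informative profiles need a mix of presences and absences; this is one reason a sufficient number of diverse genomes is required.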
In general, the effectiveness of genomic context methods is much higher for prokaryotes than for eukaryotes. This follows not only from fewer fully sequenced eukaryotic genomes but also from the organization of genes in prokaryotes (e.g., their collection in operons), which is more suitable for the application of genomic context methods.
The structure of a protein determines the interactions that it can perform with other molecules. These interactions are the basis of the functional role of the protein. Methods that deduce protein function directly from structure are still evolving. Automating this process is quite difficult: on the one hand, the same fold - originating from gene duplication or convergent evolution - may realize different functions. On the other hand, the same function (e.g., catalysis of a specific reaction) can be attained with various protein folds.229 This ambiguity can be eliminated with the help of detailed orthology analysis (see Section 188.8.131.52). Furthermore, remote homologies can suggest a functional relationship. For example, two remotely homologous enzymes can share the same reaction mechanism, while differing in substrate specificity.230
The comparison of protein structures provides the methodical basis for analyzing the similarities and evolutionary relationships of proteins in terms of structure, similar to alignment methods that analyze the evolutionary relationship of proteins in terms of sequence. The conventional approach to structurally comparing proteins is to superpose their (rigid) structures. First, the two protein chains are aligned structurally (i.e., pairs of amino acids, each from one of the proteins, are matched if they occupy corresponding spatial locations in the protein structures). In principle, the algorithms used for sequence alignment (see Section 184.108.40.206) are also applicable to this task. However, the scoring function has to be adapted, since now it no longer measures sequence evolution but structural similarity. The simplifying assumption of sequence alignment, which treats residues as independent of their neighbors in the sequence, cannot be made either, since residues that are adjacent in space influence each other.
There are several automatic structural superposition methods that were used to create structural protein classification databases. The database CATH231 provides one of the most popular structural classifications of proteins; it has been derived automatically and uses the structural superposition method SSAP.232,233 The only structural protein classification of comparable popularity is SCOP. This database actually achieves a higher classification consistency, because it is curated manually.234 Other protein structure superposition methods include DALI/FSSP,235,236 CE,237,238 and FATCAT.239 New algorithms also take the structural flexibility of proteins into account.240 As in sequence alignment, it is also possible to multiply align protein structures.241 Recent reviews provide comprehensive overviews of the available methods.242,243
The structural similarity of functionally related proteins allows for the derivation of structural motifs from their (structural) alignments; the most prominent motifs are specific ligand-binding sites. Various algorithmic methods for detecting such motifs have been developed.244-248 Attempts to calculate the statistical significance for the occurrence of a motif have also been made, since this is a critical issue in structural comparison, as it is in sequence search.246,249
Since the functional sites must be accessible for interaction partners, they are usually located at the protein surface. Therefore, the analysis of the protein surface may help identify functional sites - a significant contribution toward elucidating protein function. Early methods for searching for functional sites identify geometric features such as clefts and cavities along the protein surface.250 This approach has been extended to considering the physicochemical properties of the protein surface in order to improve prediction accuracy.251-253 Other approaches utilize the evolutionary conservation of functional residues,254-256 and Jones and Thornton provide an overview of recent developments.257 The rating of residue conservation among homologous proteins requires an accurate molecular clock measuring the speed of evolution.258 The ConSurf server provides an interface for an online annotation of the protein structure with the residue conservation that is calculated among the protein and its homologs.259
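A very simple proxy for residue conservation is the Shannon entropy of each column in a multiple alignment of homologs. This is a swapped-in simplification, not the rate-based calculation described above: it ignores the phylogeny and the evolutionary distances between the sequences. The alignment and function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def conservation_profile(alignment):
    """Entropy per column; lower values indicate stronger conservation."""
    return [column_entropy(column) for column in zip(*alignment)]


# Toy alignment of four made-up homologous fragments.
alignment = ["GKS", "GKT", "GRS", "GKA"]
profile = conservation_profile(alignment)
print(profile)
```

In this toy case the first column is invariant, the second has one substitution, and the third is the most variable. Mapping such per-residue values onto the surface of a known structure is what highlights candidate functional patches.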
The literature is the most comprehensive source of information on protein function. However, this information is not easily accessible by computer. Text mining is the computer-based approach used to harvest this information. The two major classes of text-mining methods address the problems of information retrieval and information extraction, respectively. Information retrieval calls for the selection of documents according to some user-defined criteria. In contrast, information extraction accesses pieces of information comprising prespecified types of events, entities, or relationships from documents. Information extraction is much harder than information retrieval. Early experience in text mining has been gathered mainly in the newswire domain, but the insights gained there are not directly applicable to text mining in the protein function context. The main reason for this is that articles published in the newswire domain usually target a general audience, whereas biomedical articles usually aim at a small community of domain experts, and, thus, are more difficult to interpret. Current efforts mainly aim at adapting established text mining methods to the characteristics of the biomedical literature. The available approaches employ natural language-processing techniques, ranging from direct pattern-matching approaches260 to customization of established natural language-processing systems.261
The reliable extraction of relationships between proteins, genes, diseases, and other biological entities is a central goal of text-mining methods in the biological context. There are two main classes of problems that make progress in this field difficult. The first comprises general natural language problems, which include tracking references to the same object consistently throughout the text and precisely detecting hypotheses and their negations. The second covers domain-specific matters, for instance the highly variable nomenclature for genes and proteins. Nevertheless, recent approaches that specialize in certain tasks have proved to be quite effective. Such methods focus on the extraction of facts concerning the subcellular localization of proteins262 and protein-protein interactions,261 for example. One approach that relies on machine-learning techniques has recently been reported, combining information retrieval and extraction for the determination of protein-protein interactions.263 The system described there helps maintain the BIND database of protein interactions, among other applications. Further information can be found in recent reviews.264-267
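A direct pattern-matching extractor, the simplest of the approaches mentioned above, can be sketched with a regular expression. The pattern, the verb list, and the example sentences are illustrative assumptions and do not reproduce any cited system; the protein names are used only as tokens, not as biological claims.

```python
import re

# Hypothetical pattern: capitalized token, interaction verb, capitalized token.
INTERACTION = re.compile(
    r"\b(?P<a>[A-Z][A-Za-z0-9]+) "
    r"(?P<verb>binds to|interacts with|phosphorylates) "
    r"(?P<b>[A-Z][A-Za-z0-9]+)\b"
)


def extract_interactions(text):
    """Return (protein, verb, protein) triples matched by the pattern."""
    return [(m.group("a"), m.group("verb"), m.group("b"))
            for m in INTERACTION.finditer(text)]


text = ("We show that Cdc28 interacts with Cln2. "
        "We found no evidence that Rad53 binds to Dun1.")
print(extract_interactions(text))
# [('Cdc28', 'interacts with', 'Cln2'), ('Rad53', 'binds to', 'Dun1')]
```

The second triple is extracted even though the sentence negates it, which illustrates the negation problem named above; synonym handling for gene and protein names is likewise absent from such a pattern.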
Whereas information retrieval methods have reached performance levels that justify their immediate use,263 the results provided by information extraction methods must be used with caution. Method improvement is necessary. Furthermore, the field would benefit greatly from an evaluation standard. Such a standard would also help in determining the algorithms most suitable for retrieval and extraction tasks. The pace of idea exchange and the development of an evaluation standard can be accelerated by the launch of critical assessment contests.268 The established TREC conference on text-mining methods supports these efforts, and has started a genomics track. Launched in 2003, the BioCreAtIvE (Critical Assessment for Information extraction Systems in Biology) competition assesses text-mining methods in the biomedical domain.
The protein function prediction methods described in this section are based on a variety of data, including sequence data, structure data, protein interaction data, literature data, mRNA expression data, and genomic context data. Each kind of data gives a different type of hint about protein function, and, typically, the reliability and specificity of the prediction based on a single kind of data is quite limited. This suggests integrating the weak or partial hints coming from each data source in order to arrive at a more comprehensive, specific, and reliable prediction of protein function. The situation is rather like that of having to identify a criminal: hints from all kinds of backgrounds, such as circumstantial evidence, witnesses, family background, and personal history, have to be integrated in order to produce a ranked list of suspects.
The integration of information on protein function has become an important field in computational biology. Methods have been reported that evaluate protein interaction data in combination with mRNA expression data.269,270 Marcotte et al. presented the first study integrating several genomic context methods with primary experimental data and mRNA expression data.271 The Bork group followed with two studies, one concentrating on genomic context methods272 and the other on protein interaction data.273 More recently, these researchers provided the database STRING, containing precomputed functional associations based on genomic context data.274 Phylogenetic profiles seem to be the most powerful genomic context method. Domain fusion analysis contributes strong signals, but it is applicable only to a small subset of proteins. Surprisingly, mRNA expression data and protein interaction data carry comparatively little signal. The STRING server has been used to recover functional modules in E. coli.275 Protein interaction data have been put into the context of other sources of signals for protein function.273,276,277 All of these studies indicate that, when integrated, the methods are much stronger than any single method by itself.278-280
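One simple way to see why integrated evidence outperforms any single source is to combine per-source confidence scores under an independence assumption: the combined score is one minus the probability that every source is wrong. This scheme is an assumption chosen for illustration and is not claimed to be the method of the cited studies; the scores and names are invented.

```python
def combine_evidence(scores):
    """Combine independent confidence scores in [0, 1] as 1 - prod(1 - s)."""
    remaining_doubt = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        remaining_doubt *= 1.0 - s
    return 1.0 - remaining_doubt


def rank_associations(evidence):
    """Rank protein pairs by combined score, strongest first.

    evidence: {pair: {source name: score}}.
    """
    return sorted(
        evidence,
        key=lambda pair: combine_evidence(evidence[pair].values()),
        reverse=True,
    )


# Toy evidence table with made-up scores per data source.
evidence = {
    ("protA", "protB"): {"phylogenetic profile": 0.5, "gene neighborhood": 0.5},
    ("protA", "protC"): {"mRNA expression": 0.6},
}
print(combine_evidence([0.5, 0.5]))  # 0.75
print(rank_associations(evidence))   # [('protA', 'protB'), ('protA', 'protC')]
```

Two moderate hints (0.5 each) outrank a single stronger one (0.6), mirroring the detective analogy. Real data sources are rarely independent, so correlated evidence would be overcounted by this formula.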