Molecular Sequence Analysis

Biological, pharmaceutical, and medical applications do not only require sequences, they need a functional description of organismic, especially human, life. The completion of genome sequences or even creation of rough drafts is a crucial step toward this goal, but there is still much more work to be done: the genome has to be mined for genes and for regions that regulate gene expression. Heritage and genetic variations between single individuals have to be analyzed, to assist the study of the relationship between diseases and genes. There are many computational tools that provide varied support for these efforts. Basic software tools that concentrate on the computational analysis of genome sequences are compiled in utilities such as the Staden Package13 or the Wisconsin Package (GCG).14

3.15.3.1 Sequence Alignment

Sequence alignment is at the basis of analyzing the evolutionary and functional relationships between molecular sequences. An alignment of two sequences is an arrangement of both sequences, one written below the other. Matching sequence positions are written one on top of the other. (If two characters in such a column are different, we speak of a 'mismatch'; if one sequence is missing a character, a gap character (' —') is used. Figure 1a illustrates a

Hemoglobin alpha-1 Hemoglobin beta

Hemoglobin alpha-1 Hemoglobin beta

Hemoglobin alpha-1

Hemoglobin beta (a)

1 MVLSPADKTNVKAAWGRVGAHAGEYGAEALERMPLSFPTTKTYFPHF-D 48 :. I: I. : I 1.1.11 11 :.: 1. 1: 11 I . I : : :.: I.I :. : I. . t |

1 MVHLTPEEKSABTALNGRV--NVDEVGGEALGRLLVVYPNTQRFFESFGD 48

49 LS H GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLR 93

II. I : :: I I : [I I I I . : I :: : :: I I : I I I : I I . I II :

49 LSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLH 98

94 VDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR 142

I 11.1 h I I f :. I: I I. I :,, 11 I Ii Ii I ;t.il ! I IJ. II :

99 VDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH 147

Myoglobins <

Hemoglobins penguin chicken human gorilla - -chimpanzee kangoroo opossum rabbit cattle hi rat mouse turtle human_alpha-1 nv-| rat_alpha-1 human_beta rat_beta nvm B

havljtp hzvlkr:.

LlRUxq I LIRLPFGHtE lislp:'G«i b UVLIRLPFGltl

LVL LVL

LVLKAHGKVSÀeiPOMiVLIHLrMipe LV1JÎVHGKVBAD1.AC1IG0ÎVL ÎR1P :■! I «PB

BDI.RKBOATvJofcGK Î1. ■ÄfEDLKKStT MVlMtGQJ I !NKA| : FJCK3AJVI.TALGG: L (.MFMCKKUCftESEMKASEDLMCaGATVL TALGG: L LBirDKPItl LKSKBK«KnSHDLKXBOI|vJ*MNlt LiEK PUCRK LKSEDKMK KDLXKBGAfVLlfça n|b LEKFCF.FF^lltSEB£?IX.=,SEßLXX»i;:a\rLTALaAIL

VLNAHGKVBADVAGHGaiVL IB L F KiipBTLSI! KtJX b*XK£Jc|iA*iaiAlEDLXXKjHfVLTALGG. ui I «cnrEGDLAGH z ils lffjiB i ETiEx:-nxFXKUcsEEEiu:siE2LixH<!c|vLTALaT: lvlnvnokvsadlkcb^ jevliolpf. tb! etil it pc ï f f.n lki iei-1u[c;£el _f.f_hli 3vltalgi. blk-gïmaicvhfotsajîdoev-i ihlïivtlt et ¿erfaxpx-vlxjl delr .-"seeyf.f.h ; 7tvltalq.

<ktn]î!uahckvi»iiaoetaafaleiuiplsiptwervfpht-d|5h g*afvxgebkkva:al

|*TKÏK*CHnKlO<3KOOE»aEEAI.0SJlïAAPCTlKTYÏSHJ-DfsP-- OiAOPUilGI'irVArAL.V: ksavf.mhgxvkvo -ekcokalcihl:/-."."*! v.toirf^srgdt.s|p|^vmbnp.fvica|iai!xvlgaf.-kajwnof.maniipi- • d1w!g£al01lbtvvl'>w|orviflstqdts|asalmgn pk wagokk vi ku fs |

ruler

Myoglobins

Hemoglobins penguin chicken human gorilla chimpanzee kangoroo opossum rabbit cattle rat mouse turtle human_alpha-1 rat_alpha-1

li'varuFii

HATKKKI '.'KÏI.ÏPISBC: ATFHKI j'VKYLEPI SSCI KI;VHYLBPI«BCI

XI.VCFLBH XIFVHTLEFISEA: ANXBX1IVSTLBFIS IflîUAgEHATXHK: : vkylefis ig: LAÇÎHA7XKKI : VF.YLEFISBII UtrLABCliATiaiK: : vKYLSrieill i mhhkuiv: p rarx Li.s -

NKALI1 rRKMMKfKILGr GADAOGMf r-XA II t f S ■ DK» S'i YXBLGP «MAOSAKMCAlELHI'OKASNWMLOr prOADAOAAK'.XALBLFRHBKAXKMCBPGr 'FCGDAQAAHGKALELFHiiDKAnXYXBLGF

PCAIAOAAW-XALELPFJCilA-OYXELCP 5DFCAHAÜAA!(:'XA1E L PRJiDMA.\aWVI,GFHO ODPGADAO-A«; XALELFaaBlAiKYXELaFfr-; GDPCADABSAH-KALBLFSjraiAAnXELGFOG s|FOAO:~BAAN!.KALBLFI!|lDHA8XY*BP0Pi0 AMrPAyHA^LL.KFLASBSTVtTMn- - - - - -B ■pAgu^toerBAsi^niTMBi

human_beta AHLtKtKQTrATlISililiEICeKiDPEiilJltLGirVI.VCVt.AHKFG-' BTPPUBWlC'lCVVAGVAfAtAKltiH - - -rat_beta ruler

154 153 153 153 153

154 154 154 153 142 142 147 147

Figure 1 (a) Global alignment of the human hemoglobin a1 and hemoglobin b sequences. The numbers on the left and right sides of the alignment represent sequence positions at the end of each row. The signs in the middle row signify similarities of matched amino acid residues: |, identical; :, similar; ., not similar. (b) Multiple alignment of several myoglobin and hemoglobin sequences. The alignment was generated with ClustalW. The colors represent chemical groups of amino acid residues. The bar graphs represent the degree of conservation of the alignment columns.

sample pairwise alignment of the human hemoglobin ai and b chains. Figure 1b shows a multiple alignment of myoglobin and hemoglobin sequences from different organisms. Depending on the context, matches/mismatches can have varying meanings. They can indicate evolutionary, structural, or other correspondence. In evolutionary correspondence, the two sequence positions can be referred back to the same sequence position in a (lowest) common ancestor. In structural correspondence, the two residues lie on top of each other in the structural superposition of the two proteins. A gap signifies an insertion into the sequence that does not have the gap character or a deletion from the sequence that has the gap character. We cannot tell which of the two is the case since the alignment model is symmetric with respect to the relationship between the sequences. Thus, alignment columns with gap characters are also called 'indels.'

Sequence alignment algorithms are based on probabilistic models for the occurrence of positional (mis-)matches.15 The probability that the two sequences stem from a common origin in the way that the alignment represents is the product over all alignment columns of the probabilities of seeing the (mis-)matches or indels in these columns. The match probabilities are derived from some database of match events. In the evolutionary context, the database contains manually curated multiple alignments that are supposedly biologically correct16'17. Usually it is assumed that the (mis-) match probabilities only depend on the nature of the affected nucleotides or residues and are independent of their neighbors. A string of adjacent indels (called gaps) is treated as a unit. The alignment that has the highest probability is assumed to be the biologically correct one according to this model. Early approaches arrange the costs for (mis-)matches into so-called substitution matrices. For protein sequences, the PAM and/or BLOSUM matrices are the most widely used. The number that is part of the matrix name is an indication of the targeted evolutionary distance. For example, PAM-250 is a mutation matrix for sequences that have an evolutionary distance that allows for 250 mutations (PAM, point accepted mutations) in every 100 sequence positions, on average. Since the number of mutations increases with rising evolutionary distance, the correct choice of the substitution matrix is essential for the quality of an alignment. Recently, a method for scoring sequence alignments has been presented that adaptively determines the evolutionary distance of the aligned sequences.20 Usually, the cost of deletions and insertions are assumed to depend linearly on the gap length, and their values are determined by heuristics or fitted to a data set of trusted alignments.21

Using such a cost theme, algorithms are able to solve the optimization task very efficiently via the dynamic programming paradigm.22 They can be used to either align complete sequences (global alignment) or only some parts of both sequences (local alignment). In the case of pairwise alignments, the runtime of these algorithms grows linearly in the product of the lengths of the two sequences. This is acceptable for single alignment queries, but unacceptable for similarity searches in large databases. For database searches there are other algorithms, of which BLAST is the most popular.23 This heuristic algorithm is based on gapless alignment, and tries to optimize a local similarity measure. Its popularity does not only arise from its speed but also from the fact that it provides a significance value (p value). BLAST is a rare example of a major bioinformatics tool that provides a confidence estimate for its output that is based solid statistical theory.24

Things become more complicated when many sequences must be aligned. Extending the cost scheme for pairs of sequences to an all-pairs cost for aligning multiple sequences leads to a straightforward extension of the known dynamic programming algorithms. Unfortunately, the runtime of such an algorithm grows exponentially with the number of sequences to be aligned, which disqualifies it for most applications. Again, one has to rely on heuristics. One such approach is to compute all pairwise alignments and then assemble the alignment incrementally, starting with the most similar sequences. In each step we take a cluster of already multiply aligned sequences and align them with a simple modification of the pairwise sequence algorithm that allows for two multiple alignments to be aligned. Here, in effect, each sequence position does not present a unique nucleotide (in DNA/RNA), amino acid (in proteins), or gap character, but a profile of choices, derived from the respective multiple-alignment column. Aside from that scoring and algorithm are essentially the same as in pairwise sequence alignment. This version of multiple-sequence alignment is called 'progressive' alignment. ClustalW25,26 is a common multiple-alignment program that works this way.

Another approach to multiple alignment maps the common pattern of a sequence family to the parameters of a so-called hidden Markov model (HMM).27 An HMM can be trained to a given set of sequences. A new sequence can be aligned to the trained model. Doing this for a number of sequences, in turn, directly retrieves the multiple alignment -in linear time in the number of sequences. In essence, the process is tantamount to computing a set of pairwise alignments (sequenced against the HMM), where the scores depend on the sequence position. Such a scoring system is based on a 'position-specific scoring matrix' (PSSM). The training process learns the characteristics of each sequence position, and codes them into the parameters of the HMM.

T-COFFEE,28 DIALIGN,29,30 and MUSCLE31 are other popular multiple-alignment programs.

Sequence alignment methods provide the algorithmic basis for a great variety of sequence analysis methods that are used for, for example, the discovery of genes, the analysis of regulatory regions,32 or the construction of phylogenies. They can be applied to proteins to discern their evolutionary18,33,34 or structural relationships,35,36 or to classify them into families.37-39

3.15.3.2 Phylogeny Construction

A phylogeny describes the evolutionary relationships between molecular sequences using a treelike structure. In general, the 'leaves' are labeled with known sequences of some of the observed organisms. Edges denote ancestral relationships between the nodes of the tree, in the ideal case their length measuring the ancestral distance in terms of time. Interior nodes represent assumed common ancestors whose sequences are unknown. Figure 2a shows a phylogenetic tree of the myoglobin sequences used in Figure 1b. Phylogenetic trees come in undirected and directed versions. Most phylogenies are undirected, meaning that we cannot tell which of two interior tree nodes joined by an edge is the ancestor and which the descendant. Among many others, PHYLIP and PAUP* are two widely used software libraries for constructing phylogenies.

Penguin

Chicken

Turtle

Kangaroo

Turtle

Chicken

Mouse

Gorilla Human Chimpanzee

Penguin Chicken

Penguin Chicken

Kangaroo o

Mouse

Rabbit

Opossum0 Human, gorilla, chimpanzee^°Cattle

Figure 2 (a) A phylogenetic tree of the myoglobin sequences aligned in Figure 1b. (b) A split diagram of the myoglobin sequences aligned in Figure 1b.

Methods for inferring phylogenies from sequence data can de divided roughly into distance-based methods, maximum-parsimony methods, and maximum-likelihood methods.

The distance-based approach deduces the evolutionary distance from the pairwise similarity of the sequences, either measured by the score of a pairwise alignment of all involved sequences or by other techniques (e.g., based on the maximum-likelihood approach).22 The scores are arrayed in a distance matrix. Although, in many cases, these distances cannot be mapped accurately onto a tree, the tree generation is guided by this matrix. The tree that fits the distances best is hard to find - the problem is 'NP-hard.' Thus, we have to search for proper approximations of the optimal solution. Like others, the program ClustalW (see Section 3.15.3.1) constructs a multiple alignment and builds a phylogenetic tree bottom-up by successively clustering the most closely related sequences.

The maximum-parsimony approach is based on the reasonable assumption that evolution introduces as few changes as possible to generate the observed sequence data. In this model, every mismatch between two directly related sequences (connected by a single edge) incurs some costs. The generation of a tree then aims at minimizing these costs.

The maximum-likelihood approach is based on a probabilistic model of evolution. Here, the phylogeny with the highest probability is searched for, given the observed sequence data. This approach is much more compute-intensive than the parsimony-based approach, but has the advantage of being founded on a statistically more sound and generalizable theoretical basis.40

All three approaches have in common that they have rather limited abilities to model the highly complex mechanisms of evolution. Specifically, the hypothesis that the phylogeny can be modeled in a tree structure is not valid in many biological settings because, for example, gene duplications, horizontal gene transfer, or recombination events can take place. The split-decomposition method takes account of the fact that the distance measure derived from the sequence alignments need not be treelike.41 The method generates so-called split diagrams that are more treelike if the underlying distance measure allows for this, but which exhibit deviations from the tree shape where the data suggest. Figure 2b shows a split diagram of the myoglobin sequences from Figure 1b. There are plenty of articles elucidating and discussing the current state of the field of phylogeny construction.42-44

3.15.3.3 Finding Genes

The overtly most interesting parts of a genome sequence are the regions that code for proteins, our genes. Genes make up only 1.5% of the human genome. Figure 3 illustrates the complex structure of a eukaryotic gene. All parts of this structure have to be identified accurately.

There are two basic approaches to identifying genes: 'contents sensors' and 'signal sensors.'45 Contents sensors classify DNA segments into types such as 'coding' and 'noncoding.' The homology-based (or extrinsic) content sensor does this with a comparison to known sequences using sequence alignment. With the availability of a rising number of genomes, one can search for regions that are more conserved than others, and thus may be more likely to a have a

Regulatory region

Enhancer

Promoter

Pre-mRNA

Figure 3 The structure of a eukaryotic gene. Only the shaded parts represent coding regions.

Poly-A signal

5' noncoding 3' noncoding

Figure 3 The structure of a eukaryotic gene. Only the shaded parts represent coding regions.

function or be a coding region.46 Attempts to even capture the evolutionary development of genes have been made.47 Without the knowledge of related sequences, the genes have to be identified with ab initio (intrinsic) approaches. Here, the target sequence is scanned for specific patterns such as start codons, CpG islands, or unusual codon frequencies. With these patterns, either statistical classifiers,48'49 HMMs,50'51 or neural networks52 can be used to uncover the structure of a gene.

Using the best of these methods, the accuracy for predicting coding nucleotides in mammalian sequences reaches 93%.53 The detection of exons is even better, but the prediction of their correct boundaries is much more difficult, and as the exons become shorter the situation gets worse. The accuracy at the gene prediction level is even lower - below 80%. With increasing length of introns, the number of exons that form a gene, and the length of the noncoding regions, the accuracy of gene prediction drops further - below 50%.54 Obviously, there is still room for improvement.

3.15.3.4 Analyzing Regulatory Regions

Regulatory regions are regions in the genome that do not code for proteins but instead control the expression of other, coding, regions. Regular regions occur in the vicinity of coding regions. The promoter regions, located upstream of the genes (Figure 4), are the most prominent type. They are usually 10-15 nucleotides long. They harbor binding sites for the transcription factors (TFs), proteins that initiate DNA transcription. Other regulatory regions include binding sites for coactivators, coexpressors, or factors that regulate epigenetic modifications of the DNA.55 They span only a small stretch of DNA (10-15 nucleotides) but hold certain motifs.

Methods for finding and delineating regulatory regions suffer from the same drawbacks as gene-finding strategies. The situation is even worse here: because regulatory regions are not expressed, there are no negative images such as the mRNA for the genes that could imply their boundaries. Similar to the gene-finding problem, intrinsic methods are applicable. Additionally, the overrepresentation of certain motifs in the regulatory region of genes that are known to be regulated by the same transcription factors provides an opportunity to use comparative methods.56,57 Roughly speaking, most of the methods currently used in detecting such motifs are either alignment based or consensus based. The consensus-based approaches usually enumerate all possible motifs of a given length, and rank their frequencies in the given set of regulatory sequences. Current tools have been estimated to find about half of the existing promoters, but their specificity is still low, implying a significant number of false positives.58

Enhancers, silencers, and matrix-attached regions are other types of regulatory regions for which predictions have been attempted with similar technology. Compared with protein-based bioinformatics, the analysis of regulatory regions is still at an early stage of development. It is especially difficult because the processes of molecular recognition based on proteins binding to DNA are very intricate and cannot yet be analyzed at a structural level. Thus, we have to derive all information from the rather indirect methods of sequence analysis and comparison.

3.15.3.5 Finding Repetitive Elements

As mentioned in Section 3.15.2, repeats in the DNA sequence are a stumbling block for genomic sequence assembly. Despite the fact that a significant part of the human genome consists of repeats, much of their origin and function is still uncertain. (There are theories that many of these regions are the result of endogenous viruses and selfish genes.59) It is assumed that up to 50% of the human genome has a repetitive nature.60

There are numerous tools that detect and mark repeats in strings; some target biomolecular sequences explicitly. When they are used as a first step in genome analysis, the detected repeats are usually masked (i.e., excluded from

r"

it V

24 27

rir^

35 34 45 V

34 45

^CCVICU^MK

E23_5

* fl i .AUWSAiftAACACCA

C" £ \ " UUWCCOUOCU 0 A<JKMAAJOCC«i AW«*««^

20 2 48

Xi c"trC

15 15

13 7

12 12

A A AU AVAIII

Figure 4 Secondary structure of the RNA from the small ribosomal subunit of Tetrahymena bergeri. The fat lines denote pseudoknots.

further analysis). A cautious use of automation is still vital, because repeats can lie in functional regions. RepeatMasker and similar tools scan the target sequence for exact or approximate matches to patterns in a dictionary of known repeats. RepeatMasker scores the similarity between match and pattern, and uses a heuristically determined threshold to tell repeats apart from unique fragments. This technique can only rediscover known repeat patterns, and the use of threshold bears the risk of false predictions.

The task of finding repeats in a sequence is much harder when there is no supporting information or prior knowledge (e.g., the length of the repeat, or motifs). Tools with this ability usually search for exact repeats first, and then for so-called degenerated repeats (slightly changed), using the exact repeats as query sequences. A popular algorithmic technique (used by REPuter61 and MUMmer, for example)62 organizes the data in a structure called a suffix tree, allowing fast detection of exact repeats.

Some programs specialize in finding an important type of repeat - the tandem repeat.63 This repeat does not allow for insertions between the original sequence and its copy, and is thus easy to find. The visualization of the detected repeats and their relation is an important feature for the functional study of repeats.

Repeat-finding programs can be used for a multitude of tasks:

• Checking genome assemblies. Assembly programs (see Section 3.15.2) are imperfect, and thus the sequence assemblies they produce may contain errors. Repeat-finding programs have detected palindromic repeats in the human genome sequence that were due to wrong assembly.

• Identifying low copy repeats in human diseases. Several human diseases are associated with deletions or duplications of specific genomic regions. Such repeat patterns can be identified by a repeat-finding program.

• Checking the uniqueness of hybridization probes. This application is technological. On a microarray we want to deposit sequence probes that are unique in the genome under consideration. This uniqueness can be ensured with a repeat finder.

• Comparative genomics. Concatenating two different genomes and looking for repeats with varying rates of divergence allows comparison of the two genomes at different grades of similarity.

3.15.3.6 Analyzing Genome Rearrangements

In addition to mutations and recombinations that shape a species, there are also large-scale rearrangement operations in genomes that largely act at a chromosomal level and lead to differences between species. As long as only one chromosome is affected, these are translocation (integrating some sequence fragment at another location) and reversal (the same as translocation, but with integration in the oppositely directed DNA strand). Fission and fusion operations enable the interchange of segment fragments between different chromosomes.

Genome rearrangements can be analyzed by comparing two or more genomes computationally. The knowledge of correspondences between parts in both genomes can help reveal the time sequence of rearrangement operations. This scenario is similar to that faced by sequence alignment algorithms (see Section 3.15.3.1), with the difference that the mutation, insertion, and deletion sequence operations have been changed to genome rearrangement operations. Again analogously, some costs are attributed to the genome rearrangement operations. Then, the sequence of operations with the lowest cost that transforms one genome sequence into the other is searched for. The respective algorithms make the assumption that rearrangements occur at the level of large conserved blocks.64,65 This assumption may not be justified, however: recent comparative investigations between the genomes of the human and mouse have revealed that genome rearrangements occur on different scales. Intra-chromosomal micro-rearrangements affect sequence segments of less than 1 Mb in length; macro-rearrangements affect much longer sequence segments, and can occur inside a chromosome as well as between chromosomes.64 The rearrangement problem is algorithmically more complex than that of sequence alignment.65,66

As the number of available mammalian genomes has increased rapidly in recent years, the idea to infer phylogenies from their pairwise rearrangement history has gained popularity.67,68 Additionally, useful insights into the relationship between genome rearrangement and disease can be gained.69

Was this article helpful?

0 0
Lower Your Cholesterol In Just 33 Days

Lower Your Cholesterol In Just 33 Days

Discover secrets, myths, truths, lies and strategies for dealing effectively with cholesterol, now and forever! Uncover techniques, remedies and alternative for lowering your cholesterol quickly and significantly in just ONE MONTH! Find insights into the screenings, meanings and numbers involved in lowering cholesterol and the implications, consideration it has for your lifestyle and future!

Get My Free Ebook


Post a comment