BASIC PROTOCOL: USING THE NEIGHBOR PROGRAM FROM THE PHYLIP PACKAGE TO CONSTRUCT A PHYLOGENETIC TREE

This protocol describes the use of NEIGHBOR (see Fig. 6.3.1), included in the PHYLIP 3.6 package, which is distributed by Joe Felsenstein (University of Washington) and is one of the most widely used software packages in phylogeny studies. NEIGHBOR is the PHYLIP implementation of Neighbor Joining (Saitou and Nei, 1987). Distance estimation is performed using DNADIST or PROTDIST (Support Protocols 1 and 2). To accomplish the bootstrap procedure, first resample the sites using SEQBOOT (Support Protocol 3), then apply DNADIST or PROTDIST, run NEIGHBOR, and extract the bootstrap tree using CONSENSE (Support Protocol 3). Finally, the resulting tree can be drawn using a program such as TreeView (UNIT 6.2) or NJplot (Perriere and Gouy, 1996).
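To make the Neighbor Joining principle concrete, the following is a minimal Python sketch of the agglomeration scheme of Saitou and Nei (1987). It is not the NEIGHBOR source code: the function name, the handling of ties (the first minimal pair is kept), and the output formatting are assumptions made for illustration only.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return an unrooted Neighbor Joining tree as a Newick string.

    names: list of at least three taxon labels.
    dist:  square list of lists, dist[i][j] = distance between taxa i and j.
    """
    # Store distances in a dict keyed by node label so nodes can be merged.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Select the pair minimizing the Neighbor Joining criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Branch lengths from the two selected nodes to their new parent.
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        new = f"({i}:{li:.5f},{j}:{lj:.5f})"
        # Distances from the new node to every remaining node.
        d[new] = {}
        for k in nodes:
            if k not in (i, j):
                d[new][k] = d[k][new] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    # Resolve the last three nodes around a central trifurcation.
    a, b, c = nodes
    la = 0.5 * (d[a][b] + d[a][c] - d[b][c])
    lb = 0.5 * (d[a][b] + d[b][c] - d[a][c])
    lc = 0.5 * (d[a][c] + d[b][c] - d[a][b])
    return f"({a}:{la:.5f},{b}:{lb:.5f},{c}:{lc:.5f});"
```

When the input distances fit a tree exactly, this procedure recovers that tree and its branch lengths, which gives a simple way to check the sketch on a small hand-made matrix.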

Necessary Resources

Hardware

PHYLIP executables are available for pre-386 DOS, 386/486/Pentium DOS, Windows 3.1, Windows 95/98/NT, 68k Macintosh, or PowerMac. The PHYLIP C source code is also available for Unix, Linux, or VMS systems.

Software

PHYLIP is available for free from http://evolution.genetics.washington.edu/phylip.html. The package contains C source code, documentation files, and a number of different types of executables. Its Web page contains information on PHYLIP and ways to transfer the executables, source code, and documentation. The documentation is remarkably clear and complete, and provides a number of useful references.

Files

NEIGHBOR requires a distance matrix (or a set of distance matrices when the bootstrap procedure is used), which is estimated by DNADIST (Support Protocol 1) or PROTDIST (Support Protocol 2) from a multiple sequence alignment. The file contains the number of taxa on its first line. Each taxon starts a new line with the taxon name, followed by the distances to the other taxa, and there is a new line after every nine distances. Taxon names have ten characters and must be blank-filled to be of that length. The default matrix format is square (Fig. 6.3.2) with zero distances on the diagonal. In the case of multiple matrices, as obtained with the bootstrap, matrices are given in the same format one after the other, without omitting the number of taxa at the beginning of each new matrix.
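The file layout described above can be sketched in Python as a small reader and writer for square matrices. The function names and the numeric field width are assumptions chosen for illustration; only the layout itself (taxon count on the first line, ten-character names, continuation lines, matrices placed one after the other) comes from the description above.

```python
def read_distance_matrices(text):
    """Parse one or more square distance matrices from a string.

    Returns a list of (names, rows) pairs, one per matrix.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    pos = 0
    matrices = []
    while pos < len(lines):
        n = int(lines[pos].split()[0])        # number of taxa
        pos += 1
        names, rows = [], []
        for _ in range(n):
            names.append(lines[pos][:10].strip())   # ten-character name field
            values = lines[pos][10:].split()
            pos += 1
            while len(values) < n:            # distances continued on next lines
                values += lines[pos].split()
                pos += 1
            rows.append([float(v) for v in values])
        matrices.append((names, rows))
    return matrices


def write_distance_matrix(names, rows, per_line=9):
    """Format one square matrix, starting a new line after per_line distances."""
    out = [f"{len(names):5d}"]
    for name, row in zip(names, rows):
        cells = [f"{v:10.6f}" for v in row]
        chunks = [cells[i:i + per_line] for i in range(0, len(cells), per_line)]
        out.append(name[:10].ljust(10) + "".join(chunks[0]))   # blank-filled name
        out.extend("".join(chunk) for chunk in chunks[1:])
    return "\n".join(out) + "\n"
```

Concatenating the output of write_distance_matrix for several matrices gives a multiple-matrix file of the kind used in the bootstrap procedure.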

1. Download and install PHYLIP according to the program documentation (see Necessary Resources, above).

2. Generate a distance matrix for the multiple sequence alignment of interest by running either DNADIST (for DNA sequence alignments; see Support Protocol 1) or PROTDIST (for protein sequence alignments; see Support Protocol 2).

3. Begin a NEIGHBOR session in PHYLIP by double clicking on its icon.

4. At the prompt, enter the distance matrix file name and the name for the outfile, which will contain a simple representation of the output tree. The default files are infile and outfile, respectively, but the authors strongly recommend redefining these files to avoid possible confusion or the deletion of previously computed files.

When a file called infile already exists in the PHYLIP directory, NEIGHBOR does not ask for the input file and reads the existing infile. Similarly, the option of renaming the output is only given if a file called outfile already exists. If no such file exists, NEIGHBOR automatically writes the output to a file called outfile.

5. Next, select among the numerous options (see Fig. 6.3.3). In general these should be left at their default values, except for M in the case of the bootstrap procedure. When the options have been set, type "Y" to run NEIGHBOR.

These options are as follows. N defines the method to be used; NJ (the default) should be preferred over UPGMA, which assumes a molecular clock. O specifies which species is used to root the tree; when O is on, the user is asked for the rank of the outgroup species in the input (matrix) file, otherwise the default outgroup is the first species; this outgroup (rooting) species is used in the tree printed in the outfile. L and R have to be switched on when the matrix is not square but lower-triangular or upper-triangular, respectively. S has to be on when the data contain subreplicates; it allows NEIGHBOR to read the input data, but the number of replicates is ignored. J makes it possible to randomize the input order of species; the user is then asked for a "seed"; however, NEIGHBOR is almost insensitive to species ordering. M has to be used in the case of the bootstrap procedure (Support Protocol 3) to provide the number of pseudo-matrices. 0 defines the terminal type; this may affect the ability of the programs to display their menus and results, but the "none" option is usually satisfactory. The 1 and 2 options are used to check the data and the progress of the run; the authors suggest switching them off, notably for large trees and bootstrap studies. When 3 is Yes (the default), the tree or trees are printed in the outfile; this is useful for quickly visualizing trees with moderate numbers of taxa when there is a single data set. When 4 is Yes (the default), the trees are written in Newick format to the outtree file, and can then be drawn using TreeView (UNIT 6.2) or, in the case of multiple data sets, combined by CONSENSE to obtain the bootstrap tree (Support Protocol 3). To change a default value, simply type the option character. For example, typing 2 changes the progress-of-run status from Yes to No, and typing 2 again returns it to Yes.

6. Finally, NEIGHBOR asks for the outtree file, which will contain the tree in Newick format (UNIT 6.2). The resulting tree can be visualized in the outfile, but a better view is obtained by applying TreeView (UNIT 6.2) to the outtree file.

The option of renaming the outtree file is only given if a file called outtree already exists. If no such file exists, NEIGHBOR automatically writes the output to a file called outtree, which may be a source of confusion. Inferred trees are unrooted and written in Newick format (UNIT 6.2). For example, the BIONJ tree in Figure 6.3.4 is made of three subtrees, containing (Candida_tr, Candida_al, and Saccharomy), (Taphrina_d and Protomyces) and (Athelia_bo, Spongipell, and Filobasidi), respectively, as can be shown from its TreeView representation (Fig. 6.3.5; see UNIT 6.2 for discussion of TreeView and Newick). Each subtree is made up of two subtrees or taxa; the numbers in Figure 6.3.4 indicate the branch lengths. Both trees in Figure 6.3.4 have identical topologies (even when the way they are encoded in Newick format looks quite different) but (slightly) different branch lengths.

Applying NEIGHBOR to the matrix of Figure 6.3.2, one obtains in the outfile the tree shown in Figure 6.3.6, while in the outtree file we have the second tree from Figure 6.3.4, in Newick format. This tree is equivalent to that of Figure 6.3.5.
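The remark that two Newick strings can look quite different while encoding the same topology can be checked programmatically. The sketch below compares two unrooted trees through their sets of internal splits (bipartitions of the taxa). It is a simplified illustration: the function names are hypothetical, and the parser assumes plain Newick without quoted labels or comments.

```python
import re


def leaf_sets(newick):
    """Return (set of all leaves, list of leaf sets below each internal node)."""
    text = re.sub(r"\s+", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    stack, clades, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            clade = stack.pop()
            clades.append(frozenset(clade))
            if stack:
                stack[-1] |= clade
        elif tok not in ",;" and prev != ")":
            # A label not preceded by ")" is a leaf; drop its branch length.
            name = tok.split(":")[0]
            leaves.add(name)
            stack[-1].add(name)
        prev = tok
    return leaves, clades


def splits(newick):
    """Return the taxa and the non-trivial splits of an unrooted tree."""
    leaves, clades = leaf_sets(newick)
    ref = min(leaves)                      # fixed reference taxon
    result = set()
    for clade in clades:
        # Represent each split by the side that does not contain ref.
        side = leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            result.add(frozenset(side))
    return frozenset(leaves), result


def same_topology(newick_a, newick_b):
    """True when both trees have the same taxa and the same internal splits."""
    return splits(newick_a) == splits(newick_b)
```

Branch lengths are ignored by this comparison, so two trees that differ only in branch lengths, or in the order in which subtrees are written, are reported as having the same topology.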

7. To assess the tree quality, bootstrap the tree according to Support Protocol 3.

From Current Protocols in Bioinformatics Online Copyright © 2002 John Wiley & Sons, Inc. All rights reserved.

CURRENT PROTOCOLS IN BIOINFORMATICS CHAPTER 6 INFERRING EVOLUTIONARY RELATIONSHIPS UNIT 6.3 Getting a Tree Fast: Neighbor Joining and Distance-Based Methods

SUPPORT PROTOCOL 1: DISTANCE MATRIX ESTIMATION FROM DNA (OR RNA) SEQUENCES USING DNADIST

Distance estimation is the first step in reconstructing a phylogenetic tree using a distance-based method. DNADIST, from the PHYLIP package, estimates the pairwise evolutionary distances between nucleotide sequences under various models of nucleotide substitution. These models account for hidden substitutions and incorporate knowledge about the mutation process. Distance estimation is based on the maximum-likelihood principle (Swofford et al., 1996). The choice of model matters: it influences the distance values and, in turn, the tree that is constructed. DNADIST reads a multiple sequence alignment and outputs a distance matrix. When the bootstrap procedure is used, the input file contains the pseudo-alignments one after the other, and the output file contains the corresponding pseudo-matrices in the same order.

Necessary Resources

Hardware

PHYLIP executables are available for pre-386 DOS, 386/486/Pentium DOS, Windows 3.1, Windows 95/98/NT, 68k Macintosh, or PowerMac. The PHYLIP C source code is also available for Unix, Linux, or VMS systems.

Software

DNADIST is part of the PHYLIP package. PHYLIP is available for free from http://evolution.genetics.washington.edu/phylip.html. The package contains C source code, documentation files, and a number of different types of executables. Its Web page contains information on PHYLIP and ways to transfer the executables, source code, and documentation. The documentation is remarkably clear and complete, and provides a number of useful references.

Files

DNADIST requires DNA multiple sequence alignments in PHYLIP format, as obtained from alignment programs such as ClustalX (UNIT 2.3). The first line contains the number of taxa and sites; next come the taxon data with a new line per taxon. Taxon names have ten characters and must be blank-filled to be of that length. The taxon names are followed by the sequences, which must be either "interleaved" or "sequential" (Figs. 6.3.7 and 6.3.8). Sequences may contain internal blanks, but there must be no extra blanks at the end of a line. The three symbols N, X, and ? indicate an unknown nucleotide, while a dash (-) indicates a deletion. In the case of multiple data sets, as provided by SEQBOOT, pseudo-alignments are given in the same format one after the other, without omitting the number of taxa and the number of sites at the beginning of each new set.
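As an illustration of this layout, the following Python sketch writes an alignment held in memory with one line per taxon, as in the sequential format. The function name and the input structure (a list of name and sequence pairs) are assumptions, and the sketch does not validate the nucleotide symbols.

```python
def write_phylip_sequential(records):
    """Format an alignment with one line per taxon.

    records: list of (name, sequence) pairs; all sequences must have the
    same length and names must fit in ten characters.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    (nsites,) = lengths
    # First line: number of taxa and number of sites.
    lines = [f"{len(records)} {nsites}"]
    for name, seq in records:
        if len(name) > 10:
            raise ValueError(f"taxon name longer than ten characters: {name}")
        # The name is blank-filled to ten characters, then the sequence follows.
        lines.append(name.ljust(10) + seq)
    return "\n".join(lines) + "\n"
```

For example, write_phylip_sequential([("Taxon_A", "ACGT-N"), ("Taxon_B", "ACGA-?")]) yields a three-line file whose first line gives 2 taxa and 6 sites. Several such blocks written one after the other give a multiple data set file.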

1. Download and install the PHYLIP package, and initialize a DNADIST session by double clicking on its icon.

2. At the prompt, enter the sequence alignment file name and the name for the output, which will contain the distance matrix. The default files are infile and outfile, respectively, but the authors strongly recommend redefining these files to avoid possible confusion, or deletion of previously computed files.

When a file called infile already exists in the PHYLIP directory, DNADIST does not ask for the input file and reads the existing infile. Similarly, the option of renaming the output is only given if a file called outfile already exists. If no such file exists, DNADIST automatically writes the output to a file called outfile.

3. Then the menu of Figure 6.3.9 appears, which asks for important and sensitive choices.

The remaining steps of this protocol primarily describe options that require in-depth explanation or whose default values often have to be changed. More details are given in the DNADIST documentation. To change a default value, simply type the option character. For example, typing "I" changes the sequence format from interleaved to sequential, and typing "I" again returns to the interleaved format.

Set the parameters

4. D defines the substitution model. All models assume that sites evolve independently. The four available models are nested, i.e., Jukes-Cantor is a special case of Kimura, which is a special case of F84, which is a special case of LogDet. Jukes-Cantor (Jukes and Cantor, 1969) assumes only one substitution rate, Kimura (Kimura, 1980) allows for a difference between transition and transversion rates, while F84 (Kishino and Hasegawa, 1989; Felsenstein and Churchill, 1996) is similar to Kimura but allows for different frequencies of the four nucleotides, and LogDet does not impose any restriction on the 16 rates (except those induced by the Markovian nature of the process). So LogDet (Steel, 1994) is the most flexible model, but is often overparametrized, unless the sequences are very long (say >3000). F84 (the default option) is a good compromise, notably when the base frequencies are not equal. When they are almost equal, Kimura is a good choice, while Jukes-Cantor is overly simple in most cases.

Note that all sites (informative or not) must be given to DNADIST for these models to be used in the correct way.
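To illustrate how the two simplest models turn observed differences into distances, here is a Python sketch of the standard closed-form Jukes-Cantor and Kimura two-parameter formulas for one pair of aligned sequences. It is not DNADIST code: the function names are hypothetical, and sites with gaps or unknown nucleotides are simply skipped, which is an assumption of this sketch.

```python
import math

PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def ts_tv_proportions(seq1, seq2):
    """Return (P, Q): proportions of sites showing a transition or a transversion."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in BASES and b in BASES]          # skip gaps and unknowns
    # Transition: different bases of the same chemical class.
    ts = sum(1 for a, b in pairs
             if a != b and (a in PURINES) == (b in PURINES))
    # Transversion: purine paired with pyrimidine.
    tv = sum(1 for a, b in pairs if (a in PURINES) != (b in PURINES))
    return ts / len(pairs), tv / len(pairs)


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: a single rate for all substitutions."""
    p, q = ts_tv_proportions(seq1, seq2)
    return -0.75 * math.log(1 - 4 * (p + q) / 3)


def kimura_2p(seq1, seq2):
    """Kimura (1980) distance: separate transition and transversion rates."""
    p, q = ts_tv_proportions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Both formulas return a value larger than the raw proportion of differing sites, which is the correction for hidden substitutions mentioned in the introduction to this protocol. The proportions are computed over all compared sites, including the unchanged ones, which is why all sites must be supplied.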

5. G asks whether or not the substitution rates vary across sites. Biologically speaking, the answer is clearly yes. It has been demonstrated that the Gamma distribution (Swofford et al., 1996), which is defined by a parameter usually denoted as α, is a good model to account for this variability. α was estimated between 0.05 and 1.0 for numerous data sets (Yang, 1996), which indicates that rates strongly vary across sites (variability increases as α decreases). However, the default option of DNADIST is to not correct for this variability (i.e., α = ∞), which is a common practice.

Jin and Nei (1990) recommend using α = 1.0 or 2.0. The authors of this unit have recently demonstrated (Guindon and Gascuel, 2002) that uncorrected distances are often better suited, especially when the molecular clock is more or less satisfied. Therefore, a pragmatic approach is to use the default option, and to check whether or not using a reasonable value (e.g., 1.0 or 2.0) for α changes the result. A software program to estimate the most appropriate value of α is also available via the authors' Web page (http://www.lirmm.fr/~w3ifa/MAAS/).

However, DNADIST does not use the standard α parameter, but rather the "coefficient of variation" (CV) of the rates, which is equal to 1/√α (equivalently, α = 1/CV²). One obtains CV ≈ 1.41, 1.0, and 0.71 when α = 0.5, 1.0, and 2.0, respectively. Moreover, the LogDet model cannot be combined with the gamma correction.
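The effect of the gamma correction can be illustrated on the Jukes-Cantor model, for which a simple closed-form gamma-corrected distance is known. The Python sketch below is a textbook formula used only for illustration, not the computation DNADIST performs under its default F84 model; the function name and its arguments are hypothetical.

```python
import math


def jc_distance(p, shape=None):
    """Jukes-Cantor distance from the proportion p of differing sites.

    shape: gamma shape parameter; None means no correction for rate
    variation across sites (the limit of an infinite shape parameter).
    """
    x = 1 - 4 * p / 3
    if x <= 0:
        raise ValueError("sequences are too divergent for this formula")
    if shape is None:
        return -0.75 * math.log(x)
    # Gamma-corrected form; it tends to the uncorrected one as shape grows.
    return 0.75 * shape * (x ** (-1 / shape) - 1)
```

With p = 0.3, the uncorrected distance is about 0.383, while shape values of 2.0, 1.0, and 0.5 give about 0.436, 0.500, and 0.667, respectively. The correction therefore lengthens the distances, and the more so as the shape parameter decreases and the sequences diverge.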

6. T asks for the transition/transversion ratio. The default value is 2.0, and there is no way to estimate this value within PHYLIP.

Fortunately, the results are not very sensitive to the value of this parameter (unless it is extreme). It is possible to estimate it using simple formulas from Kimura (1980).

7. C allows user-defined categories, for example to specify that third-position bases have a different rate than first and second positions. This option allows the user to make up to 9 categories of sites, but, as for the LogDet model, using too many categories can make the model overparametrized. The user is asked for the relative rates within each category. The assignment of rates to sites is then made by reading a file whose default name is "categories."

An example and more details are given in the DNADIST documentation. PHYLIP provides no program for estimating the different rates but, just as for the above ratio, the results are not very sensitive to these parameters (unless they are extreme).

8. W allows the user to select subsets of sites. It should remain "No" (the default value), unless the user wants to check the influence of various categories of sites.

See DNADIST documentation for more details.

9. F must remain as Yes in any practical situation.

10. L defines the matrix format, square (default value) or lower-triangular.

11. M has to be used in the bootstrap procedure (see Support Protocol 3). The user is then asked for the number of pseudo-alignments in the input file. Otherwise the default value (No) is required.

12. I defines the multiple sequence alignment format, which is interleaved or sequential (Fig. 6.3.7 and 6.3.8, respectively).

13. Once all options have been determined, type "Y" to compute the distance matrix.

With the working example of Figure 6.3.7 and all default values, DNADIST returns the matrix of Figure 6.3.2.


SUPPORT PROTOCOL 2: DISTANCE MATRIX ESTIMATION FROM PROTEINS USING PROTDIST
