Blocks

Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms. At least two protein sequences must be provided to make blocks. Each sequence must have a unique name of 10 characters or less. If you have the accession numbers of some sequences you would like to use, Batch Entrez can create a file for you in FASTA format.

Enter your email address if you want the results through email:

Enter a short description of your group of sequences:

Enter the name of a file containing your protein sequences:

Enter your protein sequences in a single format (e.g. FASTA):

Figure 2.2.14 The Block Maker input form obtained by clicking the Block Maker link in Figure 2.2.13 with the full-length sequences for the subclade inserted.


From Current Protocols in Bioinformatics Online Copyright © 2002 John Wiley & Sons, Inc. All rights reserved.

CURRENT PROTOCOLS IN BIOINFORMATICS CHAPTER 2 RECOGNIZING FUNCTIONAL DOMAINS UNIT 2.2 Using the Blocks Database to Recognize Functional Domains FIGURE(S)

Figure 2.2.15 The Block Maker result from the subclade selected in Figure 2.2.13 plus the corrected Dnmt2 sequence (Fig. 2.2.11) using the MOTIF motif finder.


Figure 2.2.16 The Block Maker result from the subclade selected in Figure 2.2.13 plus the corrected Dnmt2 sequence (Fig. 2.2.11) using the Gibbs motif finder.


Figure 2.2.17 The CODEHOP (COnsensus-DEgenerate Hybrid Oligonucleotide Primers) input form obtained by clicking the CODEHOP link in Figure 2.2.16 with the Block Maker Gibbs blocks inserted.


Figure 2.2.18 Part of the CODEHOP result showing suggested PCR primers with maximum degeneracy set to 32.


CURRENT PROTOCOLS IN BIOINFORMATICS CHAPTER 2 RECOGNIZING FUNCTIONAL DOMAINS UNIT 2.3 Multiple Sequence Alignment Using ClustalW and ClustalX CONTRIBUTORS AND INTRODUCTION

UNIT 2.3 Multiple Sequence Alignment Using ClustalW and ClustalX

CONTRIBUTORS AND INTRODUCTION

Contributed by Julie D. Thompson
Institut de Genetique et de Biologie Moleculaire et Cellulaire
Illkirch Cedex, France

Toby J. Gibson
European Molecular Biology Laboratory
Heidelberg, Germany

Des G. Higgins
University College
Cork, Ireland

Published Online: August 2002

The Clustal programs are widely used for carrying out automatic multiple alignment of sets of nucleotide or amino acid sequences. The most familiar version is ClustalW (Thompson et al., 1994), which uses a simple text menu system that is portable to more or less all computer systems. ClustalX (Thompson et al., 1997) features a graphical user interface and some powerful graphical utilities for aiding the interpretation of alignments, and is the preferred version for interactive usage. ClustalW and ClustalX are developed in parallel, and the same version-numbering system is used for both in order to synchronize changes (e.g., bug fixes, improvements, and additions). In January 2002, the latest version for both programs was 1.81. The programs can both be run interactively, but the protocols below give instructions on how to do this using ClustalX. Alternatively, ClustalW supports a full command-line interface which allows it to be used automatically as part of larger analyses (e.g., it can be run from scripts). In the simplest usage (see Basic Protocol), the programs are employed to take a set of homologous sequences (all DNA/RNA or all protein) and to produce a single multiple alignment. This covers the vast majority of Clustal usage and will be sufficient for most cases. Nonetheless, Clustal also has extensive facilities for adding sequences to existing alignments, merging existing alignments (so-called profile alignment as described in the Alternate Protocol), realignment of sections of alignment, detecting and fixing alignment errors, and basic phylogenetic analysis. Users may run Clustal remotely from several sites using the Web, or the programs may be downloaded to be run locally on PCs, Macintosh, or Unix computers (Support Protocol).
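Because ClustalW can be driven from the command line, a complete alignment can be launched from a script. The following Python sketch is an illustration only and is not part of the original protocol. It assumes that an executable named `clustalw` is on the search path and that the options `-INFILE`, `-ALIGN`, `-OUTFILE`, and `-OUTPUT` behave as listed in the ClustalW command-line help; check the help output of the installed version before relying on them. The function names and filenames are hypothetical.

```python
import shutil
import subprocess


def build_clustalw_command(infile, outfile, output_format=None, exe="clustalw"):
    """Return the argument list for a complete multiple alignment.

    The option names are assumed from the ClustalW command-line help;
    verify them against the installed version.
    """
    cmd = [exe, f"-INFILE={infile}", "-ALIGN", f"-OUTFILE={outfile}"]
    if output_format is not None:
        # For example "PHYLIP" or "GDE"; Clustal format is used when omitted.
        cmd.append(f"-OUTPUT={output_format}")
    return cmd


def run_alignment(infile, outfile, output_format=None):
    """Run ClustalW if it is installed; return None when it is not found."""
    cmd = build_clustalw_command(infile, outfile, output_format)
    if shutil.which(cmd[0]) is None:
        return None  # executable not found, so nothing is run
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before a long batch job is started.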


BASIC PROTOCOL: USING CLUSTALW AND CLUSTALX TO DO MULTIPLE ALIGNMENTS

The programs ClustalW and ClustalX provide alternative user interfaces to the Clustal multiple alignment software. The alignments produced by the two programs are exactly the same; the only difference between ClustalW and ClustalX is the way in which the user interacts with the program. ClustalW is now mainly used as a command-line program by Web servers and automatic batch systems, although the program does provide text menus which can be used to input sequences and perform multiple alignments. Most users who run Clustal interactively now use the graphical interface provided by ClustalX. This protocol therefore uses ClustalX (here on a Silicon Graphics Unix workstation) to illustrate the basic multiple alignment procedure. Although the example given here uses protein sequences, the same protocol can be performed with nucleic acid sequences.

Necessary Resources

Hardware

Unix (including Linux) workstation (e.g., Sun, Alpha, Silicon Graphics, PC), PC with MS Windows, or Power Macintosh

Software

ClustalW or ClustalX program (see Support Protocol)

Files

Sequences can be input to both ClustalW and ClustalX in one of seven file formats. All sequences must be in the same file. The formats that are automatically recognized are: NBRF/PIR, EMBL/Swiss-Prot, Pearson (FASTA; APPENDIX 1B), Clustal, GCG/MSF, GCG9/RSF, and GDE flat file. The sequences must be all nucleotide or all amino acid, and the program will attempt to guess which by the composition of the letters. Upper- or lowercase can be used and most symbols and numbers will be ignored (removed); unrecognized residues will be counted as X or N.

If using a word processor to prepare the input file, save the data file as plain text with line breaks—i.e., as a simple ASCII file. ClustalX cannot deal with native word processor formats.
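The input clean-up and the nucleotide-versus-protein guess described above can be sketched in a few lines of Python. This is an illustration only: the 85% threshold and the use of A, C, G, T, U, and N as the nucleotide alphabet are assumptions made here, not values taken from this unit, and the function names are hypothetical.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def clean_sequence(raw):
    """Keep letters and gap characters in uppercase; drop everything else."""
    return "".join(ch.upper() for ch in raw if ch.isalpha() or ch == "-")


def guess_sequence_type(raw, threshold=0.85):
    """Guess 'DNA/RNA' or 'protein' from letter composition (assumed heuristic)."""
    letters = [ch for ch in clean_sequence(raw) if ch != "-"]
    if not letters:
        raise ValueError("sequence contains no residues")
    # Fraction of residues that belong to the assumed nucleotide alphabet.
    fraction = sum(ch in NUCLEOTIDE_LETTERS for ch in letters) / len(letters)
    return "DNA/RNA" if fraction >= threshold else "protein"
```

A composition-based guess can be wrong for short or unusual sequences, which is one reason to keep all sequences in a file of the same type.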

1. Download and install ClustalX on your local machine (see Support Protocol).

Construct an initial alignment with the default parameters

2. Start a ClustalX session. On PC and Macintosh computers, click on the ClustalX icon. On Unix systems, at the prompt type clustalx &.

The ClustalX window will appear, as shown in Figure 2.3.1. The window on Unix or PC systems has a series of menu items across the top. For Macintosh users, the menu items are displayed at the top of the screen, separate from the ClustalX window itself. Options can be selected by moving the mouse cursor to one of the menu items and clicking the left mouse button to display the list of menu options under that item, then moving the cursor to the appropriate option and clicking the mouse button again.

3. Load sequences in ClustalX. Select Load Sequences from the File menu in the ClustalX window.

A new window will appear (Fig. 2.3.2) that displays the user's subdirectories and files.

4. Select a file containing the unaligned sequences. Use the mouse cursor to highlight the filename in the file selection window, then click the OK button at the bottom of the window.

If the selected file contains more than one sequence and these are in one of the seven recognized file formats, then the unaligned sequences will be displayed in the ClustalX window (Fig. 2.3.3) with the sequence names on the left-hand side. Figure 2.3.3 shows the sequences of five immunoglobulin superfamily domains for which the three-dimensional structures have been resolved. The sequence alignment is for display only; it cannot be edited here. A ruler is displayed below the sequences, starting at 1 for the first residue position (residue numbers in the sequence input file are ignored). The line above the alignment is used to mark strongly conserved positions. Sequence residues are colored to highlight conserved features in a multiple alignment. At this stage, as the sequences are not yet aligned, the residue coloring will not be informative. ClustalX also provides an indication of the quality of an alignment by plotting a "conservation score" below the alignment.

5. By default, the output file of the program is produced in Clustal format, which can be read by many other sequence-analysis packages. To change this, select the output format using the Output Format Options window, which is opened from the Alignment menu (Fig. 2.3.4). The user can save the final multiple alignment in one (or more than one) of six file formats: Clustal, NBRF/PIR, GCG/MSF, PHYLIP, NEXUS, or GDE. Select the output file options and close the Output Format Options window by clicking the Close button.

The different output file formats are provided for compatibility with a wide range of multiple alignment analysis programs. Users can also change the default case of the residues from lowercase to uppercase for GDE output by clicking the appropriate button in this window. Residues are not normally numbered in the output, but users can choose to use numbers here. The order of the sequences is changed to reflect the order of alignment. Crudely, this puts similar sequences beside each other in the output. This can be changed by setting the output order to be the same as the input order. Finally, the values of the parameters (e.g., gap penalties, amino acid weight matrix) can be printed out in the output file by changing the Parameter output option in this window to On.

The output files are produced as plain text or ASCII. Use a fixed-space font such as Courier to view these using a word-processing package. This ensures that the aligned residues from the different sequences will be placed neatly in columns.

6. Construct a multiple alignment of the sequences by selecting the Do Complete Alignment option from the Alignment menu. A new window will appear (Fig. 2.3.5) that displays the default filenames for the output guide tree file and the output alignment file. If required, these filenames may be edited, before clicking on the Align button.

ClustalX will perform the complete multiple alignment of the sequences shown in the window. The alignment consists of three steps: first, all the sequences are compared to each other in a pairwise fashion; next, a guide tree is created from the pairwise sequence distances and written to a file; finally, the multiple alignment is built up following the order given by the guide tree (see Background Information). The current status of the alignment process is continuously updated in the message area at the bottom of the ClustalX window. When the alignment is complete, the window display is updated to show the aligned sequences with gaps represented by "-" characters (Fig. 2.3.6).

Evaluate and realign if necessary

7. Examine the multiple alignment in the ClustalX window. The ClustalX graphical interface offers several methods of analyzing the multiple alignment (see Guidelines for Understanding Results).

First, strongly conserved positions are indicated on the line above the alignment. The "*" character indicates positions which have a single, fully conserved residue, e.g., the conserved tyrosine in column 85. The ":" and "." characters indicate that the column is "strongly" or "weakly" conserved, respectively. The definitions of strong and weak conservation are described in detail in the ClustalX documentation. These depend on the amino acid scoring system being used and can be changed by the user (see step 8). These symbols ("*", ":", and ".") are also included in the output text file when Clustal format is used.
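The "*" marker can be reproduced with a short Python sketch, given here as an illustration rather than as the ClustalX implementation. It marks only fully conserved columns; the ":" and "." markers are not computed because they depend on the residue groups defined in the ClustalX documentation. The function name and the treatment of gaps (a column containing a gap is never marked) are assumptions.

```python
def fully_conserved_line(alignment):
    """Return a marker string with '*' at columns holding one identical residue.

    alignment: list of aligned (gapped) sequences of equal length,
    all in the same letter case.
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*alignment):
        residues = set(column)
        # Fully conserved: exactly one residue type and no gap in the column.
        if len(residues) == 1 and "-" not in residues:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)
```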

Second, the sequence residues are colored either by assigning a color to specific residues (default), or on the basis of an alignment consensus. In the latter case, the alignment consensus is calculated automatically, and the residues in each column are colored according to the consensus character assigned to that column. In this way, the user can choose to highlight, for example, conserved hydrophilic or hydrophobic positions in the alignment. More details about the ClustalX color scheme and how to customize it are given in the documentation and in the on-line help. These colored alignments cannot be seen in the normal alignment output files. To print these out using the colors, produce a PostScript file (see step 12) and print it with a PostScript-capable printer.

Third, the quality curve displayed below the alignment plots a "conservation" score for each column in the alignment. A high score indicates a well conserved column; a low score indicates low conservation. The algorithm used to calculate the quality scores is described in detail in Thompson et al. (1997).

Finally, there are extensive facilities for directly highlighting sections of sequences or blocks of alignment that appear to be very unreliable or poorly aligned, or where the alignment is very ambiguous. These facilities are found under the Quality item of the main menu at the top of the ClustalX window. This is invaluable where one suspects that a sequence is not homologous to the rest of the sequences in a data set or has sequencing errors, or where one wishes to select reliably aligned regions of an alignment for further analysis.

8. Change the alignment parameters. If the alignment that is obtained using default settings is not optimal, i.e., if the alignment shows no clearly conserved blocks separated by gapped regions, or if conserved residues or motifs have been misaligned in some sequences (see Guidelines for Understanding Results), the user can modify a large number of alignment parameters. Pairwise alignment parameters will mainly affect the speed/sensitivity of the initial alignments that are used to construct the guide tree, but will not normally have a great effect on the final multiple alignment. In contrast, the multiple alignment parameters control exactly how the final multiple alignments are carried out. To modify the alignment parameters, select the Alignment Parameters option from the Alignment menu, then select either Pairwise Alignment Parameters or Multiple Alignment Parameters. Figure 2.3.7 displays the default settings.

Under Pairwise Parameters, the most important choice is that between Slow-Accurate and Fast-Approximate pairwise alignments. The Accurate alignments are carried out using a dynamic programming method (Myers and Miller, 1988; UNIT 3.1) to align every pair of sequences. This may be too slow for large numbers (e.g., >100) of long (e.g., >1000 residue) sequences. In this case, the Fast/Approximate alignments using the method of Wilbur and Lipman (1983) may be more suitable. These are several orders of magnitude faster to construct than the former and allow huge data sets to be aligned. The effects on the accuracy of the final alignments are minor except in cases where the alignment is especially difficult.
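The cost of the pairwise stage grows quadratically with the number of sequences, because every pair is aligned once: N sequences require N(N-1)/2 pairwise alignments, so 100 sequences already need 4950. The short Python sketch below shows the count and the enumeration of pairs; the function names are hypothetical and no alignment is performed.

```python
from itertools import combinations


def number_of_pairs(n_sequences):
    """Number of distinct sequence pairs among n_sequences."""
    return n_sequences * (n_sequences - 1) // 2


def all_pairs(names):
    """List every unordered pair of sequence names exactly once."""
    return list(combinations(names, 2))
```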

Under Multiple Parameters, each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the guide tree. The multiple alignment parameters window allows the user to change the scoring matrices and the penalties for opening and extending gaps in the sequences. Gap penalties usually need to be altered for aligning nucleic acids, e.g., they are likely to require reduction if divergent sequences are present in the set. In this case, a gap-opening penalty of 7.5 and a gap extension penalty of 3.33 may be more appropriate. For proteins, this is not so often the case, as there is a (hidden) scaling for divergence built into the algorithm.
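The idea of building the alignment progressively along a tree can be illustrated with a small Python sketch that derives a merge order from pairwise distances. It uses plain average-linkage clustering as a simple stand-in and is not claimed to be the tree-building method Clustal uses. Each reported step corresponds to one alignment of two sequences or groups. The function name and the dictionary-based distance input are assumptions.

```python
def merge_order(names, dist):
    """Return the order in which clusters are merged by average linkage.

    names: list of sequence names
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    clusters = [(name,) for name in names]
    steps = []

    def cluster_distance(c1, c2):
        # Average of all between-cluster sequence distances.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        left, right = clusters[i], clusters[j]
        steps.append((left, right))
        # Replace the two clusters with their union (j > i, so pop j first).
        clusters.pop(j)
        clusters.pop(i)
        clusters.append(left + right)
    return steps
```

With four sequences in which A is close to B and C is close to D, the sketch pairs A with B, then C with D, and finally joins the two pairs, which mirrors aligning the most similar sequences first.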

The Delay Divergent Sequences option delays the alignment of the most distantly related sequences. These sequences are usually the most difficult to align correctly, and it is generally better to delay their incorporation into the alignment until the more easily aligned sequences are aligned. By default, sequences sharing less than 30% residue identity with all other sequences are delayed. If this option is set to 0, the alignment will follow the guide tree exactly. For alignments containing a large number of sequences (e.g., more than 100), it may be useful to reduce the Delay option to 20% or even 10% residue identity.
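The selection rule for delayed sequences can be written as a short Python sketch, offered as an illustration of the description above rather than as the Clustal code. It assumes the pairwise percent identities are already available in a dictionary; the function name and data layout are hypothetical.

```python
def sequences_to_delay(names, identity, cutoff=30.0):
    """Return names whose percent identity to every other sequence is below cutoff.

    identity: dict mapping frozenset({a, b}) to percent identity (0-100).
    """
    delayed = []
    for name in names:
        others = [other for other in names if other != name]
        # A sequence is delayed only if it is divergent from all the others.
        if others and all(identity[frozenset((name, o))] < cutoff for o in others):
            delayed.append(name)
    return delayed
```

With a cutoff of 0 no sequence can qualify, which matches the statement that the alignment then follows the guide tree exactly.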

Invoking the Use Negative Matrix option ensures that the best matching subregion of the alignment will be found. This is a useful precaution when the sequences may be related only over a small part of their full lengths, as often occurs when a sequence set is taken directly from a database search output. However, for sequences that are related over their entire lengths, the default gives slightly (but clearly) better alignments.

For nucleic acid sequences, the Transition Weight option gives transitions (A↔G or C↔T, i.e., purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of 0 means that the transitions are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly related DNA sequences, the weight should be near zero; for closely related sequences it can be useful to assign a higher score.
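The effect of the transition weight on a single aligned base pair can be sketched in Python as follows. The end points (weight 0 gives the mismatch score, weight 1 gives the match score) come from the text; the linear interpolation between them and the placeholder match and mismatch values of 1.0 and 0.0 are assumptions made for illustration, as are the function names.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for purine-purine or pyrimidine-pyrimidine substitutions."""
    if a == b:
        return False
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES


def pair_score(a, b, transition_weight, match=1.0, mismatch=0.0):
    """Score one aligned base pair; transitions are interpolated (assumed)."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition_weight must be between 0 and 1")
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    if is_transition(a, b):
        # Weight 0 -> mismatch score, weight 1 -> match score.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversion
```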

The ClustalX alignment options are described more fully in the documentation and in the on-line help that is available by selecting the Help menu in the ClustalX window.

9. Rebuild the multiple alignment. If the pairwise parameters have been changed, it will be necessary to rebuild the complete multiple alignment, as described in step 6, in order to make a new alignment. If only the multiple alignment parameters have been changed, the first stages (pairwise alignments, guide tree) can be reused by using the Do Alignment from Guide Tree option, selected from the Alignment menu.

In the latter case, a window appears with the default filenames of the input guide tree (written during the multiple alignment process in step 6), and the output alignment file (Fig. 2.3.8). If the user changed the filenames in step 6, a similar change should be made when running the alignment from an existing guide tree. ClustalX will perform only the final multiple alignment of the sequences shown in the window. When the alignment is complete, the window display is updated to reflect the new multiple alignment.

10. Perform alignment quality control. To highlight sections of sequences or blocks of alignment that are unreliable or badly aligned in the ClustalX window, select the Show Low Scoring Segments option from the Quality menu.

Sequence segments which obtain low quality scores are displayed with white characters on a black background (Fig. 2.3.9). These segments may be due to one of various reasons—e.g., (i) partial or total misalignments caused by a failure in the alignment algorithm, (ii) partial or total misalignments because at least one of the sequences in the given set is partly or completely unrelated to the other sequences, or (iii) frameshift translation errors in a protein sequence causing local mismatched regions to be heavily highlighted. The calculation of the ClustalX alignment quality scores is described in the documentation and in the on-line help.

11. Save the alignment. During the alignment process, the final multiple alignment is automatically written to the output file. This file may be specified by the user or the default may be used (the name and the format type are normally chosen by default; see step 6). In addition, after the multiple alignment is completed, the user has the option of changing the output file format or saving only a selected part of the whole alignment and getting the output alignment written out to a file again. Select the Save Sequences As option from the File menu.

A window will appear (Fig. 2.3.10) offering the user a choice of one of the six output formats (see step 5). Options are also available to switch between Upper/Lower case for GDE files, to output Sequence Numbering for Clustal files, and to save a range of the alignment. In addition, the output filename may be specified by the user. Clicking on the OK button will save the sequence alignment to the selected file.
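Saving a range of the alignment amounts to keeping the same block of columns from every aligned sequence. The Python sketch below illustrates this with 1-based, inclusive column numbers, matching the ruler that starts at 1; it is not the ClustalX code, and the function name and dictionary layout are hypothetical.

```python
def alignment_range(alignment, first, last):
    """Return columns first..last (1-based, inclusive) of every aligned sequence.

    alignment: dict mapping sequence name to its aligned (gapped) string.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(next(iter(alignment.values())))
    if not 1 <= first <= last <= length:
        raise ValueError("range must lie within the alignment")
    # Python slices are 0-based and end-exclusive, hence first - 1 and last.
    return {name: seq[first - 1:last] for name, seq in alignment.items()}
```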

12. Create a PostScript image of the alignment. The ClustalX alignment display can be saved in a PostScript file, which can then be either sent directly to a printer or loaded into a graphics-editing program. This is done by selecting the Write alignment as PostScript option from the File menu.

A window will appear with a number of options for customizing the PostScript output (Fig. 2.3.11). The options are explained in detail in the ClustalX documentation and on-line help. The file will automatically include the colored sequences, and the consensus and ruler lines. The Alignment Quality curve can be optionally included in the output file.

ALTERNATE PROTOCOL: USING CLUSTALW AND CLUSTALX FOR PROFILE ALIGNMENTS

ClustalW and ClustalX allow the user to reuse an old alignment and add new sequences to it, or even merge two alignments together. This is known as profile alignment (the term profile analysis was first used by Gribskov et al., 1987). This is useful in any ongoing project where new sequences are being generated and alignments need updating. Adding new sequences to an old alignment has two advantages. First, it is much faster than redoing the alignment from scratch each time. Second, the original sequence alignment is kept intact, which is especially useful if the alignment has been hand-edited. A profile is simply an alignment of one or more sequences (e.g., an alignment output file from Clustal). One or both sets of input sequences may include secondary structure assignments or gap penalty masks to guide the alignment. Profile alignment allows the user to read in an old alignment (in any of the allowed input formats) and align one or more new sequences to it.

Necessary Resources

Hardware

Unix (including Linux) workstation (e.g., Sun, Alpha, Silicon Graphics, PC), PC with MS Windows, or Power Macintosh

Software

ClustalW or ClustalX program (see Support Protocol)

Files

Sequences and existing alignments can be input to both ClustalW and ClustalX in one of seven file formats. All sequences must be in the same file. The formats that are automatically recognized are: NBRF/PIR, EMBL/Swiss-Prot, Pearson (FASTA; APPENDIX 1B), Clustal, GCG/MSF, GCG9/RSF, and GDE flat file. In the examples here, unaligned sequences are in FASTA format and existing alignments are in Clustal and GCG/MSF formats.
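As a rough illustration of what automatic recognition of an input format can look like, the Python sketch below guesses the format from the first non-blank line of a file. The signatures it checks are the typical opening lines of each format; they are assumptions made for this sketch and are not taken from the ClustalW/ClustalX source, which may apply different or stricter tests. The function name guess_format is hypothetical.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank line (heuristic)."""
    first = next((line for line in text.splitlines() if line.strip()), "")
    if first.startswith("CLUSTAL"):
        return "Clustal"
    if first.startswith("!!RICH_SEQUENCE"):
        return "GCG9/RSF"
    if first.startswith(("!!AA_MULTIPLE_ALIGNMENT",
                         "!!NA_MULTIPLE_ALIGNMENT", "PileUp")):
        return "GCG/MSF"
    if first.startswith("ID "):
        return "EMBL/Swiss-Prot"
    if first.startswith(">"):
        # NBRF/PIR headers look like ">P1;NAME"; other ">" lines are
        # treated as Pearson (FASTA) headers.
        if len(first) > 3 and first[3] == ";":
            return "NBRF/PIR"
        return "Pearson (FASTA)"
    if first.startswith(("#", "%")):
        return "GDE flat file"
    return "unknown"


print(guess_format(">P1;HBA_HUMAN\nHemoglobin alpha\nMVLSPADK*\n"))
print(guess_format(">seq1 example\nMVLSPADK\n"))
print(guess_format("CLUSTAL W (1.82) multiple sequence alignment\n"))
```

A heuristic of this kind is only a convenience check before loading a file; when in doubt, the file should be opened in ClustalX itself.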

Merge two existing alignments

1. Download and install ClustalX on a local machine (see Support Protocol).

2. Start a ClustalX session (see Basic Protocol, step 2) and switch to Profile Alignment Mode by clicking on the Multiple Alignment Mode toggle button just above the sequence display area.

The single sequence display area will be replaced by two display areas (Fig. 2.3.12). Initially, both areas are empty.

3. Load the first profile by selecting the Load Profile 1 option from the File menu. A file selection window will appear, allowing the user to select a file. The procedure is similar to that used for loading unaligned sequences (see Basic Protocol, steps 3 to 4). Profile 1 should contain a single sequence or an existing alignment of two or more sequences, e.g., an alignment file that was produced by ClustalX at an earlier stage (these file names have the extension .aln).

The selected alignment will be displayed in the top half of the ClustalX window (Fig. 2.3.13). See Basic Protocol, step 4, for a description of the alignment display. In Figure 2.3.13, the alignment consists of immunoglobulin superfamily domain sequences, generated with default parameters.

4. Load the second profile by selecting the Load Profile 2 option from the File menu. The procedure is the same as that used for loading the first profile. Profile 2 should contain a single sequence or several aligned sequences.

The selected alignment will be displayed in the bottom half of the ClustalX window (Fig. 2.3.14). The example alignment shown here contains sequences belonging to the C-2-type subfamily of the immunoglobulins.

5. Optional: Supply secondary structure and/or gap penalty masks with the input sequences used during profile alignment (note that the secondary structure information is not used during multiple sequence alignment).

The secondary structure elements can be read from Swiss-Prot, Clustal, or GDE format input files. For many 3-D protein structures, secondary structure information is recorded in the feature tables of Swiss-Prot database entries and ClustalX recognizes Swiss-Prot HELIX and STRAND assignments. Alternatively, the Clustal or GDE files can be edited manually. The format for the masks is described in the documentation and in the on-line help.

ClustalX reads the structure or gap penalty masks automatically when a profile is loaded in Profile Alignment Mode and displays the information in the ClustalX window above the alignment display (Fig. 2.3.15). The masks work by raising gap penalties in specified regions (typically secondary structure elements) so that gaps are preferentially opened in the less well conserved regions (typically surface loops). The values for raising the gap penalty at particular secondary structure elements may be modified using the Alignment Parameters, Secondary Structure Parameters options from the Alignment menu.
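The effect of a mask on gap placement can be pictured with a minimal Python sketch. It assumes a simple model in which each mask symbol multiplies a base gap opening penalty; the mask symbols (H, E, and "."), the multiplier values, and the base penalty are all invented for illustration and are not the ClustalX mask format or its default parameters.

```python
def position_gap_penalties(base_penalty, mask, multipliers):
    """Return one gap opening penalty per alignment column.

    Columns whose mask symbol appears in `multipliers` get a raised
    penalty; all other columns keep the base penalty.
    """
    return [base_penalty * multipliers.get(symbol, 1.0) for symbol in mask]


# Hypothetical mask: H = helix, E = strand, . = loop or unassigned.
mask = "..HHHH...EEE.."
multipliers = {"H": 4.0, "E": 4.0}  # illustrative values only

penalties = position_gap_penalties(10.0, mask, multipliers)

# Columns where opening a gap is cheapest; these fall outside the
# helix and strand regions.
lowest = min(penalties)
cheapest_columns = [i for i, p in enumerate(penalties) if p == lowest]
print(cheapest_columns)
```

Under this model, an aligner that minimizes total gap cost will prefer to open gaps in the unmasked columns, which is the behavior the masks are intended to encourage.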

6. Align the two profiles by selecting Align Profile 2 to Profile 1 from the Alignment menu. A window will appear (Fig. 2.3.16) that displays the default filenames for the output guide tree files and the output alignment file. If required, these filenames may be edited by the user before clicking on the Align button.

ClustalX will align the two profiles together to form a single multiple alignment. The original alignments are not altered. The two profiles are simply aligned together by introducing complete columns of gaps into one or both of the profiles. The current status of the alignment process is continuously updated in the message area at the bottom of the ClustalX window. When the alignment is complete, the window display areas are updated to show the aligned profiles. Clicking on the Lock Scroll button just above the top display area will remove the horizontal scroll bar from the top display area (Fig. 2.3.17). The single remaining scroll bar at the bottom of the window will then allow both profile display areas to be scrolled together.

A second option is to align the sequences from the second profile, one at a time, to the first profile. This is useful for incorporating a set of new, unaligned sequences into an older alignment. The procedure is very similar to that used above to merge two existing alignments. In this case, however, the second profile should contain one or more unaligned sequences. Each sequence is aligned individually with the existing alignment, starting with the most closely related. In step 6 above, align the sequences to profile 1 by selecting the Align Sequences to Profile 1 option from the Alignment menu.
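Both profile alignment modes are also available from the ClustalW command line. The Python sketch below only assembles the argument list; it assumes the options -profile1, -profile2, -profile, -sequences, and -outfile as listed in the ClustalW command-line help, an executable named clustalw on the search path, and hypothetical file names. Check the help output of the installed version before relying on it.

```python
def profile_command(profile1, profile2, outfile,
                    mode="profile", program="clustalw"):
    """Build a ClustalW argument list for a profile alignment run.

    mode="profile"   merges two existing alignments.
    mode="sequences" adds the sequences of profile2 to profile1
                     one at a time.
    """
    if mode not in ("profile", "sequences"):
        raise ValueError("mode must be 'profile' or 'sequences'")
    return [
        program,
        f"-profile1={profile1}",
        f"-profile2={profile2}",
        f"-{mode}",
        f"-outfile={outfile}",
    ]


# Hypothetical file names, for illustration only.
merge_cmd = profile_command("ig_domains.aln", "c2_subfamily.aln", "merged.aln")
add_cmd = profile_command("ig_domains.aln", "new_seqs.fasta", "updated.aln",
                          mode="sequences")
print(" ".join(merge_cmd))
print(" ".join(add_cmd))
```

Once ClustalW is installed, either list can be passed to subprocess.run(cmd, check=True) to run the alignment.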

7. Merge the two profiles by switching back to multiple alignment mode using the toggle button just above the top sequence display area.

The sequences from both profiles are merged into a single alignment (Fig. 2.3.18).
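The alignment and merge steps above can be pictured with a small Python sketch. It shows why the original alignments are not altered: only whole columns of gaps are inserted, so every residue keeps its column relative to the other sequences of the same profile. The column position is chosen by hand here, whereas ClustalX chooses it by scoring the two profiles against each other; the function names and toy sequences are hypothetical.

```python
def insert_gap_columns(profile, positions):
    """Insert a complete column of gaps before each 0-based column index."""
    padded = {}
    for name, seq in profile.items():
        chars = list(seq)
        # Work from right to left so earlier indices stay valid.
        for pos in sorted(positions, reverse=True):
            chars.insert(pos, "-")
        padded[name] = "".join(chars)
    return padded


def merge_profiles(first, second):
    """Combine two profiles of equal width into one alignment."""
    widths = {len(seq) for seq in list(first.values()) + list(second.values())}
    if len(widths) != 1:
        raise ValueError("profiles must have the same number of columns")
    if first.keys() & second.keys():
        raise ValueError("sequence names must be unique across profiles")
    return {**first, **second}


profile1 = {"a": "MKTLLV", "b": "MK-LLV"}
profile2 = {"c": "MKTALLV", "d": "MKT-LLV"}

# Profile 2 is one column wider, so profile 1 receives one gap column.
padded1 = insert_gap_columns(profile1, [3])
merged = merge_profiles(padded1, profile2)
for name, seq in merged.items():
    print(f"{name}  {seq}")
```

Deleting the inserted column from the padded profile recovers the original profile 1 exactly, which is the sense in which the input alignments are preserved.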

SUPPORT PROTOCOL: OBTAINING THE CLUSTALW AND CLUSTALX PROGRAMS
