Necessary Resources

Hardware

Unix/Linux workstation with at least 256 MB RAM (recommended)

Software

geneid v1.1 full distribution (see Support Protocol)

Files

All of the sequences used within this unit have been extracted from the draft of the human genome (release August, 2001, University of California, Santa Cruz) and can be found in the samples subdirectory within the geneid distribution (see Support Protocol). These sequences can also be found on the Current Protocols in Bioinformatics Web site at http://www.currentprotocols.com/. Files used throughout this unit are:

example1.fa (contains a 32-kb region of human chromosome 21)

example2.fa (contains a 47-kb region of human chromosome 22)

example3.fa (contains a 32-kb region of human chromosome 15)

example2.evidences.gff, example3.EST1.gff, example3.EST2.gff, example3.EST3.gff, and example3.promoter.gff (contain annotated gene features on the above sequences)

NOTE: In a Unix system, the syntax to use geneid is:

geneid [options] -P parameter_file input_sequence

where parameter_file is a file containing gene-model parameters for a given species (or taxonomic group), which the user normally downloads with the geneid distribution, and input_sequence is a file containing a DNA sequence in FASTA format (APPENDIX 1B). A number of options allow modification of the geneid default behavior.

The following assumes that geneid has been successfully installed in a directory within the file system (the geneid directory), and that this directory is the current working directory (see Support Protocol).

NOTE: An introduction to the Unix environment can be found in APPENDIX 1C.

1. Run geneid on the first example (example1.fa) with default options:

%geneid -P param/human3iso.param samples/example1.fa

geneid is a Unix command-line program that requires as input a file containing a DNA sequence in FASTA format (samples/example1.fa; see APPENDIX 1B for discussion of FASTA format) and a parameter file. The parameter file is specified by using the option -P followed by its name. geneid provides parameter files for human (this example), Drosophila melanogaster, and other species in the directory param/. By default, geneid produces results in plain text, which are sent to the standard output (Unix terminal). These can then be redirected to a file or another program. In particular, they can serve as input to programs producing graphical visualization of genomic annotations, such as gff2ps or apollo (see Basic Protocol 2).
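The redirection of geneid output to a file can also be scripted. The following Python sketch builds the command line according to the syntax given above and sends standard output to a file. The helper names build_geneid_command and run_geneid are hypothetical and are not part of the geneid distribution; the sketch also assumes that the geneid executable can be found on the PATH and that the geneid directory is the current working directory.

```python
import subprocess
from pathlib import Path


def build_geneid_command(param_file, sequence_file, options=()):
    """Return the argument list for: geneid [options] -P parameter_file input_sequence."""
    return ["geneid", *options, "-P", str(param_file), str(sequence_file)]


def run_geneid(param_file, sequence_file, output_file, options=()):
    """Run geneid and redirect its standard output to output_file.

    Assumption: the geneid executable is on the PATH. A non-zero exit
    status raises subprocess.CalledProcessError.
    """
    command = build_geneid_command(param_file, sequence_file, options)
    with Path(output_file).open("w") as handle:
        subprocess.run(command, stdout=handle, check=True)
```

For instance, run_geneid("param/human3iso.param", "samples/example1.fa", "example1.geneid") would store the default output of step 1 in a file named example1.geneid (an arbitrary name chosen for illustration).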

2. Examine the results returned by geneid.

By default, geneid output consists of a series of genes predicted along the input sequence. geneid uses its own default output format. Other, more standard, formats can be specified via command-line options (see steps 5, 6, and 7). Predicted genes are described as lists of potential coding exons. For sequence example1, geneid predicts an eight-exon gene (see Fig. 4.3.1 for plain text output and Figs. 4.3.5 and 4.3.6 for graphical representations).

Each exon is defined by a start signal (start codon or acceptor site), an end signal (donor site or stop codon), the strand, and the frame. Each exon (as well as each signal) is assigned a score. The score depends on the scores of the defining sites and on the nucleotide composition of the exon sequence, and measures the likelihood of the exon (see Background Information). The score of a gene is the sum of the scores of its exons. geneid predicts only the coding fraction of a gene. Thus, geneid defines four classes of exons: First, Internal, Terminal, and Single (corresponding to single-exon or intronless genes). A multiexon gene starts with a First exon (start codon to donor site), followed by any number (possibly zero) of Internal exons (acceptor site to donor site), and ends with a Terminal exon (acceptor site to stop codon). An intronless gene is constituted by a Single exon (start codon to stop codon).
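The gene structure rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not geneid's own implementation; the Exon class, the function names, and the coordinates and scores used in the usage example are all hypothetical. Exons are assumed to be listed in transcription order (5' to 3').

```python
from dataclasses import dataclass


@dataclass
class Exon:
    kind: str     # "First", "Internal", "Terminal", or "Single"
    start: int    # hypothetical coordinate fields, for illustration only
    end: int
    score: float


def is_valid_gene(exons):
    """Check that the exon classes follow First, Internal*, Terminal, or a lone Single."""
    kinds = [exon.kind for exon in exons]
    if kinds == ["Single"]:
        return True
    if len(kinds) < 2:
        return False
    return (
        kinds[0] == "First"
        and kinds[-1] == "Terminal"
        and all(kind == "Internal" for kind in kinds[1:-1])
    )


def gene_score(exons):
    """The score of a gene is the sum of the scores of its exons."""
    return sum(exon.score for exon in exons)
```

With made-up values, a gene built from Exon("First", 100, 250, 1.5), Exon("Internal", 400, 520, 2.25), and Exon("Terminal", 700, 810, 0.75) passes is_valid_gene and has a gene_score of 4.5.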

Lines starting with the # character do not correspond to coding exons, but provide additional information about the prediction. At the top of the output, two lines starting with the characters ## display general information on the geneid process. After this main header, the line beginning with # Sequence displays the name and the length of the input sequence, whereas the line starting with # Optimal Gene Structure contains the number of genes predicted along the input sequence as well as the total score of the prediction, which is the sum of the scores of the predicted genes. Then, lines starting with # Gene provide general information on each gene: gene identifier, strand (forward or reverse), number of exons, gene product length, and gene score. After this, there is a line for each coding exon in the gene with the fields (from left to right) defined as in Table 4.3.1.

After the set of lines corresponding to the exons of the predicted gene, the amino acid sequence of the gene is printed in FASTA format (APPENDIX 1B).

The frame and remainder (see Table 4.3.1) of an exon are the number of hanging nucleotides not included in complete codons at the left/right ends of exons when these are assembled into a gene. The formal definition of geneid frame is "The number of nucleotides (0,1,2) from the first nucleotide in the exon to the first nucleotide in the first complete codon in the same exon." The remainder is defined in geneid as "The number of nucleotides left (0,1,2) after the last complete codon has been translated from the exon sequence, given its frame." By definition, then, all First exons have frame 0 (as in Fig. 4.3.2), and all Terminal exons have remainder 0.
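The two definitions can be expressed as simple arithmetic. The Python sketch below is derived from the quoted definitions and is not taken from the geneid source code; the function names are hypothetical. The compatibility check is an inference from the definitions: when two exons are joined, the hanging nucleotides at the end of the upstream exon and at the start of the downstream exon must together make up either no codon or exactly one codon.

```python
def remainder(length, frame):
    """Nucleotides left after the last complete codon, given the exon length and frame."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    if length < frame:
        raise ValueError("exon is shorter than its frame")
    return (length - frame) % 3


def frames_compatible(upstream_remainder, downstream_frame):
    """True when the hanging nucleotides of two adjacent exons add up to whole codons."""
    return (upstream_remainder + downstream_frame) % 3 == 0
```

For example, an exon of 100 nucleotides with frame 0 contains 33 complete codons and has remainder 1, so under this reasoning the next exon in the gene would need frame 2.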

3. Obtain the set of predicted Start codons along the input sequence by typing:

%geneid -P param/human3iso.param -bo samples/example1.fa

In addition to the predicted genes, geneid provides a number of options that allow the investigator to print an exhaustive list of all the sequence signals and exons predicted along the query sequence (most of which are not included in the final gene prediction). These options can be useful, for instance, to carry out a detailed analysis of a small genomic region for potential alternative splice sites. If only information on these sites and exons is required, it may be advisable to use the option -o, which switches off the gene-assembly engine and therefore reduces memory use and running time.

In the example, the results of which are shown in Figure 4.3.2 (top), potential start codons are displayed by using the option -b. Other signals such as Stop codons, Acceptor splice sites, or Donor splice sites can be printed using the options -e, -a, or -d, respectively. All options can be specified at once—e.g., geneid -bead (in any order)—and geneid then produces the exhaustive list of all potential sequence signals. For large (and not so large) genomic sequences, this can produce very large outputs.

Each signal is printed in a separate record (line) with the following fields: type of signal, position, score, strand, and signal sequence. As geneid internally splits the input sequence into consecutive fragments, signals found both in forward and reverse strands are displayed for every fragment, after a header specifying the fragment positions in the input sequence.
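When geneid is called from a script, the signal options described in this step can be assembled programmatically. The following Python sketch maps signal names to the option letters given above and appends -o unless gene assembly is wanted. The dictionary keys and the function name signal_option are hypothetical conveniences, not part of geneid.

```python
# Option letters as described in this step; the dictionary keys are illustrative names.
SIGNAL_FLAGS = {
    "start": "b",     # Start codons
    "stop": "e",      # Stop codons
    "acceptor": "a",  # Acceptor splice sites
    "donor": "d",     # Donor splice sites
}


def signal_option(signals, assemble_genes=False):
    """Combine the requested signal flags into a single option string such as -bo."""
    letters = "".join(SIGNAL_FLAGS[name] for name in signals)
    if not letters:
        raise ValueError("at least one signal type is required")
    if not assemble_genes:
        letters += "o"  # -o switches off the gene-assembly engine
    return "-" + letters
```

Here signal_option(["start"]) returns -bo, the option used in this step, and signal_option(["start", "stop", "acceptor", "donor"], assemble_genes=True) returns -bead.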

4. Obtain the set of predicted First exons along the input sequence by typing:

%geneid -P param/human3iso.param -fo samples/example1.fa

geneid can also print all candidate exons along the query sequence. The options -f, -i, -t, and -s are provided to print the predicted exons of each class (First, Internal, Terminal, and Single). The options can be combined to print more than one class of exons. In such a case, exons are printed separately by class. If exons of all classes are to be printed, it is advisable to use just the option -x, which prints the list of all exons sorted by position. As shown in Figure 4.3.2 (bottom), each predicted exon is printed in a separate record containing the fields 1 to 11 as described in Table 4.3.1, plus the length and the amino acid sequence of the exon.

5. Obtain a more complete output by using the option -X:

%geneid -P param/human3iso.param -X samples/example1.fa

By using the option -X, geneid produces a more exhaustive output of the gene prediction. Each exon is now described in three different lines (Fig. 4.3.3). The first one describes the exon start signal (as in step 3), the second line describes the exon itself (as in step 4), and the third line describes the exon end signal (as in step 3).

6. Obtain the predicted genes in the GFF format by using the option -G:

%geneid -P param/human3iso.param -G samples/example1.fa > geneid_output.gff

General Feature Format or GFF (http://www.sanger.ac.uk/Software/formats/GFF/) is a proposed standard format for describing genes and other features associated with DNA, RNA, and protein sequences. Each feature is described as a list of fixed fields or columns delimited by tabs. This format is very easy to parse by bioinformatics applications. There are a number of tools, including visualization ones, that can process GFF files (see Basic Protocol 2). geneid produces GFF-compliant output with the option -G. This option can be applied to any set of gene features selected to be printed (as in steps 3 and 4). The result is shown in Figure 4.3.4. A set of standardized ## lines appears at the top of the GFF file (the GFF header). Then, following the same structure as the geneid default format, lines starting with the character # (assumed to be free-format comments in GFF) are used to provide general information on each predicted gene. GFF records provide information about the predicted gene features (from left to right): sequence name, source (the gene prediction program geneid in this case), feature (type of exon), start and end positions, score, strand, frame, and group (gene to which the exon belongs).
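Because GFF records are tab-delimited with a fixed field order, they are straightforward to read in a script. The following Python sketch parses the nine fields listed above and skips header and comment lines. The names GffRecord and parse_gff are hypothetical, and the record in the usage note is made up for illustration rather than copied from actual geneid output.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the field is "."
    strand: str
    frame: str              # kept as text, since GFF allows "." here
    group: str


def parse_gff(lines):
    """Parse tab-delimited GFF lines, ignoring blank lines and lines starting with #."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            raise ValueError("expected 9 tab-delimited fields: " + line)
        seqname, source, feature, start, end, score, strand, frame = fields[:8]
        group = "\t".join(fields[8:])
        records.append(
            GffRecord(
                seqname, source, feature, int(start), int(end),
                None if score == "." else float(score),
                strand, frame, group,
            )
        )
    return records
```

For instance, applied to the file written in this step, parse_gff(open("geneid_output.gff")) would return one GffRecord per predicted exon. A made-up line such as "example1<TAB>geneid<TAB>First<TAB>100<TAB>250<TAB>1.50<TAB>+<TAB>0<TAB>gene_1" (with real tab characters) would yield a record whose feature is First and whose start and end are 100 and 250.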

7. Obtain the same output in the XML format using the command:

%geneid -P param/human3iso.param -M samples/example1.fa

Extensible Markup Language or XML (http://www.w3.org/XML/) is a language developed from the experience obtained in the creation of SGML (Standard Generalized Markup Language) and HTML (Hypertext Markup Language), the more widely used of the two on the Internet. XML is primarily a format for transferring information between computer programs rather than for direct human reading. Many parsing and displaying methods are available for XML, which makes it a powerful format to create Web documents. geneid supports XML format for predicted genes by means of the option -M. The DTD (Document Type Definition) of geneid XML documents can be printed with the option -m.

8. Examine the complete list of available options by using the option -h:

%geneid -h

The most relevant options that have not been discussed are:

-v (verbose): This produces real-time detailed information while geneid is processing the input sequence.

-W -C (forward / reverse): This forces prediction in only one strand of the sequence.

-D (CDS sequence): This prints the DNA coding sequence of predicted genes.

-O -r -S (external features): By means of these options, additional information can be provided to geneid in order to modify the "ab initio" prediction (see Basic Protocol 3).

From Current Protocols in Bioinformatics Online Copyright © 2002 John Wiley & Sons, Inc. All rights reserved.

