Is there a generalizable recipe for the analysis of large genomic stretches that applies to all organisms? Organisms differ enormously in the composition and overall structure of their genomes. Although intergenic and intronic sequences generally show a more or less pronounced difference in AT content compared with exons, genomic AT content itself varies over a wide range. For example, the genome of Borrelia burgdorferi has a GC content of only 20 per cent, whereas that of Streptomyces coelicolor is 69 per cent.
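GC content, as used in comparisons like the one above, is simply the fraction of G and C bases in a sequence. A minimal sketch (the function name and the toy sequences are illustrative, not taken from the text or from real genomic data):

```python
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return (counts["G"] + counts["C"]) / len(seq)

# Toy examples: an AT-rich and a GC-rich stretch
print(gc_content("ATATATTAGC"))  # 0.2
print(gc_content("GGCCGGCCGG"))  # 1.0
```

In practice one would run such a calculation over whole genomes or in sliding windows rather than on short strings.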
Even between organisms with comparable AT contents within coding regions, nucleotide usage often differs markedly. On the one hand, this allows DNA from different organisms to be distinguished on the basis of hexanucleotide frequency; on the other, it complicates analysis procedures and gene detection. Underlying parameters for gene detection, such as GC content and coding probability, therefore differ among organisms. To achieve maximum fidelity in gene detection, the programs have to be trained and adjusted individually for each organism, as even subtle individual peculiarities accumulate and contribute to errors in gene prediction.
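A hexanucleotide profile of the kind alluded to above can be built by counting overlapping 6-mers and normalizing to relative frequencies. This is a generic sketch, not any particular gene-finder's implementation (the function name is ours):

```python
from collections import Counter

def hexamer_frequencies(seq: str) -> dict[str, float]:
    """Relative frequencies of overlapping hexanucleotides in a sequence."""
    seq = seq.upper()
    kmers = [seq[i:i + 6] for i in range(len(seq) - 5)]
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: n / total for kmer, n in counts.items()}
```

Comparing two such frequency vectors (for example by a simple distance measure) is one way of telling DNA from different organisms apart, and the same statistics feed into organism-specific training of gene-prediction programs.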
A special case is the nucleotide and codon composition of genes in monocotyledonous plants, e.g. grasses such as rice, maize, wheat or barley. Most grass genes contain a gradient of GC content along their length, with a higher GC content at the 5' part of the gene than at the 3' part. Comparison of orthologous genes between dicotyledonous and monocotyledonous plants typically shows a pronounced slope in GC content (Wong et al., 2002). This peculiarity poses a challenge for gene identification and the correct delineation of gene structure, as the computer programs have to cope with pronounced variation in both GC content and codon usage along the 5' to 3' course of individual genes.
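The 5' to 3' GC gradient described above can be made visible with a sliding-window GC profile along a gene. A minimal sketch (window and step sizes, and the synthetic test sequence, are arbitrary choices for illustration):

```python
def gc_profile(seq: str, window: int = 50, step: int = 10) -> list[float]:
    """GC fraction in sliding windows from the 5' to the 3' end."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        gc = sum(base in "GC" for base in win)
        profile.append(gc / window)
    return profile

# Synthetic gene with a GC-rich 5' half and an AT-rich 3' half
toy_gene = "G" * 50 + "A" * 50
print(gc_profile(toy_gene, window=50, step=50))  # [1.0, 0.0]
```

A monotonically decreasing profile of this kind is the signature a gene finder must accommodate in grass genes, since a single fixed GC or codon-usage model fits neither end of the gene well.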