Gene Identification: Methods and Considerations

Andreas D. Baxevanis1

1 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 6.6
DOI:  10.1002/0471142905.hg0606s29
Online Posting Date:  August, 2001
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

This unit introduces readers to some of the more commonly used techniques for gene identification. The author discusses the general problem of accurately predicting genes in both prefinished and finished sequence data, provides a hands‐on description of programs available in the public domain, and suggests strategies for tackling the prediction problem at various stages of data generation and assembly.

     
 

Table of Contents

  • Gene Identification Methods
  • How Well do the Methods Work?
  • Strategies and Considerations
  • Literature Cited
  • Figures
  • Tables
     
 


Figures

  •   Figure 6.6.1 The central dogma of molecular biology. Proceeding from the DNA through the RNA to the protein level, various sequence features and modifications can be identified that can be used in the computational deduction of gene structure. These include the presence of promoter and regulatory regions, intron‐exon boundaries, and both start and stop signals. Unfortunately, these signals are not always present, and when present may not always be in the same form or context. The reader is referred to the text for greater detail.
  •   Figure 6.6.2 XGRAIL output obtained using the human BAC clone RG364P16 from 7q31 as the query. The upper window shows the results of the prediction, with the histogram representing the probability that a given stretch of DNA is an exon. The various colored bars in the center represent features of the DNA (e.g., arrows represent repetitive DNA, and vertical bars represent repeat sequences). Exon and gene models, protein translations, and the results of a genQuest search using the protein translation are shown. The reader is referred to UNIT for more details on the interpretation of XGRAIL output.
  •   Figure 6.6.3 FGENES output obtained using the human BAC clone RG364P16 from 7q31 as the query. The columns, going from left to right, represent the gene number (G), strand (Str), feature (described in the main text), start and end points for the predicted exon, a scoring weight, and start and end points for corresponding open reading frames (ORF‐start and ORF‐end). Each predicted gene is shown as a separate block. Following the tables are protein translations of any predicted gene products.
  •   Figure 6.6.4 MZEF output obtained using the human BAC clone RG364P16 from 7q31 as the query. The columns, going from left to right, give the location of the prediction as a range of included bases (Coordinates), the probability value (P), frame preference scores (Fr i), an ORF indicator showing which reading frames are open, and scores for the 3′ splice site, coding regions, and 5′ splice site.
  •   Figure 6.6.5 GENSCAN output obtained using the human BAC clone RG364P16 from 7q31 as the query. The columns, going from left to right, represent the gene and exon number (Gn.Ex), the type of prediction (Type), the strand on which the prediction was made (S, with + indicating the forward strand and − as the reverse), the beginning and endpoint for the prediction (Begin and End), the length of the prediction (Len), the reading frame of the prediction (Fr), several scoring columns, and the probability value (P). Each predicted gene is shown as a separate block; notice that the third gene has its exons listed in reverse order, reflecting the fact that the prediction is on the reverse strand. Following the tables are the protein translations for each of the three predicted genes.
  •   Figure 6.6.6 GENSCAN output in graphical form, obtained using the human BAC clone RG364P16 from 7q31 as the query. Optimal and suboptimal exons are indicated, and the initial and terminal exons show the direction in which the prediction is being made (5′ → 3′ or 3′ → 5′).
  •   Figure 6.6.7 Comparison of GENSCAN with GenomeScan, using the human BRCA1 gene sequence as the query. The GENSCAN prediction (top line) is missing a number of the exons that appear in the annotation for the BRCA1 gene (second line; GenBank L78833), and the GENSCAN prediction is slightly longer than the actual gene at the 5′ end. The inclusion of BLASTX hit information (vertical bars closest to the scale) in GenomeScan produces a more complete and accurate prediction (third line).
  •   Figure 6.6.8 Sensitivity vs. specificity. In the upper portion of the figure, the four possible outcomes of a prediction are shown: a true positive (TP), a true negative (TN), a false positive (FP), and a false negative (FN). The matrix at the bottom of the figure shows how both sensitivity and specificity are determined from these four possible outcomes, giving a tangible measure of the effectiveness of any gene prediction method. (Figure adapted from Burset and Guigó and from Snyder and Stormo.)
  •   Figure 6.6.9 Annotated output from GeneMachine showing the results of multiple gene prediction program runs. NCBI Sequin is used as the viewer. At the top of the output are the results from various BLAST runs (BLASTN vs. dbEST, BLASTN vs. nr, and BLASTX vs. SWISS‐PROT). Towards the bottom of the window are the results from the predictive methods (FGENES, GENSCAN, MZEF, and GRAIL 2). Annotations indicating the strength of the prediction are preserved and shown wherever possible within the viewer. Putative regions of high interest are those where hits from the BLAST runs line up with exon predictions from the gene prediction programs.
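
The four outcomes described for Figure 6.6.8 can be turned into numbers with a few lines of code. The sketch below is illustrative only and is not taken from the unit: it assumes the nucleotide‐level convention of Burset and Guigó, in which sensitivity is TP/(TP + FN) and "specificity" is TP/(TP + FP), which may differ from the matrix shown in the figure. The function names and the example coordinates are hypothetical.

```python
def exon_positions(exons):
    """Expand inclusive (start, end) exon intervals into a set of base positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def confusion_counts(predicted, actual, length):
    """Return (TP, FP, FN, TN) per nucleotide for a sequence of the given length.

    `predicted` and `actual` are sets of positions called coding by the
    program and by the reference annotation, respectively.
    """
    tp = len(predicted & actual)   # coding in both
    fp = len(predicted - actual)   # predicted coding, annotated noncoding
    fn = len(actual - predicted)   # annotated coding, missed by the program
    tn = length - tp - fp - fn     # noncoding in both
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of truly coding bases that were predicted as coding."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Fraction of predicted coding bases that are truly coding
    (assumed gene-finding convention; equivalent to precision)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    # Hypothetical 100-bp sequence with two annotated exons.
    actual = exon_positions([(10, 19), (40, 49)])
    predicted = exon_positions([(15, 24), (40, 49)])
    tp, fp, fn, tn = confusion_counts(predicted, actual, 100)
    print("TP, FP, FN, TN:", tp, fp, fn, tn)
    print("Sensitivity:", sensitivity(tp, fn))
    print("Specificity:", specificity(tp, fp))
```

In this made‐up example the first predicted exon is shifted by five bases, so both measures drop below 1 even though the second exon is predicted exactly.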

Literature Cited

   Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
   Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78‐94.
   Burge, C.B. and Karlin, S. 1998. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8:346‐354.
   Burset, M. and Guigó, R. 1996. Evaluation of gene structure prediction programs. Genomics 34:353‐367.
   Chothia, C. and Lesk, A.M. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823‐826.
   Claverie, J.M. 1997a. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6:1735‐1744.
   Claverie, J.M. 1997b. Exon detection by similarity searches. Methods Mol. Biol. 68:283‐313.
   Claverie, J.M. 1998. Computational methods for exon detection. Mol. Biotechnol. 10:27‐48.
   Everett, L.A., Glaser, B., Beck, J.C., Idol, J.R., Buchs, A., Heyman, M., Adawi, F., Hazani, E., Nassir, E., Baxevanis, A.D., Sheffield, V.C., and Green, E.D. 1997. Pendred syndrome is caused by mutation in a putative sulphate transporter gene (PDS). Nature Genet. 17:411‐422.
   Gelfand, M.S., Mironov, A.A., and Pevzner, P.A. 1996. Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 93:9061‐9066.
   Guigó, R. 1997. Computational gene identification. J. Mol. Med. 75:389‐393.
   Guigó, R., Knudsen, S., Drake, N., and Smith, T. 1992. Prediction of gene structure. J. Mol. Biol. 226:141‐157.
   Harris, N.L. 1997. Genotator: A workbench for sequence annotation. Genome Res. 7:754‐762.
   Krogh, A. 1997. Two methods for improving performance of an HMM and their application for gene finding. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (T. Gaasterland, P. Karp, K. Karplus, C. Ouzounis, C. Sander, and A. Valencia, eds.), pp. 179‐186. AAAI Press, Menlo Park, Calif.
   Kuehl, P., Weisemann, J., Touchman, J., Green, E., and Boguski, M. 1999. An effective approach for analyzing “prefinished” genomic sequence data. Genome Res. 9:189‐194.
   Liu, A.Y., Torchia, B.S., Migeon, B.R., and Siliciano, R.F. 1997. The human NTT gene: Identification of a novel 17‐kb noncoding nuclear RNA expressed in activated CD4+ T cells. Genomics 39:171‐184.
   Makalowska, I., Ryan, J., and Baxevanis, A. 1999. GeneMachine: A unified solution for performing content‐based, site‐based, and comparative gene prediction methods. 12th Cold Spring Harbor Meeting on Genome Mapping, Sequencing, and Biology, Cold Spring Harbor, N.Y.
   Mural, R.J., Einstein, J.R., Guan, X., Mann, R.C., and Uberbacher, E.C. 1992. An artificial intelligence approach to DNA sequence feature recognition. Trends Biotech. 10:67‐69.
   Pearson, W.R., Wood, T., Zhang, Z., and Miller, W. 1997. Comparison of DNA sequences with protein sequences. Genomics 46:24‐36.
   Rogic, S., Mackworth, A., and Ouellette, B.F.F. 2001. Evaluation of Gene‐Finding Programs. In press.
   Snyder, E.E. and Stormo, G.D. 1993. Identification of coding regions in genomic DNA sequences: An application of dynamic programming and neural networks. Nucl. Acids Res. 21:607‐613.
   Snyder, E.E. and Stormo, G.D. 1997. Identifying genes in genomic DNA sequences. In DNA and Protein Sequence Analysis (M.J. Bishop and C.J. Rawlings, eds.), pp. 209‐224. Oxford University Press, New York.
   Solovyev, V.V., Salamov, A.A., and Lawrence, C.B. 1994a. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucl. Acids Res. 22:5156‐5163.
   Solovyev, V.V., Salamov, A.A., and Lawrence, C.B. 1994b. The prediction of human exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. ISMB 2:354‐362.
   Solovyev, V.V., Salamov, A.A., and Lawrence, C.B. 1995. Identification of human gene structure using linear discriminant functions and dynamic programming. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer, and S. Wodak, eds.), pp. 367‐375.
   Uberbacher, E.C. and Mural, R.J. 1991. Locating protein‐coding regions in human DNA sequences by a multiple sensor–neural network approach. Proc. Natl. Acad. Sci. U.S.A. 88:11261‐11265.
   Wevrick, R., Kerns, J.A., and Francke, U. 1996. The IPW gene is imprinted and is not expressed in the Prader‐Willi syndrome. Acta Genet. Med. Gemellol. 45:191‐197.
   Zhang, M.Q. 1997. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl. Acad. Sci. U.S.A. 94:565‐568.
   Zhang, J. and Madden, T.L. 1997. PowerBLAST: A new network BLAST application for interactive or automated sequence analysis and annotation. Genome Res. 7:649‐656.