An Overview of Gene Identification: Approaches, Strategies, and Considerations

Andreas D. Baxevanis

National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland

Unit Number: Unit 4.1
DOI: 10.1002/0471250953.bi0401s6
Online Posting Date: September 2004
GO TO THE FULL TEXT:
PDF or HTML at Wiley Online Library

Abstract

With the completion of the sequencing of the human genome in April 2003, modern biology is officially ushering in a new era in science. Although often erroneously called the “post-genome era,” this milestone actually marks the beginning of the “genome era,” a time in which the availability of sequence data for many genomes will significantly affect how science is performed in the 21st century. This unit surveys many of the gene prediction methods that are currently available and offers a general assessment of how well they work for various problems.

     
 

Table of Contents

  • Unit Introduction
  • Remembering Biology in Deducing Gene Structure
  • Categorizing the Methods
  • How Well do the Methods Work?
  • Strategies and Considerations
  • Future Directions
  • Acknowledgments
  • Literature Cited
  • Figures
  • Tables
     
 

Figures

  • Figure 4.1.1
    The central dogma of molecular biology. Proceeding from the DNA through the RNA to the protein level, various sequence features and modifications can be identified and used in the computational deduction of gene structure. These include promoter and regulatory regions, intron-exon boundaries, and both start and stop signals. Unfortunately, these signals are not always present, and when present may not always be in the same form or context. See the text for greater detail.

  • Figure 4.1.2
    Sensitivity vs. specificity. In the upper portion of the figure, the four possible outcomes of a prediction are shown: a true positive (TP), a true negative (TN), a false positive (FP), and a false negative (FN). The matrix at the bottom of the figure shows how both sensitivity and specificity are determined from these four possible outcomes, giving a tangible measure of the effectiveness of any gene prediction method. (Figure adapted from Burset and Guigó, 1996 and Snyder and Stormo, 1997.)

  • Figure 4.1.3
    Annotated output from GeneMachine showing the results of multiple gene prediction program runs. NCBI Sequin is used as the viewer. At the top of the output are shown the results from various BLAST runs (BLASTN vs. dbEST, BLASTN vs. nr, and BLASTX vs. SWISS-PROT). Towards the bottom of the window are shown the results from the predictive methods (FGENES, GENSCAN, MZEF, and GRAIL 2). Annotations indicating the strength of the prediction are preserved and shown wherever possible within the viewer. Putative regions of high interest would be areas where hits from the BLAST runs line up with exon predictions from the gene prediction programs.
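
The measures described in the caption of Figure 4.1.2 can be illustrated with a short sketch. The caption itself does not give the formulas, so the definitions below are assumptions: sensitivity is taken as TP / (TP + FN), and specificity is shown in two forms, TP / (TP + FP) as it is commonly defined in the gene-finding evaluation literature, and the classical statistical form TN / (TN + FP). The function names and the per-nucleotide toy labels are hypothetical and serve only as illustration.

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[bool],
                     predicted: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for per-nucleotide labels (True = coding)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # coding, predicted coding
        elif not a and not p:
            tn += 1          # noncoding, predicted noncoding
        elif p:
            fp += 1          # noncoding, predicted coding
        else:
            fn += 1          # coding, predicted noncoding
    return tp, tn, fp, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, refusing the undefined case of an empty denominator."""
    if denominator == 0:
        raise ValueError("measure is undefined: denominator is zero")
    return numerator / denominator


def sensitivity(tp: int, fn: int) -> float:
    """Assumed definition: fraction of true coding positions that were predicted."""
    return _ratio(tp, tp + fn)


def specificity_gene_finding(tp: int, fp: int) -> float:
    """Assumed gene-finding convention: fraction of predicted positions that are real."""
    return _ratio(tp, tp + fp)


def specificity_classical(tn: int, fp: int) -> float:
    """Classical statistical form: fraction of noncoding positions left unpredicted."""
    return _ratio(tn, tn + fp)


if __name__ == "__main__":
    # Hypothetical toy data: 1 = coding nucleotide, 0 = noncoding nucleotide.
    actual = [bool(x) for x in (1, 1, 1, 0, 0, 0, 0, 1)]
    predicted = [bool(x) for x in (1, 1, 0, 0, 1, 0, 0, 1)]
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    print("TP, TN, FP, FN =", tp, tn, fp, fn)
    print("sensitivity =", sensitivity(tp, fn))
    print("specificity (gene-finding form) =", specificity_gene_finding(tp, fp))
    print("specificity (classical form) =", specificity_classical(tn, fp))
```

The two specificity forms are kept separate because they answer different questions. In genomic sequence, noncoding positions usually far outnumber coding ones, so the classical form tends to sit close to 1 regardless of the method, which is one reason the TP / (TP + FP) form is often preferred when comparing gene predictors.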

Literature Cited

    Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389-3402.
    Burset, M. and Guigó, R. 1996. Evaluation of gene structure prediction programs. Genomics 34:353-367.
    Chothia, C. and Lesk, A.M. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823-826.
    Claverie, J.M. 1997a. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6:1735-1744.
    Claverie, J.M. 1997b. Exon detection by similarity searches. Methods Mol. Biol. 68:283-313.
    Claverie, J.M. 1998. Computational methods for exon detection. Mol. Biotechnol. 10:27-48.
    Davuluri, R.V., Grosse, I., and Zhang, M.Q. 2002. Computational identification of promoters and first exons in the human genome. Nature Genetics 29:412-417.
    Guigó, R. 1997. Computational gene identification. J. Mol. Med. 75:389-393.
    Guigó, R., Knudsen, S., Drake, N., and Smith, T. 1992. Prediction of gene structure. J. Mol. Biol. 226:141-157.
    Harris, N.L. 1997. Genotator: A workbench for sequence annotation. Genome Res. 7:754-762.
    International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
    Kuehl, P., Weisemann, J., Touchman, J., Green, E., and Boguski, M. 1999. An effective approach for analyzing “prefinished” genomic sequence data. Genome Res. 9:189-194.
    Liu, A.Y., Torchia, B.S., Migeon, B.R., and Siliciano, R.F. 1997. The human NTT gene: Identification of a novel 17-kb noncoding nuclear RNA expressed in activated CD4+ T cells. Genomics 39:171-184.
    Makalowska, I., Ryan, J., and Baxevanis, A. 1999. GeneMachine: A unified solution for performing content-based, site-based, and comparative gene prediction methods. 12th Cold Spring Harbor Meeting on Genome Mapping, Sequencing and Biology, Cold Spring Harbor, NY.
    Makalowska, I., Sood, R., Faruque, M.U., Hu, P., Eddings, E.M., Mestre, J.D., Baxevanis, A.D., and Carpten, J.D. 2002. Identification of six novel genes by experimental validation of GeneMachine-predicted genes. Gene 284:203-213.
    Pearson, W.R., Wood, T., Zhang, Z., and Miller, W. 1997. Comparison of DNA sequences with protein sequences. Genomics 46:24-36.
    Rogic, S., Mackworth, A., and Ouellette, B.F.F. 2001. Evaluation of gene-finding programs. Genome Res. 11:817-832.
    Snyder, E.E. and Stormo, G.D. 1993. Identification of coding regions in genomic DNA sequences: An application of dynamic programming and neural networks. Nucl. Acids Res. 21:607-613.
    Snyder, E.E. and Stormo, G.D. 1997. Identifying genes in genomic DNA sequences. In DNA and Protein Sequence Analysis (M.J. Bishop and C.J. Rawlings, eds.) pp. 209-224. Oxford University Press, New York.
    Stormo, G.D. 2000. Gene-finding approaches for eukaryotes. Genome Res. 10:511-515.
    Wevrick, R., Kerns, J.A., and Francke, U. 1996. The IPW gene is imprinted and is not expressed in the Prader-Willi syndrome. Acta Genet. Med. Gemellol. 45:191-197.
    Zhang, J. and Madden, T.L. 1997. PowerBLAST: A new network BLAST application for interactive or automated sequence analysis and annotation. Genome Res. 7:649-656.
     
 