Gene Identification: Methods and Considerations

Andreas D. Baxevanis¹

¹ National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
Publication Name: Current Protocols in Human Genetics
Unit Number: Unit 6.6
DOI: 10.1002/0471142905.hg0606s29
Online Posting Date: August 2001

Abstract

This unit introduces readers to some of the more commonly used techniques for gene identification. The author discusses the general problem of accurately predicting genes in both prefinished and finished sequence data, provides a hands-on description of programs available in the public domain, and suggests strategies for tackling the prediction problem at various stages of data generation and assembly.

     
 
Table of Contents

  • Gene Identification Methods
  • How Well do the Methods Work?
  • Strategies and Considerations
  • Literature Cited
  • Figures
  • Tables
     
 
