Gene Identification in Prokaryotic Genomes, Phages, Metagenomes, and EST Sequences with GeneMarkS Suite

Mark Borodovsky¹, Alex Lomsadze¹

¹ Georgia Institute of Technology, Atlanta, Georgia
Publication Name: Current Protocols in Bioinformatics
Unit Number: Unit 4.5
DOI: 10.1002/0471250953.bi0405s35
Online Posting Date: September 2011
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


This unit describes how to use several gene‐finding programs from the GeneMark line. These programs were developed to find protein‐coding ORFs in genomic DNA of prokaryotic species, in genomic DNA of eukaryotic species with intronless genes, in genomes of viruses and phages, in prokaryotic metagenomic sequences, and in EST sequences with spliced‐out introns. The tools have been shown to achieve state‐of‐the‐art accuracy and are frequently used to annotate genes in novel nucleotide sequences. An additional advantage is that algorithm parameterization requires no manual input: parameters are estimated automatically by iterative self‐training (unsupervised training).

Curr. Protoc. Bioinform. 35:4.5.1‐4.5.17. © 2011 by John Wiley & Sons, Inc.

Keywords: gene finding; hidden Markov model; unsupervised parameter estimation

Table of Contents

  • Introduction
  • Basic Protocol 1: Using GeneMarkS
  • Basic Protocol 2: Using GeneMark.hmm for Prokaryotic Gene Prediction
  • Basic Protocol 3: Using GeneMark for Prokaryotic Gene Prediction
  • Basic Protocol 4: Using the Heuristic Approach for Prokaryotic Model Building
  • Basic Protocol 5: Using MetaGeneMark for Finding Genes in Metagenomes
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
