Computer Manipulation of DNA and Protein Sequences

J. Michael Cherry1

1 Stanford University, Palo Alto, California
Publication Name:  Current Protocols in Molecular Biology
Unit Number:  Unit 7.7
DOI:  10.1002/0471142727.mb0707s30
Online Posting Date:  May, 2001
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


This unit outlines a variety of methods by which DNA sequences can be manipulated by computer. Procedures for entering sequence data into the computer and assembling raw sequence data into a contiguous sequence are described first, followed by methods of analyzing and manipulating sequences, e.g., verifying sequences, constructing restriction maps, designing oligonucleotides, identifying protein‐coding regions, and predicting secondary structures. This unit also provides information on the large amount of software available for sequence analysis. The appendix to this unit lists some of the commercial software, shareware, and free software related to DNA sequence manipulation. The goal of this unit is to serve as a starting point for researchers interested in utilizing the tremendous sequencing resources available to the computer‐knowledgeable molecular biology laboratory.
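
As an illustration of one manipulation named above, locating restriction sites is the first step in constructing a restriction map. The following is a minimal sketch in Python, not code from the unit: the function names `find_sites` and `reverse_complement` and the demonstration sequence are hypothetical, and only the well-known EcoRI (GAATTC) and BamHI (GGATCC) recognition sequences are assumed.

```python
def find_sites(sequence, site):
    """Return the 1-based start position of every occurrence of site."""
    sequence = sequence.upper()
    site = site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)              # report 1-based coordinates
        start = sequence.find(site, start + 1)   # step by one to allow overlaps
    return positions


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return sequence.translate(complement)[::-1]


# Hypothetical demonstration sequence, not taken from the unit.
demo = "ttGAATTCaaggatccGAATTCgg"
ecori_sites = find_sites(demo, "GAATTC")
bamhi_sites = find_sites(demo, "GGATCC")
print("EcoRI:", ecori_sites)
print("BamHI:", bamhi_sites)
```

Both recognition sequences are palindromic, so scanning one strand is enough here; for a non-palindromic site, the reverse complement of the site would also have to be searched.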


Table of Contents

  • Sequence Data Entry
  • Sequence Data Verification
  • Restriction Mapping
  • Prediction of Nucleic Acid Structure
  • Oligonucleotide Design Strategy
  • Identification of Protein‐Coding Regions
  • Homology Searching
  • Genetic Sequence Databases and Other Electronic Resources Available to Molecular Biologists
  • Literature Cited
  • Figures
  • Tables


  •   Figure 7.7.1 Commonly used sequence file formats having specific defined elements and defining codes. (A) EMBL comment lines begin with two‐letter codes: ID, short sequence name; DE, description; and SQ, sequence length. DNA or protein sequence follows; sequence end is denoted by two slashes (//) on a separate line. (B) GenBank comments precede the sequence and are separated from it by the code “ORIGIN”. Sequence end is denoted by two slashes on a separate line. The actual text of this entry has been abbreviated; see Fig. 19.2.3 for a more complete example of a GenBank file. (C) GCG comments precede the sequence and are separated from it by two dots (..). (D) Intelligenetics comment lines begin with semicolons (;). A single description line follows, and then the sequence begins on a separate line. Sequence end is denoted by a numeral one (1). (E) NBRF (also called PIR format) first line starts with four required characters: a greater‐than sign (>); either “D” for DNA or “P” for protein; either “L” for linear or “C” for circular; and a semicolon. The short sequence name follows on the same line. The next line is a description line. Sequence starts on a new line and its end is denoted by an asterisk (*). (F) DNA Strider Text is similar to the Intelligenetics format, but lacks the description line. (G) FASTA (sometimes called Pearson format) first line begins with a greater‐than sign (>), followed by the sequence name and a short description. Sequence data then starts on a separate line. Note: Some formats (including GenBank, GCG, and NBRF) allow numbers to be included within the sequence for ease of reading (the numbers are ignored during sequence analysis).
  •   Figure 7.7.2 Multiple sequence editor. The GCG program GELASSEMBLE displays the aligned sequences on the top of the screen and a schematic of the sequenced fragments on the bottom. Arrows indicate the direction of sequencing; the asterisk in the lower part of the display indicates the position of the cursor in the sequence alignment as the user edits the sequence.
  •   Figure 7.7.3 One type of graphical restriction map. This figure was produced by the free PlotZ program; GCG MAPPLOT produces similar output.
  •   Figure 7.7.4 (A) Text‐based output from Zuker's RNA‐folding program, available from GCG under the name of FOLD. This type of representation is difficult to visualize, but acceptable when only a quick view of the possible folded structures is desired. (B) Graphic representation of the structure shown in part A, produced by the GCG Squiggles program. The free LoopViewer program (for Macintosh) produces similar representations.
  •   Figure 7.7.5 Text of a message sent to the EBI FASTA mail server (e‐mail address: ). This message requests that the sequence be searched against the Other Mammalian section of the EMBL database. The answer will include the top 100 matching sequences and alignments of the top 20 matching sequences.
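
The format rules given in the caption of Figure 7.7.1 can be turned into small readers. The following is a minimal sketch in Python covering only the FASTA and NBRF layouts as described in that caption; the function names (`clean`, `read_fasta`, `read_nbrf`) and the sample records are hypothetical, and real files may contain variations that this sketch does not handle.

```python
def clean(lines):
    """Join sequence lines, keeping letters only.

    Whitespace and embedded position numbers are dropped, since the
    caption notes that such numbers are ignored during analysis.
    """
    return "".join(ch for line in lines for ch in line if ch.isalpha())


def read_fasta(text):
    """Parse one FASTA record: '>' + name + short description, then sequence."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with '>'")
    name, _, description = lines[0][1:].strip().partition(" ")
    return name, description, clean(lines[1:])


def read_nbrf(text):
    """Parse one NBRF record as laid out in the caption.

    Line 1: '>', D or P, L or C, ';', then the short name.
    Line 2: description. Sequence follows and ends at an asterisk.
    """
    lines = text.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("NBRF record needs a header and a description line")
    first = lines[0]
    if len(first) < 4 or first[0] != ">" or first[3] != ";":
        raise ValueError("NBRF header must start with '>', two codes, and ';'")
    name = first[4:].strip()
    description = lines[1].strip()
    body = "\n".join(lines[2:]).split("*", 1)[0]   # stop at the terminator
    return name, description, clean(body.splitlines())


# Hypothetical sample records, not taken from the figure.
fasta_text = ">seq1 demo fragment\nACGT ACGT\nGGCC\n"
nbrf_text = ">DL;seq1\ndemo fragment\n1 ACGT ACGT\n9 GGCC*\n"
print(read_fasta(fasta_text))
print(read_nbrf(nbrf_text))
```

Keeping only alphabetic characters is a deliberate simplification: it discards numbering and spacing in one step, but it would also discard gap or stop symbols if a file used them.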


Literature Cited

   Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
   Church, G.M. and Kieffer‐Higgins, S. 1988. Multiplex DNA sequencing. Science 240:185‐188.
   Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351‐360.
   Fickett, J. 1982. Recognition of protein coding regions in DNA sequences. Nucl. Acids Res. 10:5303‐5318.
   Freier, S.M., Kierzek, R., Jaeger, J.A., Sugimoto, N., Caruthers, M.H., Neilson, T., and Turner, D.H. 1986. Improved free‐energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. U.S.A. 83:9373‐9377.
   Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256:1443‐1445.
   Henikoff, S. and Henikoff, J.G. 1993. Performance evaluation of amino acid substitution matrices. Proteins 17:49‐61.
   Higgins, D.G. and Sharp, P.M. 1988. Clustal: A package for performing multiple sequence alignment on a microcomputer. Gene 73:237‐244.
   Higgins, D.G. and Sharp, P.M. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. Comp. App. Biosci. 5:151‐153.
   Karlin, S. and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87:2264‐2268.
   Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443‐453.
   Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444‐2448.
   Schuler, G.D., Altschul, S.F., and Lipman, D.J. 1991. A workbench for multiple alignment construction and analysis. Proteins Struct. Funct. Genet. 9:180‐190.
   Schwartz, R.M. and Dayhoff, M.O. (eds.) 1978. Matrices for Detecting Distant Relationships: Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C.
   Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195‐197.
   Turner, D.H., Sugimoto, N., Jaeger, J.A., Longfellow, C.E., Freier, S.M., and Kierzek, R. 1987. Improved parameters for prediction of RNA structure. Cold Spring Harbor Symp. Quant. Biol. 52:123‐133.
   Turner, D.H., Sugimoto, N., Freier, S.M., and Kierzek, R. 1988. RNA structure prediction. Annu. Rev. Biophys. Chem. 17:167‐192.
   Wilbur, W.J. and Lipman, D.J. 1983. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80:726‐730.
   Zuker, M. 1989a. On finding all suboptimal foldings of an RNA molecule. Science 244:48‐52.
   Zuker, M. 1989b. The use of dynamic programming algorithms in RNA secondary structure prediction. In Mathematical Methods for DNA Sequences (M.S. Waterman, ed.) p. 159‐184. CRC Press, Boca Raton, Fla.