Searching NCBI Databases Using Entrez

Gretchen Gibney1, Andreas D. Baxevanis1

1 Bethesda, Maryland
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 1.3
DOI:  10.1002/0471250953.bi0103s34
Online Posting Date:  June, 2011

Abstract

One of the most widely used interfaces for the retrieval of information from biological databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are pre‐existing, logical relationships between the individual entries found in numerous public databases. The existence of such natural connections, mostly biological in nature, argued for the development of a method through which all the information about a particular biological entity could be found without having to sequentially visit and query disparate databases. Two basic protocols describe simple, text‐based searches, illustrating the types of information that can be retrieved through the Entrez system. An alternate protocol builds upon the first basic protocol, using additional, built‐in features of the Entrez system, and providing alternative ways to issue the initial query. The support protocol reviews how to save frequently issued queries. Finally, Cn3D, a structure visualization tool, is also discussed. Curr. Protoc. Bioinform. 34:1.3.1‐1.3.25. © 2011 by John Wiley & Sons, Inc.

Keywords: Entrez; NCBI databases; biological databases; integrated information retrieval

     
 

Table of Contents

  • Introduction
  • Basic Protocol 1: Querying Entrez
  • Support Protocol 1: Using My NCBI to Save Searches and Results
  • Alternate Protocol 1: Combining Entrez Queries
  • Basic Protocol 2: Examining Structures in Entrez
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 


Figures

  • Figure 1.3.1 The Entrez unified results page, showing the number of hits to each of Entrez's component databases fitting the query. Clicking on any of the numbers to the left of the database name takes the user to the results found in that particular database.
  • Figure 1.3.2 Results of a text‐based Entrez query using Boolean operators against PubMed. The initial query (from Fig. ) is shown in the search box near the top of the window. Each entry gives the title of the paper, names of the authors, and the citation information. An individual record can be viewed by clicking on the hyperlinked title of that paper.
  • Figure 1.3.3 An example of a PubMed record in Abstract format, as returned through Entrez. This Abstract view is for the fourth reference shown in Figure . The view provides connections to related articles, sequence information, and the actual, full‐text journal article. See text for details.
  • Figure 1.3.4 Related citations for an entry found in PubMed. The original entry from Figure (Cho et al., ) is at the top of the list, indicating that this is the parent entry.
  • Figure 1.3.5 The Entrez Gene page for the DCC (deleted in colorectal carcinoma) gene. The screen shows that this is a protein‐coding gene and provides information on the genomic context of DCC and the encoded protein. An extensive collection of links to other NCBI and external databases is provided along the right‐hand side of the window. See text for details.
  • Figure 1.3.6 The dbSNP GeneView page for the DCC gene. The information on individual SNPs is shown in the table towards the bottom of the screen. Each SNP occupies two lines of the table, with one line showing the “contig reference” (the more common allele) and the other showing the SNP (the less common allele). For example, the first two rows in the table show a contig reference A for which there is a documented SNP, changing the A to a G. At the protein level, this changes the amino acid at position 3 of the DCC protein from asparagine to serine. The rows are colored red since this is a “nonsynonymous SNP;” that is, the SNP produces a discrete change at the amino acid level. In contrast, the fifth and sixth rows of the table are shown in green, indicating that this record is for a “synonymous SNP;” the entries describe a SNP where the contig reference (T) and the SNP allele (C) ultimately produce the same amino acid (Asp).
  • Figure 1.3.7 The RefSeq protein entry corresponding to the original Cho et al. () publication shown in Figure , in GenPept format. See text for details.
  • Figure 1.3.8 The OMIM entry for the DCC gene. Each entry includes information such as the gene symbol, alternate names for the disease, a description of the disease, a clinical synopsis, and references.
  • Figure 1.3.9 An example of a list of allelic variants that can be obtained through OMIM. The figure shows the four allelic variants for the DCC gene, two leading to cancers of the digestive tract and two that are associated with a movement disorder. The description under each allelic variant provides information specific to that particular mutation.
  • Figure 1.3.10 Gene Expression Omnibus (GEO) DataSets for the DCC gene. For each DataSet, a brief description of the experiment is provided, as well as a schematic of the gene expression profile derived in the study.
  • Figure 1.3.11 The MedlinePlus page devoted to information for both laymen and physicians on DCC and disorders related to DCC. The information available through this page is often much more appropriate to provide to patients, since the level of writing is geared towards nonprofessionals. Often, MedlinePlus entries include interactive tutorials for various procedures related to the disease of interest.
  • Figure 1.3.12 The ClinicalTrials.gov page showing actively recruiting clinical trials relating to colorectal neoplasms. Information on each trial, including the principal investigator of the trial and qualification criteria, can be found by clicking on the name of the trial.
  • Figure 1.3.13 Searches saved through My NCBI can be recalled, viewed, and updated through the Saved Searches option under the My Saved Data on the user's My NCBI page. See text for details.
  • Figure 1.3.14 Formulating a search against the nucleotide portion of Entrez. The initial query is shown in the text box near the top of the window (DNA‐binding), and the nucleotide entries matching the query are displayed below. See text for details.
  • Figure 1.3.15 Using the Limits feature of Entrez to limit a search to a particular organism. See text for details.
  • Figure 1.3.16 Results of a limited search against the nucleotide portion of Entrez. The initial query is shown in the text box near the top of the window (methanothermobacter), and the nucleotide entries matching the query are displayed below. Note the caution (!) icon next to the words Limits Activated at the top of the results page, indicating that the results displayed have been “limited,” here to a particular organism (Fig. ). See text for details.
  • Figure 1.3.17 Combining individual queries using the Advanced Search feature of Entrez. Each search performed in the last 8 hr is saved and given a number in Search History. The searches can be combined using the search numbers and the Boolean operators AND, OR, or NOT. See text for details.
  • Figure 1.3.18 Entries resulting from the combination of two individual Entrez queries. The query term producing the results is shown in the Search Box near the top of the window (#17 AND #18). The numbers correspond to those assigned to the previously performed searches listed in Figure . See text for details.
  • Figure 1.3.19 The structure summary for 1HMF, resulting from a direct query of the structures accessible through the Entrez system. The entry shows header information from the corresponding MMDB entry, links to PubMed, and links to the taxonomy of the source organism. Structure neighbors, as assessed by VAST, can be found by clicking on the long bar (purple on screen) next to the Protein key. The structure itself can be viewed by clicking on the Structure View in Cn3D button, thereby spawning the Cn3D viewer.
  • Figure 1.3.20 The structure of 1HMF rendered using Cn3D version 4.1, an interactive molecular viewer. Cn3D can be used as a helper application to any Web browser or as a stand‐alone application. In panel A, the backbone of the structure is shown as a worm, with the coloring indicating secondary structural regions; in this case, there are three α‐helices, shown in green, with a “crayon” indicating the length and directionality of the helix. Four residues have been highlighted in the sequence window, and those residues are shown in yellow in the structure window. In panel B, the rendering of the structure has been changed, showing the structure in space‐filling style, with the coloring being done by charge (red, negative; blue, positive). For both panels, the coloring shown in the structure window is mirrored in the sequence window below. See text for details.
  • Figure 1.3.21 Changing the rendering and coloring of selected parts of a structure. The Style Options window also allows individual residues to be numbered and the dimensions of side chains and other features to be changed. See text for details.
  • Figure 1.3.22 An overview of the relationships in the Entrez integrated information retrieval system. Each node represents one of the elements that can be accessed through Entrez, and the lines represent how each component database connects to the others. Entrez is under continuous evolution, with new components being added and the interrelationships between the elements changing dynamically. (Figure from The Entrez Search and Retrieval System, The NCBI Handbook; see Internet Resources.) A Flash‐based version of this figure can be found at http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html.
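
The query-building ideas in the captions for Figures 1.3.14 through 1.3.18 (a text query, a limit to one organism, and Boolean combination of numbered searches) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the unit, which works through the Entrez Web interface. The helper names (limit_to_organism, combine, esearch_url) are hypothetical, the organism limit is written with the [Organism] field tag rather than the Limits page, and the request URL targets the public E-utilities ESearch endpoint, which the unit itself does not cover. No request is sent; the code only builds strings.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint. The unit itself
# describes only the Web interface, so this URL is illustrative.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def _wrap(expr):
    """Parenthesize an expression that contains spaces so it groups as one unit."""
    return f"({expr})" if " " in expr else expr


def limit_to_organism(term, organism):
    """Restrict a text query to one organism via the [Organism] field tag."""
    return f'{_wrap(term)} AND "{organism}"[Organism]'


def combine(left, right, operator="AND"):
    """Join two query expressions (or history numbers such as #17) with AND, OR, or NOT."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f"{_wrap(left)} {operator} {_wrap(right)}"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for the given database."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


# Illustrative values: the query text and organism name are borrowed from the
# figure captions, but pairing them this way is an assumption.
query = limit_to_organism("DNA-binding", "Methanothermobacter")
history_query = combine("#17", "#18")       # as typed into the Web search box
url = esearch_url("nuccore", query)         # "nuccore" is the E-utilities name for Nucleotide

print(query)
print(history_query)
print(url)
```

History numbers such as #17 are meaningful only within the Web session that produced them, so history_query is intended for the Entrez search box and is deliberately not passed to esearch_url.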


Literature Cited

   Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
   Barrett, T., Suzek, T.O., Troup, D.B., Wilhite, S.E., Ngau, W.C., Ledoux, P., Rudnev, D., Lash, A.E., Fujibuchi, W., and Edgar, R. 2005. NCBI GEO: Mining millions of expression profiles—database and tools. Nucleic Acids Res. 33:D562‐D566.
   Cho, K.R., Oliner, J.D., Simons, J.W., Hedrick, L., Fearon, E.R., Preisinger, A.C., Hedge, P., Silverman, G.A., and Vogelstein, B. 1994. The DCC gene: Structural analysis and mutations in colorectal carcinomas. Genomics 19:525‐531.
   Gibrat, J.‐F., Madej, T., and Bryant, S. 1996. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6:377‐385.
   Hamosh, A., Scott, A.F., Amberger, J., Bocchini, C., Valle, D., and McKusick, V.A. 2002. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30:52‐55.
   Madej, T., Gibrat, J.‐F., and Bryant, S. 1995. Threading a database of protein cores. Proteins 23:356‐369.
   McKusick, V.A. 1998. Online Mendelian inheritance in man: A catalog of human genes and genetic disorders, 12th Edition. The Johns Hopkins University Press, Baltimore, Maryland.
   Mullikin, J.C. and Sherry, S.T. 2005. Sequence polymorphisms. In Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd Edition (A.D. Baxevanis and B.F.F. Ouellette, eds.) pp. 171‐193. John Wiley & Sons, Hoboken, New Jersey.
   Wilbur, W. and Coffee, L. 1994. The effectiveness of document neighboring in search enhancement. Inf. Process Manage. 30:253‐266.
   Wilbur, W. and Yang, Y. 1996. An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Comput. Biol. Med. 26:209‐222.

Internet Resources

   http://www.ncbi.nlm.nih.gov
   NCBI Home page.
   http://www.ncbi.nlm.nih.gov/Entrez
   NCBI Entrez Web page.
   http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
   NCBI Cn3D structure viewer.
   http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.ch15
   Ostell, J. 2003. The Entrez Search and Retrieval System. The NCBI Handbook, Chapter 15. National Center for Biotechnology Information, Bethesda, Maryland.
   http://www.ncbi.nlm.nih.gov/projects/geo/info/overview.html
   NCBI GEO Overview.
   http://www.ncbi.nlm.nih.gov/RefSeq/
   NCBI Reference Sequence (RefSeq) Project.