Dealing with the Data Deluge: Handling the Multitude of Chemical Biology Data Sources

Rajarshi Guha1, Dac‐Trung Nguyen1, Noel Southall1, Ajit Jadhav1

1 NIH Center for Advancing Translational Science, Rockville, Maryland
Publication Name: Current Protocols in Chemical Biology
Unit Number:
DOI: 10.1002/9780470559277.ch110262
Online Posting Date: September, 2012
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Over the last 20 years, there has been an explosion in the amount and type of biological and chemical data that has been made publicly available in a variety of online databases. While this means that vast amounts of information can be found online, there is no guarantee that it can be found easily (or at all). A scientist searching for a specific piece of information is faced with a daunting task—many databases have overlapping content, use their own identifiers and, in some cases, have arcane and unintuitive user interfaces. In this overview, a variety of well‐known data sources for chemical and biological information are highlighted, focusing on those most useful for chemical biology research. The issue of using data from multiple sources and the associated problems such as identifier disambiguation are highlighted. A brief discussion is then provided on Tripod, a recently developed platform that supports the integration of arbitrary data sources, providing users a simple interface to search across a federated collection of resources. Curr. Protoc. Chem. Biol. 4:193‐209 © 2012 by John Wiley & Sons, Inc.

Keywords: database; federation; integration; PubChem; HTS


Table of Contents

  • Data Sources in the Life Sciences
  • Databases for Chemical Biology
  • Challenges in Dealing with Multiple Databases
  • Integration Vs. Federation
  • The Tripod Platform for Integrated Browsing
  • Summary
  • Literature Cited
  • Figures
  • Tables



Internet Resources
  The Tripod Web site.
  NPC Browser.
  Fragment Activity Profiler.
  NCGC chemical structure standardizer.
  Gene Literature Novelty Score.
  NCI Chemical Structure Lookup Service.