Protein Structural Domains: Definition and Prediction

Iakes Ezkurdia¹, Michael L. Tress¹

¹ Spanish National Cancer Research Centre (CNIO), Structural Biology and Biocomputing Programme, Madrid, Spain
Publication Name:  Current Protocols in Protein Science
Unit Number:  Unit 2.14
DOI:  10.1002/0471140864.ps0214s66
Online Posting Date:  November, 2011


Recognition and prediction of structural domains in proteins is an important part of structure and function prediction. This unit lists the range of tools available for domain prediction, and describes sequence and structural analysis tools that complement domain prediction methods. Also detailed are the basic domain prediction steps, along with suggested strategies for different protein sequences and potential pitfalls in domain boundary prediction. The difficult problem of domain orientation prediction is also discussed. All the resources necessary for domain boundary prediction are accessible via publicly available Web servers and databases and do not require computational expertise. Curr. Protoc. Protein Sci. 66:2.14.1‐2.14.16. © 2011 by John Wiley & Sons, Inc.

Keywords: structural domains; domain parsing; homology modeling; ab initio predictions; functional domains


Table of Contents

  • Introduction
  • What are Structural Domains?
  • How Structural Domains are Defined
  • Predicting Structural Domains
  • Initial Steps in Identifying Protein Domains
  • Methods for Domain Prediction
  • Evaluating Domain Predictors
  • Domain‐Domain Interactions
  • Potential Problems
  • Literature Cited
  • Figures
  • Tables

Literature Cited
   Alexandrov, N. and Shindyalov, I. 2003. PDP: Protein domain parser. Bioinformatics 19:429‐430.
   Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
   Attwood, T.K. 2002. The PRINTS database: A resource for identification of protein families. Briefings Bioinformat. 3:252‐263.
   Bendtsen, J.D., Nielsen, H., von Heijne, G., and Brunak, S. 2004. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340:783‐795.
   Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Res. 28:235‐242.
   Bernsel, A., Viklund, H., Hennerdal, A., and Elofsson, A. 2009. TOPCONS: Consensus prediction of membrane protein topology. Nucleic Acids Res. 37:W465‐W468.
   Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., and Schneider, M. 2003. The SWISS‐PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31:365‐370.
   Bryson, K., McGuffin, L.J., Marsden, R.L., Ward, J.J., Sodhi, J.S., and Jones, D.T. 2005. Protein structure prediction servers at University College London. Nucleic Acids Res. 33:W36‐W38.
   Cheng, J. 2007. DOMAC: An accurate, hybrid protein domain prediction server. Nucleic Acids Res. 35:W354‐W356.
   Cheng, J., Sweredoski, M.J., and Baldi, P. 2006. DOMpro: Protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Mining Knowl. Disc. 13:1‐10.
   Chivian, D., Kim, D.E., Malmström, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl, C.A., and Baker, D. 2003. Automated prediction of CASP‐5 structures using the Robetta server. Proteins 53:524‐533.
   Chothia, C. and Janin, J. 1975. Principles of protein‐protein recognition. Nature 256:705‐708.
   Chothia, C. 1992. One thousand families for the molecular biologist. Nature 357:543‐544.
   Chothia, C., Gough, J., Vogel, C., and Teichmann, S.A. 2003. Evolution of the protein repertoire. Science 300:1701‐1703.
   Coggill, P., Finn, R.D., and Bateman, A. 2008. Identifying protein domains with the Pfam database. Curr. Protoc. Bioinform. 23:2.5.1‐2.5.17.
   Cole, C., Barber, J.D., and Barton, G.J. 2008. The Jpred 3 secondary structure prediction server. Nucleic Acids Res. 36:W197‐W201.
   Contreras‐Moreira, B. and Bates, P.A. 2002. Domain Fishing: A first step in protein comparative modeling. Bioinformatics 18:1141‐1142.
   Dhir, S., Pacurar, M., Franklin, D., Gáspári, Z., Kertész‐Farkas, A., Kocsor, A., Eisenhaber, F., and Pongor, S. 2010. Detecting atypical examples of known domain types by sequence similarity searching: The SBASE domain library approach. Curr. Protein Peptide Sci. 11:538‐549.
   Dosztányi, Z., Csizmok, V., Tompa, P., and Simon, I. 2005. IUPred: Web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433‐3434.
   Dumontier, M., Yao, R., Feldman, H.J., and Hogue, C.W. 2005. Armadillo: Domain boundary prediction by amino acid composition. J. Mol. Biol. 350:1061‐1073.
   Eickholt, J., Deng, X., and Cheng, J. 2011. DoBo: Protein domain boundary prediction by integrating evolutionary signals and machine learning. BMC Bioinform. 12:43.
   Eswar, N., Webb, B., Marti‐Renom, M.A., Madhusudhan, M., Eramian, D., Shen, M.‐y., Pieper, U., and Sali, A. 2007. Comparative protein structure modeling using MODELLER. Curr. Protoc. Protein Sci. 50:2.9.1‐2.9.31.
   Ezkurdia, I., Graña, O., Izarzugaza, J.M., and Tress, M.L. 2009. Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8. Proteins 77:S196‐S209.
   Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E.L., Eddy, S.R., and Bateman, A. 2009. The Pfam protein families database. Nucleic Acids Res. 38:D211‐D222.
   Fiser, A. and Sali, A. 2003. Modeller: Generation and refinement of homology‐based protein structure models. Methods Enzymol. 374:461‐491.
   Galzitskaya, O.V. and Melnik, B.S. 2003. Prediction of protein domain boundaries from sequence alone. Protein Sci. 12:696‐701.
   George, D.G., Dodson, R.J., Garavelli, J.S., Haft, D.H., Hunt, L.T., Marzec, C.R., Orcutt, B.C., Sidman, K.E., Srinivasarao, G.Y., Yeh, L.S., Arminski, L.M., Ledley, R.S., Tsugita, A., and Barker, W.C. 1997. The protein information resource (PIR) and the PIR‐international protein sequence database. Nucleic Acids Res. 25:24‐28.
   George, R.A. and Heringa, J. 2002. SnapDRAGON: A method to delineate protein structural domains from sequence data. J. Mol. Biol. 316:839‐851.
   Gracy, J. and Argos, P. 1998. Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities. Bioinformatics 14:174‐187.
   Hadley, C. and Jones, D.T. 1999. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 7:1099‐1112.
   Henikoff, S., Henikoff, J.G., and Pietrokovski, S. 1999. Blocks+: A non‐redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics 15:471‐479.
   Holland, T.A., Veretnik, S., Shindyalov, I.N., and Bourne, P.E. 2006. Partitioning protein structures into domains: Why is it so difficult? J. Mol. Biol. 361:562‐590.
   Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A.F., Selengut, J.D., Sigrist, C.J., Thimma, M., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H., and Yeats, C. 2009. InterPro: The integrative protein signature database. Nucleic Acids Res. 37:D211‐D215.
   Inbar, Y., Benyamini, H., Nussinov, R., and Wolfson, H.J. 2003. Protein structure prediction via combinatorial assembly of sub‐structural units. Bioinformatics 19:i158‐i168.
   Ishida, T. and Kinoshita, K. 2008. Prediction of disordered regions in proteins based on the meta approach. Bioinformatics 24:1344‐1348.
   Islam, S.A., Luo, J., and Sternberg, M.J. 1995. Identification and analysis of domains in proteins. Protein Engin. 8:513‐525.
   Jones, D.T. 2007. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 23:538‐544.
   Kaminska, K.H., Baraniak, U., Boniecki, M., Nowaczyk, K., Czerwoniec, A., and Bujnicki, J.M. 2008. Structural bioinformatics analysis of enzymes involved in the biosynthesis pathway of the hypermodified nucleoside ms(2)io(6)A37 in tRNA. Proteins 70:1‐18.
   Kelley, L.A. and Sternberg, M.J.E. 2009. Protein structure prediction on the Web: A case study using the Phyre server. Nat. Protoc. 4:363‐371.
   Kobe, B., Guss, M., and Huber, T. 2008. Structural Proteomics: High‐Throughput Methods, 1st ed., Humana Press, Totowa, N.J.
   Krishnamurthy, N. and Sjölander, K.V. 2005. Basic protein sequence analysis. Curr. Protoc. Protein Sci. 41:2.11.1‐2.11.24.
   Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305:567‐580.
   Kurowski, M.A. and Bujnicki, J.M. 2003. GeneSilico protein structure prediction meta‐server. Nucleic Acids Res. 31:3305‐3307.
   Letunic, I., Doerks, T., and Bork, P. 2009. SMART 6: Recent updates and new developments. Nucleic Acids Res. 37:D229‐D232.
   Levitt, M. and Chothia, C. 1976. Structural patterns in globular proteins. Nature 261:552‐558.
   Liu, J. and Rost, B. 2003. Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol. 7:5‐11.
   Lupas, A., Van Dyke, M., and Stock, J. 1991. Predicting coiled coils from protein sequences. Science 252:1162‐1164.
   Magrane, M. and UniProt Consortium. 2011. UniProt Knowledgebase: A hub of integrated protein data. Database 2011:bar009.
   Majumdar, I., Kinch, L.N., and Grishin, N.V. 2009. A database of domain definitions for proteins with complex interdomain geometry. PLoS One 4:e5084.
   Marchler‐Bauer, A., Anderson, J.B., Chitsaz, F., Derbyshire, M.K., DeWeese‐Scott, C., Fong, J.H., Geer, L.Y., Geer, R.C., Gonzales, N.R., Gwadz, M., He, S., Hurwitz, D.I., Jackson, J.D., Ke, Z., Lanczycki, C.J., Liebert, C.A., Liu, C., Lu, F., Lu, S., Marchler, G.H., Mullokandov, M., Song, J.S., Tasneem, A., Thanki, N., Yamashita, R.A., Zhang, D., Zhang, N., and Bryant, S.H. 2009. CDD: Specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 37:D205‐D210.
   Mi, H., Dong, Q., Muruganujan, A., Gaudet, P., Lewis, S., and Thomas, P.D. 2010. PANTHER version 7: Improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic Acids Res. 38:D204‐D210.
   Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536‐540.
   Nikolskaya, A.N., Arighi, C.N., Huang, H., Barker, W.C., and Wu, C.H. 2006. PIRSF family classification system for protein functional and evolutionary analysis. Evol. Bioinform. Online 2:197‐209.
   Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH—a hierarchic classification of protein domain structures. Structure 5:1093‐1108.
   Peng, K., Radivojac, P., Vucetic, S., Dunker, A.K., and Obradovic, Z. 2006. Length‐dependent prediction of protein intrinsic disorder. BMC Bioinform. 7:208.
   Petsko, G.A. 2006. An introduction to modeling structure from sequence. Curr. Protoc. Bioinform. 15:5.1.1‐5.1.3.
   Richardson, J.S. 1981. The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34:167‐339.
   Roy, A., Kucukural, A., and Zhang, Y. 2010. I‐TASSER: A unified platform for automated protein structure and function prediction. Nat. Protoc. 5:725‐738.
   Saini, H.K. and Fischer, D. 2005. Meta‐DP: Domain prediction meta‐server. Bioinformatics 21:2917‐2920.
   Sanchez‐Pulido, L., Valencia, A., and Rojas, A.M. 2007. Are promyelocytic leukaemia protein nuclear bodies a scaffold for caspase‐2 programmed cell death? Trends Biochem. Sci. 32:400‐406.
   Servant, F., Bru, C., Carrère, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D. 2002. ProDom: Automated clustering of homologous domains. Briefings Bioinform. 3:246‐251.
   Shimizu, K., Hirose, S., and Noguchi, T. 2007. POODLE‐S: Web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position‐specific scoring matrix. Bioinformatics 23:2337‐2338.
   Shiozawa, K., Maita, N., Tomii, K., Seto, A., Goda, N., Akiyama, Y., Shimizu, T., Shirakawa, M., and Hiroaki, H. 2004. Structure of the N‐terminal domain of PEX1 AAA‐ATPase. Characterization of a putative adaptor‐binding domain. J. Biol. Chem. 279:50,060‐50,068.
   Sigrist, C.J.A., Cerutti, L., de Castro, E., Langendijk‐Genevaux, P.S., Bulliard, V., Bairoch, A., and Hulo, N. 2010. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 38:D161‐D166.
   Söding, J. 2005. Protein homology detection by HMM‐HMM comparison. Bioinformatics 21:951‐960.
   Stormo, G.D. 2011. An introduction to recognizing functional domains. Curr. Protoc. Bioinform. 34:2.1.1‐2.1.6.
   Suyama, M. and Ohara, O. 2003. DomCut: Prediction of inter‐domain linker regions in amino acid sequences. Bioinformatics 19:673‐674.
   Tai, C.‐H., Lee, W.J., Vincent, J.J., and Lee, B. 2005. Evaluation of domain prediction in CASP6. Proteins 61:S183‐S192.
   Tatusov, R.L., Galperin, M.Y., Natale, D.A., and Koonin, E.V. 2000. The COG database: A tool for genome‐scale analysis of protein functions and evolution. Nucleic Acids Res. 28:33‐36.
   Taylor, W.R. 1999. Protein structural domain identification. Protein Engin. 12:203‐216.
   Terashi, G., Takeda‐Shitaka, M., Kanou, K., Iwadate, M., Takaya, D., Hosoi, A., Ohta, K., and Umeyama, H. 2007. Fams‐ace: A combined method to select the best model after remodeling all server models. Proteins 69:S98‐S107.
   Tovchigrechko, A. and Vakser, I.A. 2006. GRAMM‐X public web server for protein‐protein docking. Nucleic Acids Res. 34:W310‐W314.
   Tress, M., Cheng, J., Baldi, P., Joo, K., Lee, J., Seo, J.H., Lee, J., Baker, D., Chivian, D., Kim, D., and Ezkurdia, I. 2007. Assessment of predictions submitted for the CASP7 domain prediction category. Proteins 69:S137‐S151.
   Veretnik, S., Bourne, P.E., Alexandrov, N.N., and Shindyalov, I.N. 2004. Toward consistent assignment of structural domains in proteins. J. Mol. Biol. 339:647‐678.
   Wallner, B. and Elofsson, A. 2005. Pcons5: Combining consensus, structural evaluation and fold recognition scores. Bioinformatics 21:4248‐4254.
   Ward, J.J., McGuffin, L.J., Bryson, K., Buxton, B.F., and Jones, D.T. 2004. The DISOPRED server for the prediction of protein disorder. Bioinformatics 20:2138‐2139.
   Wheelan, S.J., Marchler‐Bauer, A., and Bryant, S.H. 2000. Domain size distributions can predict domain boundaries. Bioinformatics 16:613‐618.
   Wilson, D., Pethica, R., Zhou, Y., Talbot, C., Vogel, C., Madera, M., Chothia, C., and Gough, J. 2009. SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37:D380‐D386.
   Wolf, E., Kim, P.S., and Berger, B. 1997. MultiCoil: A program for predicting two‐ and three‐stranded coiled coils. Protein Sci. 6:1179‐1189.
   Xu, D. and Xu, Y. 2000. Protein tertiary structure prediction. Curr. Protoc. Protein Sci. 19:2.7.1‐2.7.17.
   Yeats, C., Lees, J., Reid, A., Kellam, P., Martin, N., Liu, X., and Orengo, C. 2008. Gene3D: Comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36:D414‐D418.