Analysis and Management of Microarray Gene Expression Data

Gregory R. Grant1, Elisabetta Manduchi1, Christian J. Stoeckert1

1 University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
Publication Name:  Current Protocols in Molecular Biology
Unit Number:  Unit 19.6
DOI:  10.1002/0471142727.mb1906s77
Online Posting Date:  January, 2007
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Microarray experiments are costly and labor-intensive, so they require careful planning and a considered choice of analysis tools to get the most out of the data generated. They also require careful documentation, which often resides in local databases and/or is submitted to public repositories. An often bewildering assortment of choices is available for experimental design, data preprocessing, data analysis (e.g., differential gene expression, classification), and data management. This unit covers the basic steps and common applications for planning, data processing, and data management of microarray experiments, and provides guidance on making choices based on the goals and practical realities of the experiment, as well as the authors' experience in this area.

Keywords: microarray; experimental design; data preprocessing; data analysis; databases; gene expression


Table of Contents

  • Experimental Design
  • Data Preprocessing
  • Expression and Differential Expression
  • Classification
  • Looking at Gene Sets
  • Databases
  • Conclusions
  • Literature Cited
  • Figures



Literature Cited

   Allison, D.B., Cui, X., Page, G.P., and Sabripour, M. 2006. Microarray data analysis: From disarray to consolidation and consensus. Nat. Rev. Genet. 7:55‐65.
   Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A. 96:6745‐6750.
   Bailey, T.L. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28‐36. AAAI Press, Menlo Park, Calif.
   Barash, Y., Dehan, E., Krupsky, M., Franklin, W., Geraci, M., Friedman, N., and Kaminski, N. 2004. Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays. Bioinformatics 20:839‐846.
   Bar‐Joseph, Z. 2004. Analyzing time series gene expression data. Bioinformatics 20:2493‐2503.
   Ben‐Dor, A., Shamir, R., and Yakhini, Z. 1999. Clustering gene expression patterns. J. Comput. Biol. 6:281‐297.
   Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B Met. 57:289‐300.
   Bolstad, B.M., Irizarry, R.A., Åstrand, M., and Speed, T.P. 2003. A comparison of normalization methods for high‐density oligonucleotide array data based on variance and bias. Bioinformatics 19:185‐193.
   Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze‐Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. 2001. Minimum information about a microarray experiment (MIAME): Toward standards for microarray data. Nat. Genet. 29:365‐371.
   Breiman, L. 2001. Random forests. Machine Learning 45:5‐32.
   Breiman, L., Friedman, J.H., Olshen, R., and Stone, C.J. 1984. Classification and Regression Trees. The Wadsworth Statistics/Probability Series. Wadsworth International Group, Belmont, Calif.
   Choe, S.E., Boutros, M., Michelson, A.M., Church, G.M., and Halfon, M.S. 2005. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biology 6:R16.
   Cleveland, W.S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc. 74:829‐836.
   Cleveland, W.S. and Devlin, S.J. 1988. Locally weighted regression: An approach to regression analysis by local fitting. J. Am. Stat. Assoc. 83:596‐610.
   Coombes, K.R., Highsmith, W.E., Krogmann, T.A., Baggerly, K.A., Stivers, D.N., and Abruzzo, L.V. 2002. Identifying and quantifying sources of variation in microarray data using high‐density cDNA membrane arrays. J. Comput. Biol. 9:655‐669.
   Coombes, K.R., Wang, J., and Abruzzo, L.V. 2003. Monitoring the quality of microarray experiments. In Methods of Microarray Data Analysis III (K.F. Johnson and S.K. Lin, eds.) pp. 25‐40. Kluwer Academic Publishers, Boston.
   Cope, L.M., Irizarry, R.A., Jaffee, H., Wu, Z., and Speed, T.P. 2003. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 20:323‐331.
   Dabney, A.R. and Storey, J.D. 2005. Optimal Feature Selection for Nearest Centroid Classifiers, With Applications to Gene Expression Microarrays. UW Biostatistics Working Paper Series. Working Paper 267. The Berkeley Electronic Press, Berkeley, Calif.
   D'haeseleer, P., Liang, S., and Somogyi, R. 2000. Genetic network inference: From co‐expression clustering to reverse engineering. Bioinformatics 16:707‐726.
   Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1999. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, U.K.
   Dudoit, S., Fridlyand, J., and Speed, T.P. 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97:77‐87.
   Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome‐wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863‐14868.
   Ewens, W.J. and Grant, G.R. 2005. Statistical Methods in Bioinformatics: An Introduction, 2nd ed. Springer‐Verlag, New York.
   Fasulo, D. 1999. An analysis of recent work on clustering algorithms. Technical Report TR 0103‐02, University of Washington, Department of Computer Science & Engineering, Seattle.
   Frank, E., Hall, M., Trigg, L., Holmes, G., and Witten, I.H. 2004. Data mining in bioinformatics using WEKA. Bioinformatics 20:2479‐2481.
   Furlanello, C., Serafini, M., Merler, S., and Jurman, G. 2003. Entropy‐based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics 4:54.
   Gardiner‐Garden, M. and Littlejohn, T.G. 2001. A comparison of microarray databases. Brief Bioinform. 2:143‐158.
   Ge, Y., Dudoit, S., and Speed, T.P. 2003. Resampling‐based multiple testing for microarray data analysis. Test 12:1‐44.
   Grant, G.R., Manduchi, E., Pizarro, A., and Stoeckert, C.J. Jr. 2003. Maintaining data integrity in microarray data management. Biotechnol. Bioeng. 84:795‐800.
   Grant, G.R., Liu, J., and Stoeckert, C.J. Jr. 2005. A practical false discovery rate approach to identifying patterns of differential expression in microarray data. Bioinformatics 21:2684‐2690.
   Handl, J., Knowles, J., and Kell, D.B. 2005. Computational cluster validation in post genomic data analysis. Bioinformatics 21:3201‐3212.
   Hartigan, J. 1975. Clustering Algorithms. Wiley, Chichester, U.K.
   Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D., and Brown, P. 2003. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 1:research0003.1‐research0003.21.
   Hollander, M. and Wolfe, D.A. 1999. Nonparametric Statistical Methods, 2nd ed. Wiley Interscience, New York.
   Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., and Vingron, M. 2002. Variance stabilization applied to microarray data calibration and to quantification of differential expression. Bioinformatics 18:S96‐S104.
   Irizarry, R.A., Wu, Z., and Jaffee, H.A. 2006. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 22:789‐794.
   Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183‐202.
   Kerr, M.K., Martin, M., and Churchill, G.A. 2000. Analysis of variance for gene expression microarray data. J. Comput. Biol. 7:819‐837.
   Lazzeroni, L.C. and Owen, A. 2002. Plaid models for gene expression data. Statistica Sinica 12:61‐86.
   Liu, H. 2005. Evolving feature selection. IEEE Intelligent Systems 20:64‐76.
   Manduchi, E., Grant, G.R., He, H., Liu, J., Mailman, M.D., Pizarro, A.D., Whetzel, P.L., and Stoeckert, C.J. Jr. 2004. RAD and the RAD Study‐Annotator: An approach to collection, organization and exchange of all relevant information for high‐throughput gene expression studies. Bioinformatics 20:452‐459.
   Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M., Patterson, N., Mesirov, J.P., Golub, T.R., Tamayo, P., Spiegelman, B., Lander, E.S., Hirschhorn, J.N., Altshuler, D., and Group, L.C. 2003. PGC‐1α‐responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34:267‐273.
   Quackenbush, J. 2002. Microarray data normalization and transformation. Nat. Genet. 32:496‐501.
   Rabiner, L.R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE Inst. Electr. Electron. Eng. 77:257‐286.
   Rajagopalan, D. 2003. A comparison of statistical methods for analysis of high density oligonucleotide array data. Bioinformatics 19:1469‐1476.
   Ramoni, M., Sebastiani, P., and Kohane, I. 2002. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. U.S.A. 99:9121‐9126.
   Saeed, A.I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M., Currier, T., Thiagarajan, M., Sturn, A., Snuffin, M., Rezantsev, A., Popov, D., Ryltsov, A., Kostukovich, E., Borisovsky, I., Liu, Z., Vinsavich, A., Trush, V., and Quackenbush, J. 2003. TM4: A free, open‐source system for microarray data management and analysis. Biotechniques 34:374‐378.
   Scearce, L.M., Brestelli, J.E., McWeeney, S.K., Lee, C.S., Mazzarelli, J., Pinney, D.F., Pizarro, A., Stoeckert, C.J. Jr, Clifton, S.W., Permutt, M.A., Brown, J., Melton, D.A., and Kaestner, K.H. 2002. Functional genomics of the endocrine pancreas. The pancreas clone set and PanChip, new resources for diabetes research. Diabetes 51:1997‐2004.
   Schliep, A., Schönhuth, A., and Steinhoff, C. 2003. Using hidden Markov models to analyze gene expression time course data. Bioinformatics 19:i255‐i263.
   Segal, E., Yelensky, R., and Koller, D. 2003. Genome‐wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics 19:273‐282.
   Sherlock, G. and Ball, C.A. 2005. Storage and retrieval of microarray data and open source microarray database software. Mol. Biotechnol. 30:239‐251.
   Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., and Futcher, B. 1998. Comprehensive identification of cell cycle‐regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9:3273‐3297.
   Speed, T.P. (ed.) 2003. Statistical Analysis of Microarray Gene Expression Data. Chapman & Hall/CRC, Boca Raton, Fla.
   Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., and Levy, S. 2005. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21:631‐643.
   Steibel, J.P. and Rosa, G.J.M. 2005. On reference designs for microarray experiments. Stat. Appl. Genet. Mol. Biol. 4(1):Article 36.
   Stivers, D., Wang, J., Rosner, G., and Coombes, K. 2003. Organ specific differences in gene expression and unigene annotations describing source material. In Methods of Microarray Data Analysis III (K.F. Johnson and S.K. Lin, eds.) pp. 59‐72. Kluwer Academic Publishers, Boston.
   Storey, J.D. 2003. The positive false discovery rate: A Bayesian interpretation and the q‐value. Ann. Stat. 31:2013‐2035.
   Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., and Golub, T. 1999. Interpreting patterns of gene expression with self‐organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96:2907‐2912.
   Tamhane, A.C. and Dunlop, D.D. 2000. Statistics and Data Analysis. Prentice Hall, Upper Saddle River, N.J.
   Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. U.S.A. 99:6567‐6572.
   Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. 2003. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci. 18:104‐117.
   Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17:520‐525.
   Tusher, V.G., Tibshirani, R., and Chu, G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98:5116‐5121.
   Vapnik, V. 1998. Statistical Learning Theory. Wiley Interscience, New York.
   Westfall, P.H. and Young, S.S. 1993. Resampling‐Based Multiple Testing. Wiley Interscience, New York.
   Whetzel, P.L., Parkinson, H., Causton, H.C., Fan, L., Fostel, J., Fragoso, G., Game, L., Heiskanen, M., Morrison, N., Rocca‐Serra, P., Sansone, S.A., Taylor, C., White, J., and Stoeckert, C.J. Jr. 2006. The MGED Ontology: A resource for semantics‐based description of microarray experiments. Bioinformatics 22:866‐873.
   Witten, I.H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, San Francisco.
   Yang, Y.H. and Speed, T.P. 2002. Design issues for cDNA microarray experiments. Nat. Rev. Genet. 3:579‐588.
   Yang, Y.H., Buckley, M.J., Dudoit, S., and Speed, T.P. 2002a. Comparison of methods for image analysis on cDNA microarray data. J. Comput. Graph. Stat. 11:108‐136.
   Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. 2002b. Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucl. Acids Res. 30:e15.
   Yekutieli, D. and Benjamini, Y. 1999. Resampling‐based false discovery rate controlling multiple test procedures for correlated test statistics. J. Stat. Plan. Inference 82:171‐196.

Internet Resources

  Affymetrix statistical algorithms reference guide (MAS 5.0).
  Affymetrix Data Analysis Fundamentals Manual.
  The ArrayExpress repository.
  The AlignACE Web site.
  ArrayVision image analysis software.
  The BASE ‐ BioArray Software Environment Web site.
  The BIOBASE Web site, which includes TRANSFAC.
  The Bioconductor project Web site.
  The CAGED Web site.
  DAVID Web site at NIAID, where EASE is also available.
  dChip software, with links to references and tutorials.
  The ELPH Gibbs Sampler Web site.
  The Entrez Gene Web site.
  The GenBank Web site.
  The Gene Expression Omnibus (GEO) repository.
  The Gene Ontology Project Web site.
  GenePix image analysis software.
  The Gene Set Enrichment Analysis (GSEA) Web site.
  The GenMAPP project Web site.
  The GUS (Genomics Unified Schema) Web site containing the RAD software.
  The JASPAR Web site.
  The KEGG Pathway database.
  The MEME Web site.
  The MGED Web site with links for MIAME, MAGE, and MGED Ontology.
  The MIAMExpress annotation and submission tool.
  The MySQL open source database Web site.
  NIST handbook, section on LOESS.
  The PaGE Web site.
  The PAM Web site.
  The PostgreSQL open source database Web site.
  L. Breiman and A. Cutler Web site on Random Forests.
  RMAExpress Web site, which includes relevant references.
  The SAM Web site.
  ScanAlyze image analysis software.
  Spot image analysis software.
  The Stanford Microarray Database (SMD) software download site.
  The Stanford Microarray Database (SMD).
  The Stratagene Web site.
  TM4: a package of open source software for microarray analysis, comprising MADAM, SpotFinder, MIDAS, and MeV.
  Technical report on normalization for two‐channel arrays by Y.H. Yang, S. Dudoit, P. Luu, and T.P. Speed.
  The Weeder Web site.
  The WEKA web resource.