Microarray Expression Data Analysis References


Introduction to Statistical Genomics Issues with Microarray Data

Newton MA, Yandell BS, Shavlik J, Craven M (2001)

The dimension and complexity of raw gene expression data obtained by oligonucleotide chips, spotted arrays, or whatever technology is used, create challenging data analysis and data management problems. In a limited way these challenges can be met by existing software systems and analysis methods in the hands of end users. However, we are convinced that a much more active scientific endeavor is called for. We anticipate that, broadly defined, bioinformatics will encompass statistical and biometrical questions of experimental design, data analysis, graphics and modeling, and computational questions concerning efficient algorithms for various learning tasks such as classification and clustering.

Microarray data can be analysed using several approaches (Claverie, 1999). Clustering methods (i.e. unsupervised learning) are used widely and have the ability to uncover coordinated expression patterns from a collection of microarrays (e.g., Eisen et al. 1998; Getz et al. 2000; Tibshirani et al. 2000; Dudoit, Fridlyand et al. 2000; Kerr and Churchill 2000a). The use of standard clustering methods is most appropriate when the microarrays arise from some common source cell type, for example from a common tissue type from animals in some controlled cross. Refinements may be necessary when other sources of variation affect the microarrays (van der Laan and Bryan 2000). Classification methods (i.e. supervised learning) have proven very useful to identify patterns of gene expression that can be correlated with qualitative disease phenotypes (e.g. Golub et al. 1999) and for classifying genes according to their functional role (Brown et al. 2000). Related methods of multivariate statistical analysis, such as those using the singular value decomposition (Alter et al. 2000; West et al. 2000) or multidimensional scaling can be effective at reducing the dimension of the objects under study.
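As a rough illustration of the clustering idea described above, the following sketch groups genes by average-linkage agglomerative clustering on a correlation distance (1 minus the Pearson correlation). It is a minimal example under stated assumptions, not the procedure of any cited paper: the gene names, the expression values, and the function names `pearson` and `average_linkage` are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    names = list(profiles)
    # Correlation distance between every ordered pair of distinct genes.
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical log-expression profiles for four genes across five arrays.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [1.1, 2.3, 2.9, 4.2, 5.1],
    "g3": [5.0, 4.0, 3.0, 2.0, 1.0],
    "g4": [5.2, 3.9, 3.1, 1.8, 1.0],
}
clusters = average_linkage(profiles, n_clusters=2)
```

Correlation distance groups genes whose profiles rise and fall together regardless of overall abundance, which is why it is a common choice for expression data; other distances and linkage rules would give different groupings.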

Statistical methods are emerging to account for multiple sources of variation when trying to pool information from many microarrays and to identify genes exhibiting significant differential expression between cell types. One approach is to decompose the appropriately transformed expression measurement as a linear combination of effects from different sources of variation (Kerr et al. 2000). This is basically ANOVA for microarrays. In the context of a two-group comparison with replication, Dudoit, Yang et al. (2000) have proposed the use of permutation testing and p-value adjustment to account for the multiple-testing problem. Lin et al. (2001) describe a nonparametric method suited to uncovering differential expression for low-abundance transcripts. Alternatively, the mixture-model approach can be used to directly assess the probability that a given gene is truly expressed (Lee et al. 2000) or the probability that a gene is truly differentially expressed between two conditions (Newton et al. 2001; Efron et al. 2001). The functional patterns of expression identified by such statistical calculations will be backed up by laboratory examination to verify findings (cf. Nadler et al. 2000).
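The two-group permutation idea can be sketched as follows. This is an illustrative example only: it uses an exact permutation test on the difference in group means and a Holm step-down adjustment, which is a simpler substitute for whatever adjustment the cited authors used. The data values and the function names `mean_diff`, `permutation_pvalue`, and `holm_adjust` are hypothetical.

```python
from itertools import combinations


def mean_diff(values, idx_a):
    """Difference in means between the arrays in idx_a and all remaining arrays."""
    a = [values[i] for i in idx_a]
    b = [values[i] for i in range(len(values)) if i not in idx_a]
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_pvalue(values, n_a):
    """Exact two-sided permutation p-value; the first n_a values form group A."""
    observed = abs(mean_diff(values, set(range(n_a))))
    count = total = 0
    # Enumerate every way of relabelling n_a of the arrays as group A.
    for idx in combinations(range(len(values)), n_a):
        total += 1
        if abs(mean_diff(values, set(idx))) >= observed - 1e-12:
            count += 1
    return count / total


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical log-expression values: four replicates per condition.
genes = {
    "gene_a": [1.0, 1.1, 0.9, 1.2, 3.0, 3.1, 2.9, 3.2],    # clear shift
    "gene_b": [2.0, 2.1, 1.9, 2.2, 2.05, 1.95, 2.15, 2.0],  # no clear shift
}
raw = [permutation_pvalue(v, n_a=4) for v in genes.values()]
adjusted = holm_adjust(raw)
```

With four replicates per group there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70. This shows concretely why replication limits what a permutation test can detect once thousands of genes are adjusted for simultaneously.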

Although analysis methods have been a central concern in most bioinformatics research to date, the issue of experimental design is critical. The use of replication in controlled experiments, for example, can significantly improve power to uncover differentially expressed genes (Kerr and Churchill 2000b; Lee et al. 2000). Our internal review of requests for microarray support will include careful examination of experimental design considerations.

Microarray analysis typically uses background-adjusted expression intensities (PM-MM for Affymetrix chips). However, background adjustment can produce negative values, which is a problem because the log transform is often applied to the adjusted values. This has prompted ad hoc procedures (cf. Roberts et al. 2000). However, arbitrary handling of low-expression genes is unsatisfactory, since these may be the most interesting, e.g. transcription factors and receptors. Instead, Lin et al. (2001) advocated an approximate normal scores transformation of background-adjusted expression, which allows the use of all data (see also Efron et al. 2001). These normal scores appear to have better properties for clustering and are well behaved for inference on differential expression.
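A generic rank-based normal scores transformation can be sketched as below. This is an assumption-laden illustration, not necessarily the exact transformation of Lin et al. (2001): it replaces each value by the standard normal quantile of (rank - 0.5) / n, with tied values sharing their average rank. The function name `normal_scores` and the sample values are hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map values to standard normal quantiles of (rank - 0.5) / n; ties share the average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the run of values tied with position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical background-adjusted intensities (PM-MM), including negative values.
adjusted = [-12.0, 3.5, 250.0, -0.5, 40.0]
scores = normal_scores(adjusted)
```

Because the transformation depends only on ranks, negative adjusted values need no special handling and every gene is retained, in contrast to a log transform, which is undefined for non-positive inputs.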

Identifying patterns of gene expression through data analysis is only the beginning. In many cases, greater biological understanding can be attained by using expression data in conjunction with sequence data (Craven et al. 2000), pathway data (Zien et al. 2000), and biomedical text sources (Shatkay et al. 2000). It may in addition involve constructing predictive models from diverse data sources (Craven et al. 2000), and developing automated methods for exploiting text and Web data (Craven and Kumlien, 1999; Shavlik et al. 1999).


Microarray Data Overview

Lockhart et al. (1996); Ermolaeva et al. (1998); Bassett, Eisen, Boguski (1999); Brown, Botstein (1999); Claverie (1999); Duggan, Bittner, Chen, Meltzer, Trent (1999); Lander ES (1999); Lipshutz, Fodor, Gingeras, Lockhart (1999); Zhang MQ (1999); Lockhart, Winzeler (2000)

Biology Implications of Microarray Data Analysis

Spellman et al. (1998); Lee, Klopp, Weindruch, Prolla (1999); Richmond et al. (1999); Nadler et al. (2000); Soukas, Cohen, Socci, Friedman (2000)

Exploiting Bioinformatics with Microarrays

Craven, Kumlien (1999); Shavlik et al. (1999); Craven et al. (2000); Shatkay et al. (2000); Zien et al. (2000); Eisen, Chiang, Brown (2001); Liu (2001)

Microarray Image Analysis

Schadt et al. (1999); Yang et al. (2000); Yang et al. (2000); Brown, Goodwin, Sorger (2001); Li, Wong (2001); Schadt et al. (2001); Stuart, Bush, Nigam (2001); Irizarry et al. (2003); Zhang, Miles, Aldape (2003)

Differential Expression: Comparing Two Conditions

Baldi, Long (2001); DeRisi et al. (1996); Chen, Dougherty, Bittner (1997); Hughes et al. (2000); Cui, Churchill (2003); Long et al. (2001); Newton et al. (2001); Roberts et al. (2000); Theilhaber et al. (1999); Woolf, Wang (2000)

Differential Expression: Anova and Experimental Design

Dudoit et al. (2000); Kerr, Churchill (2000b); Kim et al. (2000); Lee et al. (2000); Kerr, Churchill (2001); Kerr, Martin, Churchill (2001); Efron et al. (2001); Lin et al. (2001); Mills, Gordon (2001); Pan, Lin, Le (2001); Pan, Lin, Le (2001); Thomas et al. (2001); Wolfinger et al. (2001); Zhao, Prentice, Breeden (2001); Lonnstedt, Speed (2002); Newton et al. (2003)

Clustering and Classification Methods

Eisen et al. (1998); Golub et al. (1999); Tamayo et al. (1999); Tibshirani et al. (1999); Brown et al. (2000); Dudoit, Fridlyand, Speed (2000); Getz, Levine, Domany (2000); Hastie et al. (2000); Hastie et al. (2000); Lazzeroni, Owen (2000); Spang et al. (2000); van der Laan, Bryan (2000); Barash, Friedman (2001); Ben-Dor, Friedman, Yakhini (2001); Kerr, Churchill (2001); Pan, Lin, Le (2001); Smith et al. (2001); West et al. (2001); Yeung et al. (2001); Pollard, van der Laan (2002); Jörnsten, Yu (2003); Kluger et al. (2003)

Principal Components/Singular Value Decomposition

Hilsenbeck et al. (1999); Wittes, Friedman (1999); Alter, Brown, Botstein (2000); West (2000); West et al. (2000); Fellenberg et al. (2001); Holter et al. (2001); Wall (2001); Kim, Tidor (2003)

Expression Networks


Brian Yandell (yandell@stat.wisc.edu)