Improved Statistical Test for Multiple-Condition Microarray Data Integrating ANOVA with Clustering
Jared Friedman¹
This research was done at The Rockefeller University and supervised by Brian Kirk²
¹ 18 East End Ave, New York, NY, 1128. Email: vicvic@att.net. Phone: 212 249-134
² Email: KirkB@Rockefeller.edu. Phone: 212-327-764
Summary
Microarrays are exciting new biological instruments. They promise to have important applications in many fields of biological research, but they are currently fairly inaccurate instruments, and this inaccuracy increases their expense and reduces their usefulness. This research introduces a new technique for analyzing data from some types of microarray experiments that promises to produce more accurate results. Essentially, the method works by identifying patterns in the genes.
Abstract
Motivation: Statistical significance analysis of microarray data is currently a pressing problem. A multitude of statistical techniques have been proposed, and in the case of simple two-condition experiments these techniques work well, but in the case of multiple-condition experiments there is additional information that none of these techniques takes into account. This information is the shape of the expression vector for each gene, i.e., the ordered set of expression measurements, and its usefulness lies in the fact that genes that are actually affected biologically by some experimental circumstance tend to fall into a relatively small number of clusters.
Introduction
Microarrays are part of an exciting new class of biotechnologies that promise to allow monitoring of genome-wide expression levels (1). Although the technique presented in this paper is quite general and would, in principle, work with other types of data, it has been developed for microarray data. Microarray experiments work by testing expression levels of some set of genes in organisms exposed to at least two different conditions, and usually then trying to determine which (if any) of these genes are actually changing in response to the change in experimental condition (2). Unfortunately, the random variation in microarrays is often large relative to the changes to be detected. This makes it very hard to eliminate false positives, genes whose random fluctuations make them erroneously seem to respond to the experimental condition, and false negatives, genes whose random fluctuations mask the fact that they are actually responding to the experimental condition. It is standard statistical practice to handle this problem by using replicates, i.e., several microarrays run in identical settings (1-9). Although replicates are highly effective, the expense of additional replicates makes it worthwhile to minimize the number needed by using more effective statistical tests.
When there are only two conditions in an experiment, the conventional statistical test for differential expression is the t-test (2, 6). But Long et al. (5, also 6, 7) realized that by running a separate t-test on each gene, valuable information is lost, namely the fact that the actual population standard deviation of all the genes is rather similar. Due to the small number of replicates in microarray experiments there are usually a few genes that, by chance, appear to have extremely low standard deviations and thus register even a very small change as highly significant by a conventional t-test. In reality, and as additional replicates would confirm, the standard deviations of these genes are much higher, and the change is not at all significant. In response to this issue, Long et al. used Bayesian statistics to derive a method of transforming standard deviations that, in essence, moves outlying standard deviations closer to the mean standard deviation. Unfortunately, the resulting algorithm, which they call Cyber-T, works only with two-condition experiments. Since this research is concerned with multiple-condition experiments, it was necessary to extend the method to work with this type of experiment, the standard statistical test for which is called ANOVA (ANalysis Of VAriance) (8). Here we report the successful extension of the Cyber-T algorithm to multiple-condition experiments, creating the algorithm we call Cyber-ANOVA.
In microarray experiments involving several conditions, the expression vector of each gene, formed by consecutive measurements of its expression level, has a distinctive shape, which contains useful information. The more conditions in the experiment (and thus numbers in the expression vector), the more distinctive the shape will be. Often biologists are interested in finding genes with expression vectors of similar shape, and this interest has generated a well-developed process for grouping genes by shape known as clustering (9). The basis of clustering techniques is the concept of correlation, the similarity between two vectors. We chose to use the Pearson correlation, because this measure generally captures more accurately the actual biological idea of correlation (2). With the Pearson correlation, as elsewhere, distance is defined as one minus the correlation and measures the dissimilarity between two vectors. Using this measure of distance, it is possible to solve the problem of taking a large number of unorganized vectors and placing them into clusters of similar shapes: this is known as K-means clustering (12). Once the K-means algorithm has been completed, it is often useful to determine how well each gene has clustered. This is done by computing a representative vector, often called a center, for each cluster and calculating the distance between each gene and the center of that gene's cluster (14, 15).
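The distance measure described above can be illustrated with a minimal sketch. The function names are illustrative, the center is assumed to be the component-wise mean of the cluster members, and a flat vector (for which the Pearson correlation is undefined) is assumed to have correlation zero; none of these choices is taken from the original implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat vector is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def distance(x, y):
    """Distance = 1 - correlation: 0 for identical shapes, 2 for opposite shapes."""
    return 1.0 - pearson(x, y)

def cluster_center(members):
    """Assumed definition of a center: the component-wise mean of the members."""
    return [sum(column) / len(members) for column in zip(*members)]

def distances_to_center(members):
    """Distance between each gene in a cluster and that cluster's center."""
    center = cluster_center(members)
    return [distance(gene, center) for gene in members]
```

Because the Pearson correlation ignores offset and scale, two genes with the same shape at different intensities have a distance of zero, which is the property that makes this measure suitable for grouping by shape.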
The fundamental thesis of this research is that genes actually affected by experimental circumstances tend to fall into a relatively small number of clusters, and that this information can be used to make a more accurate statistical test. While it is in principle possible, it is not often the case that large numbers of affected genes behave in seemingly random and entirely dissimilar patterns.
[...] (1[...]) and Jiang et al. (11). Both studies ran K-means clustering on multiple-condition data and found several distinct cluster shapes that accounted for all clearly significant genes.
We used this connection between clustering and significance to create a more accurate statistical test whose essential steps are the following. First, Cyber-ANOVA is run on the entire dataset and the highly significant genes are clustered. The other genes are placed into the cluster in which they fit best, and the distance between each gene and the center of its cluster is computed. There are now two values for each gene: the P value (probability) from the Cyber-ANOVA test and the D value (distance) from the clustering. These two values are combined, and the resulting value is more accurate than either one alone.
The potential advantage of this algorithm was validated by the creation and successful testing of a large number of artificial data sets. Artificial data was chosen over real experimental data for two main reasons: (1) in experimental data, uncertainty always exists over which genes are actually differentially expressed; this presents major problems in determining how well some given algorithm works; and (2) no single microarray experiment is actually representative of all possible microarray experiments: to determine accurately the performance advantage of an algorithm, a large number of very different experiments must be used, a prohibitively expensive and difficult process. Artificial creation of data sets allows every parameter of the data to be perfectly controlled, and the algorithm's performance can be tested on every possible combination of parameters.
The basic method of analysis was to generate an artificial data set from a number of parameters, run the clustering algorithm on the data and measure its performance, run Cyber-ANOVA on that data and on similar data but with additional replicates and measure its performance, and then compare the results. The performance gain of the clustering algorithm is probably best expressed as equivalent replicate gain: the number of additional replicates that would be needed to produce the results of the clustering algorithm using a standard Cyber-ANOVA. On reasonable simulated data, we found that the benefit was approximately one additional replicate added to an original two. With additional improvements to the algorithm design, this saving could probably be increased still further, and the advantage is clearly significant enough to warrant use in actual microarray experiments.
Methods
Algorithm Design
The algorithm is implemented as a large (2[...] lines) macro script in Microsoft Visual Basic for Applications controlling Microsoft Excel. The essential steps of the algorithm are the following:
1. Run Cyber-ANOVA on all genes;
2. Adjust P values for multiple testing;
3. Use the results of (2) to choose an appropriate number of most significant genes;
4. Cluster this selection of genes using K-means;
5. Take the remaining genes and group each into the cluster with which it correlates best;
6. Compute the distance between each gene and the center of its cluster;
7. Compute a combined measurement of significance for each gene.
1. Run Cyber-ANOVA on all genes. Cyber-ANOVA is just a simple extension of Cyber-T, but its novelty warrants some discussion. Cyber-T takes as input a list of genes, means in each condition, standard deviations in each condition, and number of replicates. It gives as output a set of P values much like those from a conventional t-test, but more accurate. The improvement is based on the idea of regularizing the standard deviations, essentially moving outlying standard deviations closer to the mean. At the same time, it recognizes that standard deviation may change significantly (and inversely) with intensity level, and thus genes are regularized only with genes of similar intensity. The algorithm works by first computing for each gene a background standard deviation, the average standard deviation of the hundred genes with closest intensity, and then combining the background standard deviation with the observed standard deviation in a type of weighted average. A standard t-test is then run, but using the regularized standard deviation instead of the observed one.
Cyber-ANOVA works by the same method, taking as input means and standard deviations in any number of conditions. Each background standard deviation is calculated using the intensities of the genes in their own condition, and the values are combined using the same formula. But because there are more than two conditions, a t-test cannot be applied; instead we use ANOVA, inputting the regularized standard deviation instead of the observed one. The result gives a dramatic increase in accuracy, an equivalent replicate gain of roughly one.
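The regularization and the ANOVA step can be sketched as follows. This is only an illustration under stated assumptions: the weighted average is taken over variances with a hypothetical `prior_weight` parameter (the exact weighting formula is not given in the text), the background window includes the gene itself, and only the F statistic is computed, since converting it to a P value requires an F distribution that the Python standard library does not provide.

```python
import math

def background_sds(means, sds, window=100):
    """Background SD per gene: average SD of the `window` genes closest in intensity.

    Quadratic-time sketch; the gene itself counts as one of its own neighbors (assumption).
    """
    result = []
    for m in means:
        nearest = sorted(range(len(means)), key=lambda j: abs(means[j] - m))[:window]
        result.append(sum(sds[j] for j in nearest) / len(nearest))
    return result

def regularize(observed_sd, background_sd, n_reps, prior_weight=10):
    """Weighted average of background and observed variance (weights are an assumption)."""
    variance = (prior_weight * background_sd ** 2 + (n_reps - 1) * observed_sd ** 2) / (
        prior_weight + n_reps - 1
    )
    return math.sqrt(variance)

def anova_f(condition_means, condition_sds, n_reps):
    """One-way ANOVA F statistic from per-condition means and SDs, equal replicates."""
    k = len(condition_means)
    grand_mean = sum(condition_means) / k
    ms_between = n_reps * sum((m - grand_mean) ** 2 for m in condition_means) / (k - 1)
    ms_within = sum(s ** 2 for s in condition_sds) / k
    return ms_between / ms_within
```

Passing regularized standard deviations to `anova_f` in place of the observed ones mirrors the substitution described above: a gene whose observed standard deviation is small by chance no longer produces an inflated F statistic.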
2. Adjust P values for multiple testing. To run the clustering algorithm, we need to select a number of genes that are highly significant to form the seed clusters around which all other genes will cluster. The difficult step in the process is deciding how many genes to select: it is critical to get at least a few seeds for all clusters but also critical not to have too many false positives. Since the first requirement is impossible to determine by calculation, we use the second. In order to choose an appropriate number of expected false positives, we must have an accurate estimate of the false positive rate. The simplest way of estimating the false positive rate is a Bonferroni correction, which simply multiplies each P value by the total number of tests. This method assumes that all genes are independent, and tends to be too conservative, particularly with small numbers of replicates, and a number of more accurate techniques have been proposed. However, for this step, only a rough estimate is necessary, and thus the Bonferroni correction suffices.
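As a small illustration of the correction described above, each P value is multiplied by the number of tests. Capping the result at 1 is an added convention, not something stated in the text.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1.0 (cap is an assumption)."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]
```

Read this way, an adjusted value approximates the expected number of false positives among all genes at least as significant as the gene in question, which is the quantity needed in the next step.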
3. Use the results of (2) to choose an appropriate number of most significant genes. Once the expected false positive rate for any number of the most significant genes has been estimated, some number of genes must be selected such that the number of expected false positives is some percentage of the number of selected genes. There is no particular correct way to define this percentage; it depends on how significant the genes in the least significant cluster are, which cannot be determined directly. However, we have found that the algorithm is tolerant to deviations from optimality (Table 6).
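One possible reading of this selection rule is sketched below. It assumes that the expected number of false positives among the k most significant genes is estimated as the k-th smallest P value times the total number of genes, and that the largest k satisfying the chosen percentage is kept; the function name and the default fraction are hypothetical.

```python
def choose_seed_count(p_values, max_fp_fraction=0.05):
    """Largest k such that expected false positives among the top k genes
    is at most max_fp_fraction * k (interpretation assumed, not from the source)."""
    n_genes = len(p_values)
    best = 0
    for k, p in enumerate(sorted(p_values), start=1):
        expected_false_positives = p * n_genes
        if expected_false_positives <= max_fp_fraction * k:
            best = k
    return best
```

With ten genes of which three have very small P values, for example, the rule selects exactly those three as seeds at a fraction of 0.05.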
4. Cluster this selection of genes using K-means. The chosen group of most significant genes is clustered using the K-means algorithm with the Pearson correlation. The primary problem with all K-means clustering algorithms is that the number of clusters must be declared at the start (13). But like the determination of the number of genes to include in the initial clustering, there is no problem as long as K is rather high. This can be explained conceptually by considering what happens to differentially expressed and non-differentially expressed genes as K increases. If there are only differentially expressed genes in the original set, then as K is raised, genes that were originally placed in one cluster will separate into multiple clusters. Fortunately, this is guaranteed by the process of K-means to lower the D values and create better cluster shapes. If there are non-differentially expressed genes, then as K is increased these genes will tend to separate out from the differentially expressed genes and form their own clusters, a highly desirable occurrence. Unfortunately, if K becomes very large, the noise genes may start to cluster very well with other noise genes, forming erroneously low distances, and the other noise genes added in the next step will have a better chance of incidentally finding a cluster with which they correlate well. In principle, too large a K might be partially monitored by checking for very small clusters and for clusters with very high average distances, both of which could be removed and their genes forced to join a different cluster. These methods have not yet been implemented.
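A compact sketch of K-means with the Pearson correlation is given below for orientation. The initialization (the first K genes serve as starting centers), the fixed iteration count, and the treatment of flat vectors are simplifying assumptions and are not taken from the original macro.

```python
import math

def pearson(x, y):
    """Pearson correlation; flat vectors are treated as uncorrelated (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def kmeans_pearson(genes, k, iterations=20):
    """Assign each gene to the center it correlates with best, then recompute centers."""
    centers = [list(g) for g in genes[:k]]  # naive initialization (assumption)
    assignment = [0] * len(genes)
    for _ in range(iterations):
        assignment = [
            max(range(k), key=lambda c: pearson(gene, centers[c])) for gene in genes
        ]
        for c in range(k):
            members = [g for g, a in zip(genes, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers
```

A production version would normally use several random restarts and a convergence check; both are omitted here to keep the sketch short.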
5. Take the remaining genes and group each into the cluster with which it correlates best. The clusters formed in step 4 act as seeds for the remaining genes. The clusters formed optimally include all of the cluster shapes of actually differentially expressed genes and no others. It was possible to form these clusters, even if they were less than ideal, by using only a subset of the genes in the data set. Now it is possible to look at the remaining genes and to see how well each belongs in one of the clusters. Ideally, each differentially expressed gene would correlate perfectly with exactly one cluster and each noise gene would correlate poorly with all of them. In practice, of course, this is far from the case, but the principle remains. Since each gene is only expected to correlate well with one cluster, each gene in the rest of the dataset is added to the cluster with which it correlates best.
6. Compute the distance between each gene and the center of its cluster. The measurement of how well each gene belongs in its cluster is now determined by computing the Pearson correlation between each gene and the center of its cluster. The resulting correlations are converted to distances (distance = 1 - correlation). These distance values are the key numbers produced by the first half of the algorithm. They represent whether a gene vector's shape is similar to the shape of gene vectors known to be significant. The basis of this research is that low distances tend to imply differentially expressed genes. Unfortunately, distance values alone are not an accurate measure of significance: ranking genes by distance alone produces results far worse than the original Cyber-ANOVA. Instead, the distance ("D") values must be combined with the original P values to give new values better than either alone.
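Steps 5 and 6 can be sketched together: each remaining gene is compared against every seed cluster center, joins the one it correlates with best, and receives D = 1 - correlation with that center. The function names are illustrative, and the centers are assumed to have been computed already from the seed clustering.

```python
import math

def pearson(x, y):
    """Pearson correlation; flat vectors are treated as uncorrelated (assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def assign_and_measure(genes, centers):
    """Return (cluster index, D value) for each gene, where the cluster is the
    best-correlating center and D = 1 - correlation with that center."""
    results = []
    for gene in genes:
        correlations = [pearson(gene, center) for center in centers]
        best = max(range(len(centers)), key=lambda i: correlations[i])
        results.append((best, 1.0 - correlations[best]))
    return results
```

Because every gene is assigned to its best cluster, even pure noise genes tend to receive moderately low D values, which is one reason the D value cannot be used on its own.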
7. Compute a combined measurement of significance for each gene. The goal now is to take the P value and the D value and combine them in some way to make a new value. Just as with the calculation of the previous probabilities, what we are really interested in is finding the probability of a non-differentially expressed gene with a certain standard deviation attaining both a P value less than or equal to the observed P value and a D value less than or equal to the observed D value. Unfortunately, the theoretical basis for the connection between P and D values is not well defined, and it may be very complicated and experiment-dependent; there may well be no useful analytic solution to this problem. Instead, we generate the distribution of P and D values for each experiment. First, thousands of non-differentially expressed genes are generated using the standard deviation that would be expected at their intensity level, and their P and D values are calculated. Then, for each gene in the data set, the number of non-differentially expressed genes that have both P and D values equal to or lower than the P and D values of that gene is counted. That number is divided by the number of non-differentially expressed genes generated, and that quotient gives an approximate probability.
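The counting procedure of step 7 can be written down directly. In this sketch the simulated non-differentially expressed genes are assumed to be available already as (P, D) pairs; how they are simulated is outside the scope of the example, and the function name is illustrative.

```python
def combined_values(observed, simulated_null):
    """For each observed (P, D) pair, return the fraction of simulated
    non-differentially expressed genes whose P and D are both equal or lower."""
    n_null = len(simulated_null)
    combined = []
    for p_obs, d_obs in observed:
        count = sum(1 for p0, d0 in simulated_null if p0 <= p_obs and d0 <= d_obs)
        combined.append(count / n_null)
    return combined
```

The smallest nonzero value this procedure can return is one divided by the number of simulated genes, so any gene more extreme than every simulated gene receives a combined value of exactly zero.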
It is important to note that for highly significant genes, this method does not give accurate probabilities, for if some number M of non-differentially expressed genes are generated, only genes with an actual probability of occurring of about 1/M or more are likely to be generated. Thus, genes in the data set that are actually significant to a probability of below 1/M will almost certainly not have any generated genes more significant and will all be assigned probabilities of zero. But this is not a major concern, since these genes are so significant that there is little question that they are differentially expressed, and their precise rank is not likely to be important. If it is, then the genes with zero combined values can be ranked internally by their P values. In this case, within the very most significant genes, the algorithm will not have had any effect.
important parameters are the numbers of differentially and non-differentially expressed genes, number of groups, cluster shapes, standard deviations, and fold changes.
For each data set, about 8[...] genes were created, with the number of differentially expressed genes ranging from about 1[...] to about [...]. The rest were genes whose mean value did not change across conditions, but that mean value did vary between different non-differentially expressed genes. The number of groups used is a key determinant of the performance of the algorithm, more groups causing better performance. Most of the data sets generated used six groups, but a set with twenty groups was tried.
The cluster shapes of non-differentially expressed genes are, of course, flat lines, for that is the definition of being non-differentially expressed. The cluster shapes of the differentially expressed genes were more complicated. They are drawn loosely from [...]ao et al. (1[...]), but the specific shapes chosen are not all that important. An important property of the Pearson correlation is that, given some expression vector, the probability of some other expression vector having a distance to the original vector below some value is the same regardless of the shape of the original vector; it depends only on the number of conditions (8). On the other hand, given two original vectors, the probability of a test vector having a distance below some value to either one of them does depend on the shapes of the original vectors. As intuition suggests, vectors of opposite shape (like a linear increasing vector and linear decreasing vector) make it more likely for a test vector to cluster well with one of them. To the extent that cluster shape does affect the algorithm, the more similar the cluster shapes, the better the algorithm will perform. Cluster shapes can be graphed and assigned names based on their shape. Data was generated using up to twelve cluster shapes (Figure 1a, 1b).
Figure 1A. Six of the Twelve Cluster Shapes: Spike, Dip, Cliff Up, Cliff Down, Hill, Valley. (Axes: Log Expression Level vs. Condition, conditions 1 to 6.)
Figure 1B. The Other Six Cluster Shapes: Linear Gradient Up, Linear Gradient Down, Concave Up Gradient Up, Concave Down Gradient Up, Concave Up Gradient Down, Concave Down Gradient Down. (Axes: Log Expression Level vs. Condition, conditions 1 to 6.)
The method of assigning standard deviations was chosen to be as realistic as possible. Cyber-ANOVA takes into account the fact that standard deviation often changes significantly with absolute intensity in microarray experiments, and the generated data model this phenomenon. The data of Long et al. (5), which is freely available, was graphed, mean intensity vs. standard deviation, and a quadratic regression calculated. This regression equation is an explicit function for standard deviation in terms of mean, and it was used with a modification to calculate standard deviations for the generated data. The modification is required because it is not the case that standard deviation depends solely on intensity. The mean intensity vs. standard deviation graph does not make a perfect line; there is considerable width to that curve, and this too was simulated. The standard deviation itself was randomized with a small meta-standard deviation. The assumption of normality in this generation is probably false, but the change in the algorithm performance caused by introducing the meta-randomization at all is so small that it seems highly unlikely that a better model would produce a significant difference.
To take full advantage of the standard deviation dependence on mean, non-differentially expressed genes were generated with means separated by hundredths between 8 and 16 (on a log2 scale). This created a considerable range of standard deviations for each data set. The means of differentially expressed genes were generated starting from 8 different baselines, the integers from 8 to 15.
Each differentially expressed gene was generated at five different fold changes: 1.5, 1.7, 2.[...], 3.0, and 6.0 on an unlogged scale. The six and three fold change genes generally made up the seed clustering group, and the 1.5 and 1.7 fold change genes were responsible for most of the difference in performance between algorithms.
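The data generation described in this section can be sketched as follows. The quadratic coefficients, the meta-standard deviation, and the encoding of a cluster shape as values between 0 and 1 are all hypothetical choices made for illustration; the regression coefficients actually fitted to the published data are not given in the text.

```python
import math
import random

def sd_for_mean(mean, coeffs=(2.0, -0.25, 0.009)):
    """Standard deviation as a quadratic function of mean log2 intensity.

    The coefficients are hypothetical placeholders, not the fitted regression.
    """
    a, b, c = coeffs
    return max(0.01, a + b * mean + c * mean * mean)

def make_gene(baseline, shape, fold_change, n_reps, rng, meta_sd=0.02):
    """Simulate one gene: a list of conditions, each a list of replicate values.

    shape: per-condition values in [0, 1] (e.g. a spike is [0, 0, 1, 0, 0, 0]);
    fold_change is on the unlogged scale, so the log2 amplitude is log2(fold_change).
    A fold_change of 1.0 yields a flat, non-differentially expressed gene.
    """
    amplitude = math.log2(fold_change)
    conditions = []
    for s in shape:
        true_mean = baseline + s * amplitude
        # the standard deviation itself is randomized with a small meta-SD
        sd = abs(rng.gauss(sd_for_mean(true_mean), meta_sd))
        conditions.append([rng.gauss(true_mean, sd) for _ in range(n_reps)])
    return conditions
```

A call such as `make_gene(8.0, [0, 0, 1, 0, 0, 0], 2.0, 2, random.Random(0))` would produce a spike-shaped gene with two replicates in each of six conditions.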
Scoring Algorithms
Once data has been generated and assigned P values using Cyber-ANOVA and combined values using the clustering algorithm, there must be some way to compare the performance. We introduce a new method for determining the performance of an algorithm, which we believe fixes a shortcoming in the standard method.
The most common method of scoring appears to be a consistency test (1-7), which works the following way. Say there is a data set with N genes known to be differentially expressed. First, some significance test is performed and each gene assigned a significance level. Next, the genes are ranked by significance level, most significant first. Out of the top N most significant genes, the number D of genes that are actually differentially expressed is determined, and the score is D / N.
The problem with the consistency test is that it ignores the fact that it often will matter whether the (N-D) genes are listed consecutively right after the Nth gene or at the very bottom of the list. Biologists would prefer that the differentially expressed genes be listed as close to the top as possible, for it is easier to find an interesting gene or group of genes if it is higher on the list.
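The consistency test and its blind spot can be illustrated with a short sketch. Gene identifiers are arbitrary strings here, and the function name is illustrative.

```python
def consistency_score(ranked_genes, truly_de):
    """D / N: fraction of the top N ranked genes that are truly differentially
    expressed, where N is the number of truly differentially expressed genes."""
    n = len(truly_de)
    top_n = ranked_genes[:n]
    return sum(1 for gene in top_n if gene in truly_de) / n
```

Two rankings that place the one missed gene at rank N+1 and at the very bottom of the list receive the same score, which is the shortcoming discussed above.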
As a solution, we introduce an algorithm called rank sum. Rank sum also takes a list of genes ranked by significance level, but computes the score differently. Rank sum is equal to the sum of the ranks of all the N differentially expressed genes minus (1+2+3+...+N), the minimum score possible. Lower rank sums indicate better performance, and a perfect significance test gives a rank sum of zero.
But rank sum has significant flaws, too, for it gives too much weight to the genes placed after the Nth rank. In the worst case, one single misplaced gene could change the score from 0 to the number of genes in the experiment, 8[...] in this case. Here is a successful modification. Instead of summing the actual ranks, we sum for all differentially expressed genes F(Rank), where F is some increasing concave-down function. There is no particular right choice for F, for the preferred function will vary depending on the experiment and experimenter, but both logarithmic and polynomial (to a power less than, say, 1/2) functions give almost identical results.
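Both the plain and the modified rank sum fit in one small function. Subtracting the minimum possible score in the modified case as well, so that a perfect ranking still scores zero, is an assumption; the text specifies this subtraction only for the plain rank sum.

```python
import math

def rank_sum(ranked_genes, truly_de, f=lambda rank: rank):
    """Sum of f(rank) over the truly differentially expressed genes, minus the
    minimum possible value f(1) + ... + f(N). With the default identity f this
    is the plain rank sum; a concave f such as math.log gives the modified one."""
    ranks = [i for i, gene in enumerate(ranked_genes, start=1) if gene in truly_de]
    minimum = sum(f(rank) for rank in range(1, len(ranks) + 1))
    return sum(f(rank) for rank in ranks) - minimum
```

For example, `rank_sum(ranking, truly_de, math.log)` uses a logarithmic F, so a gene displaced from rank 4 to rank 400 costs far less than one hundred times as much as a gene displaced from rank 4 to rank 8.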
The modified rank sum successfully determines how well a significance test has ranked the genes. But the score from the rank sum itself is not particularly useful information; it must be placed in a context. Since the goal of any significance test is ultimately to reduce the number of replicates necessary to attain accurate results, we convert the rank sum to a new measure that we call equivalent replicate gain. It is based on the equivalent replicate number, the number of replicates needed to reach the same rank sum using only a standard Cyber-ANOVA. Specifically, this is done by running Cyber-ANOVA on several data sets with parameters all identical except for the number of replicates. The graph of number of replicates vs. rank sum is drawn and a regression calculated. The clustering algorithm is then run and a rank sum obtained. That rank sum is entered into the inverse of the regression function to find the approximate number of replicates required to attain the same performance using only Cyber-ANOVA. The equivalent replicate gain is the difference between that number of replicates and the number of replicates actually used. Another useful measure is the equivalent replicate gain percentage, the equivalent replicate gain divided by the equivalent replicate number. This gives an approximation of the percent cost saving possible by using the algorithm and fewer replicates. The combination of the techniques of rank sum and equivalent replicate gain percentage has been very successful.
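The conversion from rank sum to equivalent replicate gain can be sketched as below. The form of the regression is not specified in the text; this example assumes, purely for illustration, that the logarithm of the rank sum falls linearly with the number of replicates and fits that line by ordinary least squares.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def equivalent_replicate_number(replicates, reference_rank_sums, observed_rank_sum):
    """Invert the regression of log(rank sum) on replicates (assumed model) to find
    how many replicates the reference test needs to match the observed rank sum."""
    slope, intercept = fit_line(replicates, [math.log(r) for r in reference_rank_sums])
    return (math.log(observed_rank_sum) - intercept) / slope

def gain_and_percentage(equivalent_number, replicates_used):
    """Equivalent replicate gain and gain percentage (gain / equivalent number)."""
    gain = equivalent_number - replicates_used
    return gain, 100.0 * gain / equivalent_number
```

As a worked instance of the last definition, an equivalent replicate number of 3.34 obtained with two replicates actually used corresponds to a gain of 1.34 replicates and a gain percentage of about 40.1.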
Results
We have generated a number of artificial data sets using a wide range of parameters and found the equivalent replicate gain percentage of the algorithm in many situations. The most important parameters were found to be the number of conditions, the number of cluster shapes, the number of replicates, the number of non-differentially expressed genes in the initial clustering, and the number of clusters omitted from the initial clustering.
In absolutely ideal situations, the algorithm can give an enormous benefit. An experiment with twenty conditions, one cluster shape, two replicates, and perfect initial clustering approaches such an ideal situation and gives excellent results (Table 1).
Number of Conditions   Equivalent Replicate Number   Equivalent Replicate Gain Percentage
6                      3.43                          [...]
20                     6.28                          [...]
Table 1. Performance of the clustering algorithm on data with two numbers of conditions, one cluster shape, two replicates, and perfect initial clustering.
A more realistic case would involve fewer conditions, say, six, and particularly with this smaller number of conditions, the number of cluster shapes becomes a factor in the performance of the algorithm. The performance benefit is still quite significant: recall that equivalent replicate gain percentage is an approximation of the cost saving due to fewer replicates. As expected, the performance benefit is reduced by additional cluster shapes (Table 2).
Number of Cluster Shapes   Equivalent Replicate Number   Equivalent Replicate Gain Percentage
Even when identical parameters are used to generate the data, the rank sum of any algorithm will fluctuate between data sets differing only by random number generation. Fortunately, this variation appears to affect Cyber-ANOVA and the clustering algorithm equally, for the equivalent replicate gain percentage does not change significantly (Table 3). This invariance has the beneficial effect of making it unnecessary to run multiple data sets for each parameter combination; more than one adds little information.
Data Set   Equivalent Replicate Number   Equivalent Replicate Gain Percentage
1          3.34                          [...]
2          3.47                          [...]
3          3.29                          [...]
4          3.25                          38.5
Table 3. Performance of the clustering algorithm on data generated randomly four times using identical parameters: six conditions, two replicates, two cluster shapes, and perfect initial clustering.
Change in the standard deviation or in the fold changes of the genes will certainly affect the rank sum from both Cyber-ANOVA and clustering. Over small changes, the equivalent replicate gain percentage is not seriously affected, but if the standard deviation becomes so large as to corrupt the initial clustering, the performance loss is significant (Table 4).
Standard Deviation (in multiples of the    Equivalent Replicate Number   Equivalent Replicate Gain Percentage
normal SD used elsewhere)
x1                                         3.34                          40.1
x2                                         3.14                          36.3
x4                                         2.63                          24.0
Table 4. Performance of the clustering algorithm on data with three levels of standard deviation, using six conditions, two replicates, two cluster shapes, and perfect initial clustering.
As with Cyber-ANOVA, and in fact all statistical tests, the equivalent replicate gain percentage of the algorithm decreases as the number of replicates becomes larger and the test becomes more accurate (Table 5). If the initial clustering is correct, there is no loss in performance, but in experiments with parameters similar to those of Table 5, the algorithm probably ceases to be worthwhile after four replicates.
Algorithm Tested   Number of Replicates   Equivalent Replicate Number   Equivalent Replicate Gain Percentage
Clustering         2                      2.84                          [...]
Clustering         4                      4.74                          [...]
Clustering         8                      8.12                          [...]
Cyber-ANOVA        2                      2.93                          [...]
Cyber-ANOVA        4                      5.23                          [...]
Cyber-ANOVA        8                      8.7[...]                      [...]
Table 5. Performance of the clustering algorithm on data with three numbers of replicates, using six conditions, twelve cluster shapes, and perfect initial clustering. The rate of decrease in performance at higher numbers of replicates is compared with that rate in Cyber-ANOVA. Note that for the clustering algorithm, the equivalent replicate gain percentage is expressed as a gain from Cyber-ANOVA, but that for Cyber-ANOVA, the equivalent replicate gain percentage is expressed as a gain from a standard ANOVA test. Thus, on the scale that Cyber-ANOVA is being compared to, the clustering algorithm result for a two-replicate data set is equivalent to the plain ANOVA of four replicates.
The most serious impact on the algorithm's performance occurs if the initial clustering is done incorrectly. The most difficult and important step is choosing the number of genes to include in the initial group; if either too many or too few are chosen, the performance of the algorithm will be diminished significantly. The ultimate goal is to choose a set of genes that includes all the clusters of differentially expressed genes without including any non-differentially expressed genes. This might be very difficult to do with experimental data, but fortunately the algorithm is robust to small departures from optimality (Table 6). The only time that the algorithm's performance drops below the performance of Cyber-ANOVA is when almost all of the clusters are left out.
Column headings: Number of Clusters Omitted; Expected Number of non-differentially expressed genes included; Equivalent Replicate Number; Equivalent Replicate Percentage.

Values in extraction order (row and column alignment not recoverable):
3 1 2 3
. .2 . .9 .24
1 9 3
1.7 1.93 2.11 2.12 2.34
-21.6 -3.3 .18 .84 14.7
1 1 .3 2.63 23.9 3 91 1 2. 21.4 4 3 23 2.37 1.8 496 9 2.9 4.33

Table 6. Performance of the clustering algorithm on data with several expected false positive rates for the initial clustering, using six conditions, two replicates, and twelve cluster shapes.
Discussion

When the correlation between distance and significance is strong, the potential cost saving of the algorithm is highly significant. It is interesting to note that for normal data with two replicates such as that in Table 5, the percent equivalent replicate gain between Cyber-ANOVA and Clustering is roughly equal to the performance gain between standard ANOVA and Cyber-ANOVA, on the order of 30%.
This performance degrades in some cases, but the same is true of Cyber-ANOVA and of other algorithms (3, 4). One situation that causes the percentage equivalent replicate gain of all these algorithms to decrease is a higher number of replicates. Unfortunately, the performance decreases more rapidly in the clustering algorithm. This phenomenon is perhaps best explained by a quality of information argument: additional replicates increase the quality of information of the P values, but have little effect on the quality of the information of the D values. Additional replicates will cause differentially expressed genes to cluster better, in a sense reducing the number of false negatives, but will not reduce the false positives, because non-differentially expressed genes are just as likely to randomly have low distances given any number of replicates. Thus, as the number of replicates increases, the D values gradually cause less improvement.
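The false positive half of this argument can be illustrated with a small simulation. The sketch below is not the paper's procedure: it assumes that the distance D is a shape-based measure (here 1 minus the Pearson correlation between a gene's replicate-averaged expression vector and a cluster shape), that null genes are pure Gaussian noise, and that a single hypothetical six-condition shape stands in for a cluster. All names, the shape, and the cutoff are illustrative assumptions.

```python
import random
import statistics

# Hypothetical cluster shape over six conditions (illustrative assumption).
SHAPE = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def null_low_distance_fraction(n_replicates, n_genes=5000, cutoff=0.5, seed=0):
    """Fraction of simulated null genes whose distance to SHAPE is below cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Average the replicates within each condition for one null gene.
        means = [
            statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n_replicates)])
            for _ in SHAPE
        ]
        if 1.0 - pearson(means, SHAPE) < cutoff:
            hits += 1
    return hits / n_genes


for n in (2, 4, 8):
    print(n, null_low_distance_fraction(n))
```

Because correlation is unaffected by the overall scale of the noise, averaging more replicates shrinks a null gene's expression vector without changing the distribution of its shape, so under these assumptions the fraction of null genes that fall close to a cluster stays roughly constant as replicates are added.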
The most serious degradation of the clustering algorithm's performance, however, occurs in a situation that does not affect the other algorithms: bad initial clustering. Reducing the severity of this problem will be the primary direction for further research on this algorithm. Some possible solutions include (1) clustering all the genes but weighting their significance in the clustering algorithm by P value; and (2) setting the number of genes to use as seeds by trying many values and finding the point above which new clusters stop appearing to form. Regardless of whether this issue can be resolved in the general case, if in some experiment it is strongly suspected (perhaps on biological grounds) that no clusters have been omitted from the initial set of genes, this clustering method can be used and safely expected to produce a substantial performance benefit.
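Proposal (2) above can be sketched as a simple plateau search. This is a minimal illustration under stated assumptions, not an implementation of the paper's method: it assumes the clustering step has already been run for several candidate seed counts and that the resulting cluster counts are available as a mapping. The function name, the `patience` parameter, and the example counts are all hypothetical.

```python
def seed_count_plateau(cluster_counts, patience=2):
    """Return the smallest seed count after which no new clusters appear.

    cluster_counts maps a candidate number of seed genes to the number of
    clusters found with that many seeds. A seed count qualifies when the
    next `patience` larger candidates all yield no more clusters than it
    does. Returns None if no candidate qualifies.
    """
    ordered = sorted(cluster_counts.items())
    for i, (n_seeds, n_clusters) in enumerate(ordered):
        later = ordered[i + 1 : i + 1 + patience]
        if len(later) == patience and all(c <= n_clusters for _, c in later):
            return n_seeds
    return None


# Made-up cluster counts for six candidate seed counts.
example = {50: 4, 100: 7, 150: 11, 200: 12, 250: 12, 300: 12}
print(seed_count_plateau(example))
```

A larger `patience` guards against stopping at a short flat stretch before further clusters emerge, at the cost of requiring more candidate seed counts to be clustered.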
References

1. Dudoit, S., Yang, Y., Callow, M.J., and Speed, T.P. (2000). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report #578, Stat Dept, UC-Berkeley.
2. Knudsen, S. (2002). A Biologist's Guide to Analysis of DNA Microarray Data. John Wiley & Sons.
3. Pan, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18(4):546-554.
4. Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5119-5121.
5. Long, D., Mangalam, H., Chan, B., Tolleri, L., Hatfield, G., and Baldi, P. Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Journal of Biological Chemistry 276(3):19937-19944.
6. Baldi, P., and Brunak, S. (1998). Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge MA.
7. Baldi, P., and Long, A.D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17:509-519.
8. Kreyszig, E. (1970). Mathematical Statistics: Principles and Methods. John Wiley & Sons, Inc.
9. Ewens, W., and [...] (2001). Statistical Methods in Bioinformatics: An Introduction. Springer-Verlag, NY.
10. Zhao, R., [...] (2000). Analysis of p53-regulated gene expression patterns using oligonucleotide arrays. Genes and Development 14(8):981-993.
11. Jiang, M., Ryu, J., Kiraly, M., Duke, K., Reinke, V., and Kim, S.K. [...]
12. [...] (1973). Pattern Classification and Scene Analysis. John Wiley and Sons.
13. Sharan, R., and Shamir, R. CLICK: a clustering algorithm with applications to gene expression analysis. In Proceedings of the 2000 Conference on Intelligent Systems for Molecular Biology (ISMB00), La Jolla, CA, 307-316.
14. Alon, U., Barkai, N., Notterman, D.A., [...] (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96:6745-6750.
15. Sherlock, G. Analysis of large-scale gene expression data. Curr Opin Immunol 2000 Apr;12(2):201-205.