Principal Component Analysis (Theory)

The goal is to reduce the dimension of a set of correlated variables to a smaller set of new, uncorrelated variables, while retaining as much of the variability of the original data as possible (rule of thumb: 80%).

Suppose there are 1000 variables. Working with all of them has two drawbacks:
1. The analysis is too complicated.
2. The results are hard to interpret.
The data therefore need to be reduced. The prerequisite is that the variables are strongly correlated.

Steps of PCA

Step 0. Test the correlation matrix to see whether the variables are closely correlated, using Bartlett's test:
$H_0: \rho = I$ (all off-diagonal elements are 0, i.e. the variables are uncorrelated)
$H_1: \rho \neq I$ (some off-diagonal elements differ from 0, i.e. the variables are correlated)

Bartlett's test:
$\chi^2_{calc} = -\left[(n - 1) - \frac{2p + 5}{6}\right] \ln |R|$
where n = number of observations, p = number of variables, R = (estimated) correlation matrix, and |R| = determinant of the correlation matrix. Reject $H_0$ if $\chi^2_{calc} > \chi^2_{table}$, with $p(p - 1)/2$ degrees of freedom.
Because we intend to use PCA, we hope to reject $H_0$. Rejection means that the original variables are correlated, so the goal of reducing the dimension of the data can be achieved.

Step 1. Find the eigenvalues of the covariance matrix (S) or of the correlation matrix (R). If the variables share the same unit, use the covariance matrix; if the units differ, use the correlation matrix.

Step 2. Sort the eigenvalues from largest to smallest: $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$.

Step 3. Build the new variables (principal components), which are linear combinations of the original variables. For each eigenvalue, take the corresponding normalized (orthonormal) eigenvector:
$Y_1 = e_1'X = e_{11}X_1 + e_{12}X_2 + \dots + e_{1p}X_p$
$Y_2 = e_2'X = e_{21}X_1 + e_{22}X_2 + \dots + e_{2p}X_p$
$\dots$
$Y_p = e_p'X = e_{p1}X_1 + e_{p2}X_2 + \dots + e_{pp}X_p$
Properties of the new variables: they are mutually uncorrelated and ordered by importance, $Y_1$ being the most important and $Y_p$ the least.

Step 4. Reduce the principal components that were formed. There are 3 ways:
1. Proportion of variance (each eigenvalue divided by the sum of the eigenvalues). If the proportion of variance of the first new variable is not yet sufficient, the second new variable is added; in that case the number of principal components retained is 2.
2. Eigenvalues greater than 1. The number of principal components is the number of eigenvalues greater than 1.
3. Scree plot. Look at where the curve changes from steep to flat and at the size of the eigenvalues. (A scree plot plots the eigenvalues against the component number.)

Step 5. Name the principal components that are kept after the reduction. There are 2 ways:
1. Correlation between each principal component and the original variables. The variables with large correlations characterize the component.
2. The weights. In $Y_1 = e_1'X = e_{11}X_1 + \dots + e_{1p}X_p$ the weights are the $e$ values. The component is characterized by the variables with the largest weights; if several weights are almost equal, the component is characterized by all of those variables.
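The steps above can be sketched in base R. This is only an illustration, not part of the original notes: the built-in USArrests data set, the 5% significance level and all variable names are assumptions made for the example.

```r
# Step 0: Bartlett's test that the correlation matrix is the identity.
X <- USArrests            # assumed example data (50 rows, 4 variables)
n <- nrow(X)
p <- ncol(X)
R <- cor(X)

chi2_calc  <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df         <- p * (p - 1) / 2
chi2_table <- qchisq(0.95, df)       # critical value at the assumed 5% level
reject_h0  <- chi2_calc > chi2_table

# Steps 1-2: eigenvalues of R (eigen() returns them in decreasing order).
eig    <- eigen(R)
lambda <- eig$values

# Step 3: components as linear combinations of the standardized variables.
scores <- scale(X) %*% eig$vectors

# Step 4: two of the three reduction rules.
prop     <- lambda / sum(lambda)
k_prop   <- which(cumsum(prop) >= 0.80)[1]   # 80% rule of thumb
k_kaiser <- sum(lambda > 1)                  # eigenvalues greater than 1
```

The two rules need not agree; on this data set the 80% rule keeps more components than the eigenvalue rule, which is why the scree plot and interpretability are usually consulted as well.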
5 functions to do Principal Components Analysis in R
Posted on June 17, 2012

Principal Component Analysis (PCA) is a multivariate technique that allows us to summarize the systematic patterns of variation in the data. From a data analysis standpoint, PCA is used for studying one table of observations and variables, with the main idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows us to reduce a "complex" data set to a lower dimension in order to reveal the structures or the dominant types of variation in both the observations and the variables.

PCA in R
In R, there are several functions from different packages that allow us to perform PCA. In this post I'll show you 5 different ways to do a PCA using the following functions (with their corresponding packages in parentheses):
• prcomp() (stats)
• princomp() (stats)
• PCA() (FactoMineR)
• dudi.pca() (ade4)
• acp() (amap)

Brief note: It is no coincidence that the three external packages ("FactoMineR", "ade4", and "amap") have been developed by French data analysts, who have a long tradition and preference for PCA and other related exploratory techniques.

No matter what function you decide to use, the typical PCA results should consist of a set of eigenvalues, a table with the scores or Principal Components (PCs), and a table of loadings (or correlations between variables and PCs). The eigenvalues provide information about the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) allow you to get a sense of the relationships between variables, as well as their associations with the extracted PCs.

The Data
To make things easier, we'll use the dataset USArrests that already comes with R. It's a data frame with 50 rows (USA states) and 4 columns containing information about violent crime rates by US state. Since most of the time the variables are measured in different scales, the PCA must be performed with standardized data (mean = 0, variance = 1). The good news is that all of the functions that perform PCA come with parameters to specify that the analysis must be applied on standardized data.
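To see what "standardized data" means in practice, here is a small base-R sketch. It is an illustration rather than part of the original post, and the object names are made up for it. It standardizes USArrests by hand, checks that this matches the scale. = TRUE argument of prcomp(), and converts the standard deviations into shares of variance.

```r
# Standardize by hand: each column gets mean 0 and variance 1.
z <- scale(USArrests)

pca_manual <- prcomp(z)                          # PCA on pre-standardized data
pca_scaled <- prcomp(USArrests, scale. = TRUE)   # let prcomp() do the scaling

# Both routes give the same component standard deviations.
same_sdev <- isTRUE(all.equal(pca_manual$sdev, pca_scaled$sdev))

# Share of the total variance carried by each component.
prop_var <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
round(prop_var, 3)
```

The squared standard deviations are the eigenvalues of the correlation matrix, so the shares sum to 1.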
Option 1: using prcomp()
The function prcomp() comes with the default "stats" package, which means that you don't have to install anything. It is perhaps the quickest way to do a PCA if you don't want to install other packages.

    # PCA with function prcomp
    pca1 = prcomp(USArrests, scale. = TRUE)

    # sqrt of eigenvalues
    pca1$sdev
    [1] 1.5749 0.9949 0.5971 0.4164

    # loadings
    head(pca1$rotation)
                 PC1     PC2     PC3      PC4
    Murder   -0.5359  0.4182 -0.3412  0.64923
    Assault  -0.5832  0.1880 -0.2681 -0.74341
    UrbanPop -0.2782 -0.8728 -0.3780  0.13388
    Rape     -0.5434 -0.1673  0.8178  0.08902

    # PCs (aka scores)
    head(pca1$x)
                   PC1     PC2      PC3       PC4
    Alabama    -0.9757  1.1220 -0.43980  0.154697
    Alaska     -1.9305  1.0624  2.01950 -0.434175
    Arizona    -1.7454 -0.7385  0.05423 -0.826264
    Arkansas    0.1400  1.1085  0.11342 -0.180974
    California -2.4986 -1.5274  0.59254 -0.338559
    Colorado   -1.4993 -0.9776  1.08400  0.001450
Option 2: using princomp()
The function princomp() also comes with the default "stats" package, and it is very similar to its cousin prcomp(). What I don't like about princomp() is that sometimes it won't display all the values for the loadings, but this is a minor detail.

    # PCA with function princomp
    pca2 = princomp(USArrests, cor = TRUE)

    # sqrt of eigenvalues
    pca2$sdev
    Comp.1 Comp.2 Comp.3 Comp.4
    1.5749 0.9949 0.5971 0.4164

    # loadings
    unclass(pca2$loadings)
              Comp.1  Comp.2  Comp.3   Comp.4
    Murder   -0.5359  0.4182 -0.3412  0.64923
    Assault  -0.5832  0.1880 -0.2681 -0.74341
    UrbanPop -0.2782 -0.8728 -0.3780  0.13388
    Rape     -0.5434 -0.1673  0.8178  0.08902

    # PCs (aka scores)
    head(pca2$scores)
                Comp.1  Comp.2   Comp.3    Comp.4
    Alabama    -0.9856  1.1334 -0.44427  0.156267
    Alaska     -1.9501  1.0732  2.04000 -0.438583
    Arizona    -1.7632 -0.7460  0.05478 -0.834653
    Arkansas    0.1414  1.1198  0.11457 -0.182811
    California -2.5240 -1.5429  0.59856 -0.341996
    Colorado   -1.5146 -0.9876  1.09501  0.001465
Option 3: using PCA()
A highly recommended option, especially if you want more detailed results and assessing tools, is the PCA() function from the package "FactoMineR". It is by far the best PCA function in R and it comes with a number of parameters that allow you to tweak the analysis in a very nice way.

    # PCA with function PCA
    library(FactoMineR)

    # apply PCA
    pca3 = PCA(USArrests, graph = FALSE)

    # matrix with eigenvalues
    pca3$eig
           eigenvalue percentage of variance cumulative percentage of variance
    comp 1     2.4802                 62.006                             62.01
    comp 2     0.9898                 24.744                             86.75
    comp 3     0.3566                  8.914                             95.66
    comp 4     0.1734                  4.336                            100.00

    # correlations between variables and PCs
    pca3$var$coord
              Dim.1   Dim.2   Dim.3    Dim.4
    Murder   0.8440 -0.4160  0.2038  0.27037
    Assault  0.9184 -0.1870  0.1601 -0.30959
    UrbanPop 0.4381  0.8683  0.2257  0.05575
    Rape     0.8558  0.1665 -0.4883  0.03707

    # PCs (aka scores)
    head(pca3$ind$coord)
                 Dim.1   Dim.2    Dim.3     Dim.4
    Alabama     0.9856 -1.1334  0.44427  0.156267
    Alaska      1.9501 -1.0732 -2.04000 -0.438583
    Arizona     1.7632  0.7460 -0.05478 -0.834653
    Arkansas   -0.1414 -1.1198 -0.11457 -0.182811
    California  2.5240  1.5429 -0.59856 -0.341996
    Colorado    1.5146  0.9876 -1.09501  0.001465
Option 4: using dudi.pca()
Another option is to use the dudi.pca() function from the package "ade4", which has a huge amount of other methods as well as some interesting graphics.

    # PCA with function dudi.pca
    library(ade4)

    # apply PCA
    pca4 = dudi.pca(USArrests, nf = 5, scannf = FALSE)

    # eigenvalues
    pca4$eig
    [1] 2.4802 0.9898 0.3566 0.1734

    # loadings
    pca4$c1
                 CS1     CS2     CS3      CS4
    Murder   -0.5359  0.4182 -0.3412  0.64923
    Assault  -0.5832  0.1880 -0.2681 -0.74341
    UrbanPop -0.2782 -0.8728 -0.3780  0.13388
    Rape     -0.5434 -0.1673  0.8178  0.08902

    # correlations between variables and PCs
    pca4$co
               Comp1   Comp2   Comp3    Comp4
    Murder   -0.8440  0.4160 -0.2038  0.27037
    Assault  -0.9184  0.1870 -0.1601 -0.30959
    UrbanPop -0.4381 -0.8683 -0.2257  0.05575
    Rape     -0.8558 -0.1665  0.4883  0.03707

    # PCs
    head(pca4$li)
                 Axis1   Axis2    Axis3     Axis4
    Alabama    -0.9856  1.1334 -0.44427  0.156267
    Alaska     -1.9501  1.0732  2.04000 -0.438583
    Arizona    -1.7632 -0.7460  0.05478 -0.834653
    Arkansas    0.1414  1.1198  0.11457 -0.182811
    California -2.5240 -1.5429  0.59856 -0.341996
    Colorado   -1.5146 -0.9876  1.09501  0.001465
Option 5: using acp()
A fifth possibility is the acp() function from the package "amap".

    # PCA with function acp
    library(amap)

    # apply PCA
    pca5 = acp(USArrests)

    # sqrt of eigenvalues
    pca5$sdev
    Comp 1 Comp 2 Comp 3 Comp 4
    1.5749 0.9949 0.5971 0.4164

    # loadings
    pca5$loadings
             Comp 1  Comp 2  Comp 3   Comp 4
    Murder   0.5359  0.4182 -0.3412  0.64923
    Assault  0.5832  0.1880 -0.2681 -0.74341
    UrbanPop 0.2782 -0.8728 -0.3780  0.13388
    Rape     0.5434 -0.1673  0.8178  0.08902

    # scores
    head(pca5$scores)
                Comp 1  Comp 2   Comp 3    Comp 4
    Alabama     0.9757  1.1220 -0.43980  0.154697
    Alaska      1.9305  1.0624  2.01950 -0.434175
    Arizona     1.7454 -0.7385  0.05423 -0.826264
    Arkansas   -0.1400  1.1085  0.11342 -0.180974
    California  2.4986 -1.5274  0.59254 -0.338559
    Colorado    1.4993 -0.9776  1.08400  0.001450

Of course these are not the only options to do a PCA, but I'll leave the other approaches for another post.
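The outputs above agree on the eigenvalues but differ in the signs of some columns and slightly in the size of the scores. A short base-R sketch of where the size difference comes from, using only the two "stats" functions; the object names are illustrative and not from the original post:

```r
p1 <- prcomp(USArrests, scale. = TRUE)
p2 <- princomp(USArrests, cor = TRUE)
n  <- nrow(USArrests)

# Same eigenvalues, hence the same component standard deviations.
same_sdev <- isTRUE(all.equal(unname(p1$sdev), unname(p2$sdev)))

# The scores agree up to the sign of each column and a constant factor:
# princomp() standardizes with divisor n, prcomp() with divisor n - 1.
ratio <- abs(p2$scores) / abs(p1$x)
range(ratio)
sqrt(n / (n - 1))
```

The sign of a component is arbitrary, so a flipped column describes the same axis; only the interpretation of "high" and "low" scores is mirrored.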
PCA plots
Everybody uses PCA to visualize the data, and most of the discussed functions come with their own plot functions. But you can also make use of the great graphical displays of "ggplot2". Just to show you a couple of plots, let's take the basic results from prcomp().

Plot of observations

    # load ggplot2
    library(ggplot2)

    # create data frame with scores
    scores = as.data.frame(pca1$x)

    # plot of observations
    ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
      geom_hline(yintercept = 0, colour = "gray65") +
      geom_vline(xintercept = 0, colour = "gray65") +
      geom_text(colour = "tomato", alpha = 0.8, size = 4) +
      ggtitle("PCA plot of USA States - Crime Rates")

Circle of correlations

    # function to create a circle
    circle <- function(center = c(0, 0), npoints = 100) {
      r = 1
      tt = seq(0, 2 * pi, length = npoints)
      xx = center[1] + r * cos(tt)
      yy = center[2] + r * sin(tt)
      return(data.frame(x = xx, y = yy))
    }
    corcir = circle(c(0, 0), npoints = 100)

    # create data frame with correlations between variables and PCs
    correlations = as.data.frame(cor(USArrests, pca1$x))

    # data frame with arrows coordinates
    arrows = data.frame(x1 = c(0, 0, 0, 0), y1 = c(0, 0, 0, 0),
                        x2 = correlations$PC1, y2 = correlations$PC2)

    # geom_path will do open circles
    ggplot() +
      geom_path(data = corcir, aes(x = x, y = y), colour = "gray65") +
      geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "gray65") +
      geom_text(data = correlations, aes(x = PC1, y = PC2, label = rownames(correlations))) +
      geom_hline(yintercept = 0, colour = "gray65") +
      geom_vline(xintercept = 0, colour = "gray65") +
      xlim(-1.1, 1.1) + ylim(-1.1, 1.1) +
      labs(x = "pc1 axis", y = "pc2 axis") +
      ggtitle("Circle of correlations")
© Gaston Sanchez. All contents under (CC) BY-NC-SA license, unless otherwise noted.
Principal Components and Factor Analysis
This section covers principal components and factor analysis. The latter includes both exploratory and confirmatory methods.

Principal Components
The princomp( ) function produces an unrotated principal component analysis.

    # Principal Components Analysis
    # entering raw data and extracting PCs
    # from the correlation matrix
    fit <- princomp(mydata, cor=TRUE)
    summary(fit) # print variance accounted for
    loadings(fit) # pc loadings
    plot(fit,type="lines") # scree plot
    fit$scores # the principal components
    biplot(fit)

Use cor=FALSE to base the principal components on the covariance matrix. Use the covmat= option to enter a correlation or covariance matrix directly. If entering a covariance matrix, include the option n.obs=.
The principal( ) function in the psych package can be used to extract and rotate principal components.
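As a small illustration of the covmat= option, the sketch below runs princomp( ) once on raw data and once on a correlation matrix alone. The USArrests data set stands in for mydata, and the object names are assumptions made for this example.

```r
R <- cor(USArrests)                      # correlation matrix only

fit_raw <- princomp(USArrests, cor = TRUE)   # from raw data
fit_mat <- princomp(covmat = R)              # from the matrix, no raw data

# The component standard deviations are identical ...
same_sdev <- isTRUE(all.equal(unname(fit_raw$sdev), unname(fit_mat$sdev)))

# ... but scores cannot be computed without the observations.
has_scores <- !is.null(fit_mat$scores)
```

This is handy when only a published correlation table is available, at the cost of losing the per-observation scores.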
    # Varimax Rotated Principal Components
    # retaining 5 components
    library(psych)
    fit <- principal(mydata, nfactors=5, rotate="varimax")
    fit # print results

mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used. rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".
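For readers without the psych package, the idea of rotating retained components can be sketched with varimax( ) from the base "stats" package. This is a stand-in illustration, not the principal( ) function itself; USArrests, the choice of two components and the object names are all assumptions.

```r
fit <- princomp(USArrests, cor = TRUE)
k <- 2                                   # number of components kept (assumed)

# Scale the unit-length eigenvectors by the component standard deviations
# so that the entries are variable-component correlations.
L <- unclass(loadings(fit))[, 1:k] %*% diag(fit$sdev[1:k])

rot     <- varimax(L)                    # orthogonal varimax rotation
rotated <- unclass(rot$loadings)

# An orthogonal rotation redistributes variance between components
# but leaves each variable's communality unchanged.
communality_before <- rowSums(L^2)
communality_after  <- rowSums(rotated^2)
```

Rotation changes how the explained variance is split between the retained components, not how much of each variable they explain together.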
Exploratory Factor Analysis
The factanal( ) function produces maximum likelihood factor analysis.

    # Maximum Likelihood Factor Analysis
    # entering raw data and extracting 3 factors,
    # with varimax rotation
    fit <- factanal(mydata, 3, rotation="varimax")
    print(fit, digits=2, cutoff=.3, sort=TRUE)
    # plot factor 1 by factor 2
    load <- fit$loadings[,1:2]
    plot(load,type="n") # set up plot
    text(load,labels=names(mydata),cex=.7) # add variable names

The rotation= options include "varimax", "promax", and "none". Add the option scores="regression" or "Bartlett" to produce factor scores. Use the covmat= option to enter a correlation or covariance matrix directly. If entering a covariance matrix, include the option n.obs=.
The factor.pa( ) function in the psych package offers a number of factor analysis related functions, including principal axis factoring. In recent releases of psych, factor.pa( ) has been superseded by fa( ) with fm="pa".

    # Principal Axis Factor Analysis
    library(psych)
    fit <- factor.pa(mydata, nfactors=3, rotation="varimax")
    fit # print results

mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".
Determining the Number of Factors to Extract
A crucial decision in exploratory factor analysis is how many factors to extract. The nFactors package offers a suite of functions to aid in this decision. Details on this methodology can be found in a PowerPoint presentation by Raiche, Riopel, and Blais. Of course, any factor solution must be interpretable to be useful.

    # Determine Number of Factors to Extract
    library(nFactors)
    ev <- eigen(cor(mydata)) # get eigenvalues
    ap <- parallel(subject=nrow(mydata),var=ncol(mydata),
      rep=100,cent=.05)
    nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
    plotnScree(nS)
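The idea behind parallel analysis can be sketched in base R without nFactors. This is a simplified illustration and not the algorithm used by parallel( ): it compares the observed eigenvalues with the average eigenvalues of pure-noise data of the same shape. USArrests stands in for mydata, and the seed and the number of replications are arbitrary assumptions.

```r
set.seed(1)                               # arbitrary seed for reproducibility
x <- USArrests                            # stands in for mydata
n <- nrow(x)
p <- ncol(x)

ev <- eigen(cor(x))$values                # observed eigenvalues

# Eigenvalues of correlation matrices of uncorrelated normal noise.
sim <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
ev_random <- rowMeans(sim)

# Keep the components whose eigenvalue exceeds what noise alone produces.
n_parallel <- sum(ev > ev_random)
```

A component is worth keeping, under this rule, only if it explains more variance than a component extracted from data with no structure at all.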
Going Further
The FactoMineR package offers a large number of additional functions for exploratory factor analysis. This includes the use of both quantitative and qualitative variables, as well as the inclusion of supplementary variables and observations. Here is an example of the types of graphs that you can create with this package.

    # PCA Variable Factor Map
    library(FactoMineR)
    result <- PCA(mydata) # graphs generated automatically

The GPArotation package offers a wealth of rotation options beyond varimax and promax.
Principal Component Analysis (PCA)

Introduction
Principal Component Analysis (PCA) is a powerful tool when you have many variables and you want to look into things that these variables can explain. As the name suggests, PCA finds the combination of your variables which explains the phenomena. In this sense, PCA is useful when you want to reduce the number of the variables. One common scenario of PCA is that you have n variables and you want to combine them and make them 3 or 4 variables without losing much of the information that the original data have. More mathematically, PCA is trying to find some linear projections of your data which preserve the information your data have.
PCA is one of the methods you may want to try if you have lots of Likert data and try to understand what these data tell you. Let's say we asked the participants four 7-scale Likert questions about what they care about when choosing a new computer, and got the results like this.

Participant  Price  Software  Aesthetics  Brand
P1           6      5         3           4
P2           7      3         2           2
P3           6      4         4           5
P4           5      7         1           3
P5           7      7         5           5
P6           6      4         2           3
P7           5      7         2           1
P8           6      5         4           4
P9           3      5         6           7
P10          1      3         7           5
P11          2      6         6           7
P12          5      7         7           6
P13          2      4         5           6
P14          3      5         6           5
P15          1      6         5           5
P16          2      3         7           7

Price: A new computer is cheap to you (1: strongly disagree - 7: strongly agree),
Software: The OS on a new computer allows you to use software you want to use (1: strongly disagree - 7: strongly agree),
Aesthetics: The appearance of a new computer is appealing to you (1: strongly disagree - 7: strongly agree),
Brand: The brand of the OS on a new computer is appealing to you (1: strongly disagree - 7: strongly agree)

Now what you want to know is what combination of these four variables can explain the phenomena you observed. I will explain this with the example R code.
R code example
Let's prepare the same data shown in the table above.

    Price <- c(6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2)
    Software <- c(5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3)
    Aesthetics <- c(3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7)
    Brand <- c(4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7)
    data <- data.frame(Price, Software, Aesthetics, Brand)

At this point, data looks pretty much the same as the table above. Now, we do PCA. In R, there are two functions for PCA: prcomp() and princomp(). prcomp() works from a singular value decomposition of the centered (and optionally scaled) data, whereas princomp() works from an eigendecomposition of the covariance matrix, or of the correlation matrix when cor=TRUE. The results are similar in most cases, and the results gained from princomp() have nice features, so here I use princomp().

    pca <- princomp(data, cor=TRUE)
    summary(pca, loadings=TRUE)

And here is the result of the PCA.

    Importance of components:
                              Comp.1    Comp.2    Comp.3     Comp.4
    Standard deviation     1.5589391 0.9804092 0.6816673 0.37925777
    Proportion of Variance 0.6075727 0.2403006 0.1161676 0.03595911
    Cumulative Proportion  0.6075727 0.8478733 0.9640409 1.00000000

    Loadings:
               Comp.1 Comp.2 Comp.3 Comp.4
    Price      -0.523         0.848
    Software   -0.177  0.977 -0.120
    Aesthetics  0.597  0.134  0.295 -0.734
    Brand       0.583  0.167  0.423  0.674

I will explain how to interpret this result in the next section.
Interpretation of the results of PCA
Let's take a look at the table for loadings, which means the coefficients for the "new" variables.

            Comp.1  Comp.2  Comp.3  Comp.4
Price       -0.523           0.848
Software    -0.177   0.977  -0.120
Aesthetics   0.597   0.134   0.295  -0.734
Brand        0.583   0.167   0.423   0.674

From the second table (loadings), PCA found four new variables which can explain the same information as the original four variables (Price, Software, Aesthetics, and Brand), which are Comp.1 to Comp.4. And Comp.1 is calculated as follows:

    Comp.1 = -0.523 * Price - 0.177 * Software + 0.597 * Aesthetics + 0.583 * Brand

Thus, PCA successfully found a new combination of the variables, which is good. The next thing we want to know is how much each of the new variables has a power to explain the information that the original data have. For this, you need to look at Standard deviation and Cumulative Proportion (of Variance) in the result.

                       Comp.1  Comp.2  Comp.3  Comp.4
Standard deviation     1.56    0.98    0.68    0.38
Cumulative Proportion  0.61    0.85    0.96    1.00

Standard deviation means the standard deviation of the new variables. PCA calculates the combination of the variables such that the new variables have a large standard deviation. Thus, generally a larger standard deviation means a better variable. A heuristic is that we take all the new variables whose standard deviations are roughly over 1.0 (so we will take Comp.1 and Comp.2).
Another way to determine how many new variables we want to take is to look at the cumulative proportion of variance. This means how much of the information that the original data have can be described by the combination of the new variables. For instance, with only Comp.1 we can describe 61% of the information the original data have. If we use Comp.1 and Comp.2, we can describe 85% of it. Generally, 80% is considered as the percentage which describes the data well. So, in this example, we can take Comp.1 and Comp.2, and ignore Comp.3 and Comp.4. In this manner, we can decrease the number of the variables (in this example, from 4 variables to 2 variables).
Your next task is to understand what the new variable means in the context of your data. As we have seen, the first new variable can be calculated as follows:

    Comp.1 = -0.523 * Price - 0.177 * Software + 0.597 * Aesthetics + 0.583 * Brand

It is a very good idea to plot the data to see what this new variable means. You can use scores to take the values of each variable modeled by PCA.

    plot(pca$scores[,1])
    barplot(pca$scores[,1])

With the graphs (sorry, I was kinda lazy to upload the graph, but you can quickly generate it by yourself), you can see Participants 1 - 8 get negative values, and the other participants get positive values. It seems that this new variable indicates whether a user cares about Price and Software or Aesthetics and Brand for her computer. So we probably can name this variable as a "Feature/Fashion index" or something. There is no definitive answer for this part of PCA. You need to go through your data and make sense of what the new variables mean by yourself.
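The Comp.1 formula can be checked by hand. One detail the formula leaves implicit is that, with cor=TRUE, the weights apply to the standardized answers and not to the raw 1 - 7 values. The sketch below is an illustration with made-up helper names (sd_n, z, w, comp1_manual). Note also that the overall sign of a component is arbitrary and may differ between R versions, so the check compares the values and the grouping of participants, not a particular sign.

```r
Price      <- c(6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2)
Software   <- c(5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3)
Aesthetics <- c(3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7)
Brand      <- c(4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7)
data <- data.frame(Price, Software, Aesthetics, Brand)
pca  <- princomp(data, cor = TRUE)

# princomp() standardizes with the divisor n (not n - 1).
n    <- nrow(data)
sd_n <- apply(data, 2, sd) * sqrt((n - 1) / n)
z    <- scale(data, center = TRUE, scale = sd_n)

# Comp.1 score = weighted sum of the standardized answers.
w            <- unclass(loadings(pca))[, 1]
comp1_manual <- as.vector(z %*% w)

# Participants 1-8 fall on one side of zero, participants 9-16 on the other.
split_by_sign <- split(seq_len(n), sign(comp1_manual))
```

The split into the first eight and the last eight participants is what suggests reading Comp.1 as a contrast between Price/Software and Aesthetics/Brand.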
PCA and Logistic regression
Once you have done the analysis with PCA, you may want to look into whether the new variables can predict some phenomena well. This is kinda like machine learning: whether features can classify the data well. Let's say you have asked the participants one more thing in your survey, which OS they are using (Windows or Mac), and the results are like this.

Participant  Price  Software  Aesthetics  Brand  OS
P1           6      5         3           4      0
P2           7      3         2           2      0
P3           6      4         4           5      0
P4           5      7         1           3      0
P5           7      7         5           5      1
P6           6      4         2           3      0
P7           5      7         2           1      0
P8           6      5         4           4      0
P9           3      5         6           7      1
P10          1      3         7           5      1
P11          2      6         6           7      0
P12          5      7         7           6      1
P13          2      4         5           6      1
P14          3      5         6           5      1
P15          1      6         5           5      1
P16          2      3         7           7      1

Here what we are going to do is to see whether the new variables given by PCA can predict the OS people are using. OS is 0 or 1 in our case, which means the dependent variable is binomial. Thus, we are going to do logistic regression. I will skip the details of logistic regression here. If you are interested, the details of logistic regression are available in a separate page.
First, we prepare the data about OS.

    OS <- c(0,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1)

Then, fit the first variable we found through PCA (i.e., Comp.1) to a logistic function.

    model <- glm(OS ~ pca$scores[,1], family=binomial)
    summary(model)

Now you get the logistic function model.

    Call:
    glm(formula = OS ~ pca$scores[, 1], family = binomial)

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -2.19746  -0.44586   0.01932   0.60018   1.65268

    Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
    (Intercept)     -0.08371    0.74216  -0.113   0.9102
    pca$scores[, 1]  1.42973    0.62129   2.301   0.0214 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 22.181  on 15  degrees of freedom
    Residual deviance: 12.033  on 14  degrees of freedom
    AIC: 16.033

    Number of Fisher Scoring iterations: 5

Let's see how well this model predicts the kind of OS. You can use the fitted() function to see the prediction (values shown here rounded to three decimals).

    fitted(model)
        1     2     3     4     5     6     7     8
    0.152 0.042 0.350 0.044 0.255 0.078 0.026 0.217
        9    10    11    12    13    14    15    16
    0.894 0.936 0.911 0.734 0.852 0.763 0.781 0.964

These values represent the probabilities of being 1. For example, we can expect a 15% chance that Participant 1 is using OS 1 based on the variable derived by PCA. Thus, in this case, Participant 1 is more likely to be using OS 0, which agrees with the survey response. In this way, PCA can be used with regression models for calculating the probability of a phenomenon or making a prediction.
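To turn the fitted probabilities into a yes/no prediction, one common choice is a 0.5 cut-off. The sketch below is an illustration: the threshold and the names comp1, pred, confusion and accuracy are assumptions, not part of the original analysis. Because it scores the same 16 participants the model was fitted on, the accuracy is an optimistic, in-sample figure.

```r
Price      <- c(6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2)
Software   <- c(5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3)
Aesthetics <- c(3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7)
Brand      <- c(4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7)
OS         <- c(0,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1)
data <- data.frame(Price, Software, Aesthetics, Brand)

pca   <- princomp(data, cor = TRUE)
comp1 <- pca$scores[, 1]
model <- glm(OS ~ comp1, family = binomial)

# Classify as OS 1 when the fitted probability exceeds the assumed 0.5 cut-off.
pred      <- as.integer(fitted(model) > 0.5)
confusion <- table(observed = OS, predicted = pred)
accuracy  <- mean(pred == OS)

confusion
which(pred != OS)    # participants the model gets wrong
```

With only 16 observations and one predictor, a result like this says the component separates the two groups reasonably well; it does not say how the model would perform on new participants.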
Factor Analysis

Introduction
Factor Analysis is another powerful tool to understand what your data mean, particularly when you have many variables. What Factor Analysis does is to try to find hidden variables which explain the behavior of your observed variables. Our interests here also lie in reducing the number of variables. So we hope that we can find a smaller number of new variables which explain your data well. In this sense, it sounds very similar to PCA. Although the outcome is very similar in terms of reducing the number of variables, the approach to reduce the number of variables is different. I will explain this in the next section.
If you are a little more knowledgeable, you may have heard of terms like Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). EFA means that you don't really know what hidden variables (or factors) exist and how many there are, so you are trying to find them. CFA means that you already have some guesses or models for your hidden variables (or factors) and you want to check whether your models are correct. In many cases, your Factor Analysis is EFA, and I explain it in this page.
We are going to use an example similar to the one in PCA. Let's say you have some data like this from your survey about what is important when people decide which computer to buy.

Participant  Price  Software  Aesthetics  Brand  Friend  Family
P1           6      5         3           4      7       6
P2           7      3         2           2      2       3
P3           6      4         4           5      5       4
P4           5      7         1           3      6       7
P5           7      7         5           5      2       1
P6           6      4         2           3      4       5
P7           5      7         2           1      1       4
P8           6      5         4           4      7       5
P9           3      5         6           7      3       4
P10          1      3         7           5      2       4
P11          2      6         6           7      6       5
P12          5      7         7           6      7       7
P13          2      4         5           6      6       2
P14          3      5         6           5      2       3
P15          1      6         5           5      4       5
P16          2      3         7           7      5       6

Price: A new computer is cheap to you (1: strongly disagree - 7: strongly agree),
Software: The OS on a new computer allows you to use software you want to use (1: strongly disagree - 7: strongly agree),
Aesthetics: The appearance of a new computer is appealing to you (1: strongly disagree - 7: strongly agree),
Brand: The brand of the OS on a new computer is appealing to you (1: strongly disagree - 7: strongly agree),
Friend: Your friends' opinions are important to you (1: strongly disagree - 7: strongly agree),
Family: Your family's opinions are important to you (1: strongly disagree - 7: strongly agree).

For successfully doing Factor Analysis, we need more data than this example. If you want to find n factors, you want to have roughly 3n - 4n dimensions of data, and 5n - 10n samples. And Factor Analysis assumes the normality of the data, so it is not a great tool for ordinal data. However, in practice, we can use Factor Analysis on ordinal data if the scale is 5 or more and the data can be treated as interval data.
Through Factor Analysis, you want to find hidden variables (common factors) which may explain the responses you gained. Before looking at how to do Factor Analysis in R, I would like to briefly explain the difference between PCA and FA.
Difference between Factor Analysis and PCA
The intuition of Principal Component Analysis is to find new combinations of variables which form larger variances. Why are larger variances important? This is a similar concept to entropy in information theory. Let's say you have two variables. One of them (Var 1) forms N(1, 0.01) and the other (Var 2) forms N(1, 1). Which variable do you think has more information? Var 1 is always pretty much 1, whereas Var 2 can take a wider range of values, like 0 or 2. Thus, Var 2 has more chances to take various values than Var 1, which means Var 2's entropy is larger than Var 1's. Thus, we can say Var 2 contains more information than Var 1.
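The Var 1 / Var 2 intuition can be tried out with simulated data. This is a sketch only: the sample size, the seed and the variable names are assumptions. N(1, 0.01) is read here as variance 0.01, i.e. standard deviation 0.1.

```r
set.seed(42)                                # arbitrary seed
var1 <- rnorm(1000, mean = 1, sd = 0.1)     # N(1, 0.01): nearly constant
var2 <- rnorm(1000, mean = 1, sd = 1)       # N(1, 1): spreads widely

# Unscaled PCA, so the raw variances decide the directions.
p <- prcomp(cbind(var1, var2))

w     <- p$rotation[, "PC1"]                # weights of the first component
share <- p$sdev^2 / sum(p$sdev^2)           # variance share per component
```

The first component is almost entirely Var 2 and carries nearly all of the variance, which is the sense in which PCA treats the high-variance variable as the informative one. With scale. = TRUE both variables would be given unit variance and this effect would disappear.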
Although the example above just looks at one variable at a time, PCA tries to find linear combinations of the variables which contain much information by looking at the variance. This is why the standard deviation is one of the important metrics to determine the number of new variables in PCA. Another interesting aspect of the new variables derived by PCA is that all new variables are orthogonal. You can think that PCA is rotating and translating the data such that the first axis contains the most information, the second has the second most information, and so forth.
The intuition of Factor Analysis is to find hidden variables which affect your observed variables by looking at the correlation. If one variable is correlated with another variable, we can say that these two variables are generated from one hidden variable, so we can explain the phenomena with that one hidden variable instead of the two variables. Let's take a look at the correlation matrix of the data we have (see the code example below to create the data frame) before doing Factor Analysis.

    cor(data)

And you get the correlation matrix.

                     Price   Software  Aesthetics       Brand     Friend      Family
    Price       1.00000000  0.1856123 -0.63202219 -0.58026680 0.03082006 -0.06183118
    Software    0.18561230  1.0000000 -0.14621516 -0.11858645 0.10096774  0.17657236
    Aesthetics -0.63202219 -0.1462152  1.00000000  0.85285436 0.03989799 -0.06977360
    Brand      -0.58026680 -0.1185864  0.85285436  1.00000000 0.33316719  0.02662389
    Friend      0.03082006  0.1009677  0.03989799  0.33316719 1.00000000  0.60727189
    Family     -0.06183118  0.1765724 -0.06977360  0.02662389 0.60727189  1.00000000

So it looks like Price has strong negative correlations with Aesthetics and Brand, and Friend has a strong correlation with Family. This means that we can expect that we will have two common factors: one will be related to Price, Aesthetics, and Brand, and the other will be related to Friend and Family. Let's move on to Factor Analysis and see what will happen.
R code e!ample En the @ollo/ing ode e9ample E skipped some details suh as using varima9 rotation or proma9 rotation ( uses varima9 rotation by de@ault). E@ you /ant to kno/ more details E reommend you to read other books or re@erenes @or no/. E may add these details later but not sure; Nirst /e prepare the data.
Price <- c(>,7,>,*,7,>,*,>,/,3,0,*,0,/,3,0 oftware ,7,?,*,>,/ Aesthetics <- c(/,0,?,3,*,0,0,?,>,7,>,7,*,>,*,7 Drand <- c(?,0,*,/,*,/,3,?,7,*,7,>,>,*,*,7 .riend ,0,?,3,7,/,0,>,7,>,0,?,* .amily <- c(>,/,?,7,3,*,?,*,?,?,*,7,0,/,*,> data
fa <- factanal(data, factor=3 $nd you get the result.
Call5 factanal(x = data, factors = 3 !ni$enesses5 Price oftware Aesthetics Drand .riend .amily ;2*>7 ;277 ;230> ;23>7 ;27? 32;;; oadings5 .actor3 Price -;2>*@ oftware -;23*0 Aesthetics ;2/* Drand ;230 .riend ;23>3 .amily .actor3 loadings 023; Proportion )ar ;2/>* Test of the hypothesis that 3 factor is s$Hcient2 The chi s$are statistic is 3027 on degrees of freedom2 The p-%al$e is ;2370 -ere the @ator analysis is doing a null hypothesis test in /hih the null hypothesis is that the model desribed by the @ator /e have @ound predits the data /ell. "o /e have the hiDsIuare goodnessDo@D@it /hih is 12.8 and the p value is 0.1B. This means /e annot rejet the null hypothesis so the @ator predits the data /ell @rom the statistis perspetive. This is /hy the result says LTest o@ the hypothesis that 1 @ator is su@@iient. #etHs take a look at the Nator $nalysis /ith t/o @ators.
fa <- factanal(data, factor=0 Call5 factanal(x = data, factors = 0 !ni$enesses5 Price oftware Aesthetics Drand .riend .amily ;2** ;2>; ;230> ;2;@; ;2;;* ;2>; oadings5 .actor3 .actor0 Price -;2>*7 oftware -;23>3 ;233 Aesthetics ;2// Drand ;20@ ;20?0 .riend ;23;; ;20 .amily ;2>0; .actor3 .actor0 loadings 020;7 32?*/ Proportion )ar ;2/>@ ;20?0 C$m$lati%e )ar ;2/>@ ;2>3; Test of the hypothesis that 0 factors are s$Hcient2 The chi s$are statistic is 023> on ? degrees of freedom2 The p-%al$e is ;27;> The p value gets larger and the umulative portion o@ variane beomes 0.C1 (/ith one variable it is 0.5B). "o the model seems to be improved. #oadings sho/s the /eights to alulate the hidden variables @rom the observed variables.
But obviously, the model gets improved if you have more variables, which shows the trade-off between the number of variables and the accuracy of the model. So, how should we decide how many factors we should pick up? This is the topic for the next section.
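To make the improvement concrete, the proportion of variance can be recomputed from the "SS loadings" rows of the two outputs: each factor's sum of squared loadings divided by the number of observed variables (6 here). This is only a sketch that re-uses the printed values (2.190 for the one-factor model, 2.207 and 1.453 for the two-factor model); the variable names are illustrative.

```r
p <- 6  # number of observed variables

# Sums of squared loadings as printed by factanal
ss_one <- c(Factor1 = 2.190)
ss_two <- c(Factor1 = 2.207, Factor2 = 1.453)

# Cumulative proportion of variance explained by each model
cum_one <- cumsum(ss_one / p)
cum_two <- cumsum(ss_two / p)

# How much the second factor adds
gain <- cum_two[["Factor2"]] - cum_one[["Factor1"]]

round(c(one_factor = cum_one[["Factor1"]],
        two_factors = cum_two[["Factor2"]],
        gain = gain), 3)
```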
How many factors should we use?
We found the two factors in the example, which are:

             Factor 1   Factor 2
Price         -0.657
Software      -0.161      0.119
Aesthetics     0.933
Brand          0.928      0.242
Friend         0.100      0.992
Family                    0.620
In the results of FA, some coefficients are missing, but this means these coefficients are just too small, not necessarily equal to zero. You can see all the coefficients with more precision by doing something like fa$loadings[,1]. Although the goodness-of-fit tells you whether the current number of factors is sufficient or not, it does not tell you whether the number of factors is large enough to describe the information that the original data have. For instance, why don't we try three factors instead of one or two factors? There are a few ways to answer this question.
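As a sketch of how to reveal the suppressed coefficients: the print method for loadings hides absolute values below its cutoff argument (0.1 by default), so setting cutoff = 0 prints everything. This example uses R's built-in ability.cov covariance list as a stand-in for the survey data, which is an assumption made purely for illustration.

```r
# Stand-in data (assumption): ability.cov ships with R and
# describes six ability tests
fa2 <- factanal(factors = 2, covmat = ability.cov)

# Default printing blanks out loadings with absolute value below 0.1
print(fa2$loadings)

# cutoff = 0 shows every coefficient; digits controls the precision
print(fa2$loadings, digits = 5, cutoff = 0)

# The same values as a plain numeric matrix
L <- unclass(fa2$loadings)
```

Indexing the matrix, for example L[, 1], gives the full first column, which is what fa$loadings[,1] does for the model in the text.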
Comprehensibility
This means whether you can explain your new variables in a sensible way. For example, Factor 1 has large weights on Price, Aesthetics, and Brand, which may indicate whether people want practical aspects or fashionable aspects in their computers. Factor 2 has large weights on Friend and Family, which seems to mean that the people around users have some effect on the computer purchase. Thus, both factors seem to have some meaning, and that's why we should keep them. This is not really a mathematical way to determine the number of factors, but it is a standard approach. Because we want to find factors which explain something, we can just ignore factors which don't really make sense. This is probably intuitive, but I know you may argue that it is too subjective. So we have more mathematical ways to determine the number of factors.
Cumulative variance
Similar to PCA, you can look at the cumulative portion of variance, and once it reaches a certain value, you can stop adding more factors. Deciding the threshold for the cumulative portion is kind of heuristic. It can be 80%, similar to PCA. If your focus is on reducing the number of variables, it can be 50 - 60 %.
Kaiser criterion
The Kaiser rule is to discard components whose eigenvalues are below 1.0. This is also used in SPSS. You can easily calculate the eigenvalues from the correlation matrix.
ev <- eigen(cor(data))
ev$values
2.45701130 1.68900056 0.89157047 0.60583326 0.27285334 0.08373107

So we can determine that the number of factors should be 2. One problem with the Kaiser rule is that it often becomes too strict.
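The counting step can be sketched directly on the six eigenvalues printed above, entered here by hand instead of being recomputed from the data. The variable names are illustrative only.

```r
# Eigenvalues of the correlation matrix, copied from the output above
ev_values <- c(2.45701130, 1.68900056, 0.89157047,
               0.60583326, 0.27285334, 0.08373107)

# Kaiser rule: keep components whose eigenvalue exceeds 1.0
keep <- ev_values > 1
n_factors <- sum(keep)
n_factors

# Sanity check: eigenvalues of a correlation matrix sum to the
# number of variables
sum(ev_values)
```

The third eigenvalue (about 0.89) falls just under the cutoff, which illustrates how a hard threshold of 1.0 can exclude a component that is not much weaker than the ones kept.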
Scree plot
Another way to determine the number of factors is to use a scree plot. You plot the components on the X axis and the eigenvalues on the Y axis, and connect them with lines. You then try to find the spot where the slope of the line becomes less steep. So, how exactly should we find a spot like that? Again, it is kind of heuristic. In some cases (particularly when the number of your original variables is small, like the example above), you can't find a clear spot like that (try to make a plot by using the following code). Nonetheless, it is good to know how to make a scree plot. The following procedure to make a scree plot is based on this webpage. You also need the nFactors package.
ev <- eigen(cor(data))
library(nFactors)
ap
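As a rough alternative that needs only base R (this is not the nFactors procedure the text refers to), a scree plot can be drawn from the eigenvalues reported in the Kaiser criterion section. The eigenvalues are typed in by hand, and the dashed line at 1.0 is added only as a visual reference for the Kaiser cutoff.

```r
# Eigenvalues reported earlier for the example data
ev_values <- c(2.45701130, 1.68900056, 0.89157047,
               0.60583326, 0.27285334, 0.08373107)

# Components on the X axis, eigenvalues on the Y axis, joined by lines
plot(seq_along(ev_values), ev_values, type = "b",
     xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)  # Kaiser cutoff, for reference only

# Drop in eigenvalue between consecutive components; the "elbow" is
# where these drops become small
drops <- -diff(ev_values)
round(drops, 3)
```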
Sampling Techniques: Nonprobability Sampling
Nonprobability sampling is a sampling technique that does not give every element or member of the population an equal chance of being selected as part of the sample. Nonprobability sampling techniques include systematic sampling, quota sampling, incidental sampling, purposive sampling, saturated sampling, and snowball sampling.
1. Systematic sampling
Systematic sampling is a sampling technique based on the order of population members that have been given serial numbers. For example, suppose the population consists of 100 people, all of whom are numbered from 1 to 100. The sample can be drawn by taking only the odd numbers, only the even numbers, or multiples of a certain number, for instance multiples of five. In that case, the sample consists of numbers 1, 5, 10, 15, 20, and so on up to 100.
2. Quota sampling
Quota sampling is a technique for drawing a sample from a population with certain characteristics until the desired quota is reached. For example, in a study of dental caries with a planned sample of 500 people, the study is considered unfinished as long as data collection has not reached the quota of 500 people. If the data are collected by a group of 5 data collectors, each member of the group must reach 100 sample members, or the 5 collectors together must obtain data from 500 sample members.
3. Incidental sampling
Incidental sampling is a technique for determining the sample by chance: anyone who happens to meet the researcher can be used as a sample member, provided that person is judged suitable as a data source.
4. Purposive sampling
Purposive sampling is a technique for determining the sample based on particular considerations. For example, in a study of food quality, the data sources are food experts. This kind of sample is better suited to qualitative research, or to research that does not generalize.
5. Saturated sampling (census)
Saturated sampling is a technique in which all members of the population are used as the sample. This is often done when the population is relatively small, fewer than 30 people, or when the research aims to generalize with a very small error.
6. Snowball sampling
Snowball sampling is a technique for determining the sample that initially
in the column or another number; repeat step C until the desired sample size is reached. Once the desired sample size has been reached, the next step is to divide the sample into a control group and a treatment group according to the research design. Example: selecting a sample with random sampling. A school principal wants to conduct a study of the students at the school. The population of SMK (vocational high school) students turns out to be 600 people. The desired sample is 10% of the population. He wants to use a random technique; to achieve this, he uses the steps
through observation and recording of phenomena