Practical Guide to Principal Component Analysis (PCA) in R & Python

MANISH SARASWAT · MARCH 21, 2016
Introduction

Too much of anything is good for nothing!
What happens when a data set has too many variables? Here are a few possible situations which you might come across:

1. You find that most of the variables are correlated.
2. You lose patience and decide to run a model on the whole data. This returns poor accuracy and you feel terrible.
3. You become indecisive about what to do.
4. You start thinking of some strategic method to find a few important variables.
Trust me, dealing with such situations isn't as difficult as it sounds. Statistical techniques such as factor analysis and principal component analysis help to overcome such difficulties.

In this post, I've explained the concept of principal component analysis in detail. I've kept the explanation simple and informative. For practical understanding, I've also demonstrated using this technique in R with interpretations.

Note: Understanding this concept requires prior knowledge of statistics.
What is Principal Component Analysis?

In simple words, principal component analysis is a method of extracting important variables from a large set of variables available in a data set. It extracts a low dimensional set of features from a high dimensional data set with a motive to capture as much information as possible. With fewer variables, visualization also becomes much more meaningful. PCA is more useful when dealing with 3 or higher dimensional data. It is always performed on a symmetric correlation or covariance matrix. This means the matrix should be numeric and have standardized data.

Let's understand it using an example. Say we have a data set of dimension 300 (n) × 50 (p), where n represents the number of observations and p represents the number of predictors. Since we have a large p = 50, there can be p(p-1)/2 scatter plots, i.e. more than 1000 plots possible for analyzing the variable relationships. Wouldn't it be a tedious job to perform exploratory analysis on this data?

In this case, it would be a lucid approach to select a subset of the p predictors (p << 50) which captures as much information as possible, followed by plotting the observations in the resultant low dimensional space. The image below shows the transformation of high dimensional data (3 dimensions) to low dimensional data (2 dimensions) using PCA. Not to forget, each resultant dimension is a linear combination of the p features.
[Image: 3-dimensional data projected onto 2 dimensions by PCA. Source: nlpca]
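As a quick check of the scatter-plot count quoted above (simple arithmetic, not from the original post):

$$\binom{p}{2} = \frac{p(p-1)}{2} = \frac{50 \times 49}{2} = 1225$$

which is indeed "more than 1000" pairwise plots to inspect by hand.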
What are principal components?

A principal component is a normalized linear combination of the original predictors in a data set. In the image above, PC1 and PC2 are the principal components. Let's say we have a set of predictors $X_1, X_2, \ldots, X_p$.

The first principal component can be written as:

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \phi_{31}X_3 + \cdots + \phi_{p1}X_p$$

where:

• $Z_1$ is the first principal component.

• $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})$ is the loading vector comprising the loadings of the first principal component. The loadings are constrained so that their sum of squares equals 1, because loadings of arbitrarily large magnitude could produce arbitrarily large variance. The loading vector also defines the direction of the principal component ($Z_1$) along which the data varies the most. It results in a line in p-dimensional space which is closest to the n observations, where closeness is measured by average squared Euclidean distance.

• $X_1, \ldots, X_p$ are normalized predictors. Normalized predictors have mean equal to zero and standard deviation equal to one.
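For readers who want the optimization behind this definition (a standard textbook formulation, not spelled out in this post), the first loading vector maximizes the sample variance of the component subject to the unit-norm constraint:

$$\phi_1 = \arg\max_{\|\phi\|_2 = 1} \operatorname{Var}(X\phi) = \arg\max_{\|\phi\|_2 = 1} \phi^{\top} \Sigma\, \phi$$

where $\Sigma$ is the correlation (or covariance) matrix of the standardized predictors. The solution is the eigenvector of $\Sigma$ with the largest eigenvalue, and that eigenvalue equals the variance captured by $Z_1$.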
!here-ore" First principal component is a linear com+ination o- ori.inal predictor *aria+les
hich captures the ma6imum *ariance in the data set0 It determines the direction o- hi.hest *aria+ility in the data0 9ar.er the *aria+ility captured in -irst component" lar.er the in-ormation captured +y component0 No other component can ha*e *aria+ility hi.her than -irst principal component0 !he -irst principal component results in a line hich is closest to the data i0e0 it minimi8es the sum o- s5uared distance +eteen a data point and the line0 Similarly" e can compute the second principal component also0
The second principal component ($Z_2$) is also a linear combination of the original predictors; it captures the remaining variance in the data set and is uncorrelated with $Z_1$. In other words, the correlation between the first and second components should be zero. It can be represented as:

$$Z_2 = \phi_{12}X_1 + \phi_{22}X_2 + \phi_{32}X_3 + \cdots + \phi_{p2}X_p$$

If the two components are uncorrelated, their directions should be orthogonal (image below). This image is based on simulated data with 2 predictors. Notice the direction of the components: as expected, they are orthogonal. This confirms that the correlation between these components is zero.
All succeeding principal components follow the same concept, i.e. each captures the remaining variation without being correlated with the previous components. In general, for n × p dimensional data, min(n-1, p) principal components can be constructed, as the toy example below illustrates. The directions of these components are identified in an unsupervised way, i.e. the response variable (Y) is not used to determine the component directions. Therefore, PCA is an unsupervised approach.

Note: Partial least squares (PLS) is a supervised alternative to PCA. PLS assigns higher weight to variables which are strongly related to the response variable when determining its components.
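A minimal R sketch of the min(n-1, p) limit, using toy data rather than the Big Mart set:

#toy data: n = 5 observations, p = 10 predictors
set.seed(1)
toy <- matrix(rnorm(5 * 10), nrow = 5, ncol = 10)

#prcomp() reports min(n, p) = 5 components, but centering removes one
#degree of freedom, so only min(n - 1, p) = 4 carry non-zero variance
pc <- prcomp(toy, scale. = TRUE)
length(pc$sdev)        #5
sum(pc$sdev > 1e-8)    #4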
Why is normalization of variables necessary?

The principal components are computed from the normalized versions of the original predictors. This is because the original predictors may have different scales. For example, imagine a data set whose variables are measured in units such as gallons, kilometers, light years, etc. The variances of these variables will certainly differ by orders of magnitude. Performing PCA on unnormalized variables will lead to insanely large loadings for variables with high variance. In turn, this will make a principal component depend mostly on the variable with high variance, which is undesirable.

As shown in the image below, PCA was run twice on a data set (first with unscaled, then with scaled predictors). This data set has ~40 variables. You can see that the first principal component is dominated by the variable Item_MRP, and the second principal component is dominated by the variable Item_Weight. This domination prevails due to the high variance associated with those variables. When the variables are scaled, we get a much better representation of the variables in 2D space.
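The effect is easy to reproduce on a small built-in data set. Below is a sketch using R's USArrests data (a stand-in, since the Big Mart plot above isn't reproducible from this excerpt); Assault has a variance of roughly 6945 while Murder's is about 19, so unscaled PCA lets Assault dominate:

#variances differ by orders of magnitude
sapply(USArrests, var)

#without scaling, PC1 is essentially the Assault axis
prcomp(USArrests, scale. = FALSE)$rotation[, 1]

#with scaling, the loadings spread across all four variables
prcomp(USArrests, scale. = TRUE)$rotation[, 1]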
Implement PCA in R & Python (with interpretation)
How many principal components to choose?

I could dive deep into theory, but it would be better to answer this question practically. For this demonstration, I'll be using the data set from the Big Mart Prediction Challenge.

Remember, PCA can be applied only on numerical data. Therefore, if the data has categorical variables, they must be converted to numerical. Also, make sure you have done the basic data cleaning prior to implementing this technique. Let's quickly finish the initial data loading and cleaning steps:

#directory path
path <- ".../Data/Big_Mart_Sales"

#set working directory
setwd(path)

#load train and test file
train <- read.csv("train_Big.csv")
test <- read.csv("test_Big.csv")

#add a column
test$Item_Outlet_Sales <- 1

#combine the data set
combi <- rbind(train, test)

#impute missing values with median
combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

#impute 0 with median
combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility),
                                combi$Item_Visibility)
!ill here" e4*e imputed missin. *alues0 No e are le-t ith remo*in. the dependent (response) *aria+le and other identi-ier *aria+les( i- any)0 As e said
a+o*e" e are practicin. an unsuper*ised learnin. techni5ue" hence response *aria+le must +e remo*ed0 remove the dependent and identi;er variables my7data - subset&combi* select -c&=tem7"utlet79ales* =tem7=denti;er* "utlet7=denti;er''
Let's check the available variables (a.k.a. predictors) in the data set.

#check available variables
colnames(my_data)
Since PCA works on numeric variables, let's see if we have any variables other than numeric.

#check variable class
str(my_data)

'data.frame': 14204 obs. of 9 variables:
 $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
 $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
 $ Item_Visibility          : num  0.016 0.0193 0.0168 0.054 0.054 ...
 $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
 $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
 $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
 $ Outlet_Size              : Factor w/ 4 levels "Other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
 $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
 $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
Sadly" out o- F *aria+les are cate.orical in nature0 e ha*e some additional or3 to do no0 e4ll con*ert these cate.orical *aria+les into numeric usin. one hot encodin.0 load library library&dummies'
#create a dummy data frame
new_my_data <- dummy.data.frame(my_data, names = c("Item_Fat_Content","Item_Type",
                                "Outlet_Establishment_Year","Outlet_Size",
                                "Outlet_Location_Type","Outlet_Type"))
!o chec3" i- e no ha*e a data set o- inte.er *alues" simple rite/ chec the data set str&new7my7data'
And" e no ha*e all the numerical *alues0 e can no .o ahead ith PCA0 !he +ase R -unction prcomp() is used to per-orm PCA0 y de-ault" it centers the *aria+le to ha*e mean e5uals to 8ero0 ith parameter scale. " e normali8e the *aria+les to ha*e standard de*iation e5uals to $0 principal component analysis prin7comp - prcomp&new7my7data* scale. ' names&prin7comp' ?1@ 4sdev4
4rotation4 4center4 4scale4
44
The prcomp() function results in 5 useful measures:

1. center and scale refer to the respective mean and standard deviation of the variables that were used for normalization prior to implementing PCA.

#outputs the mean of variables
prin_comp$center

#outputs the standard deviation of variables
prin_comp$scale
2. The rotation measure provides the principal component loadings. Each column of the rotation matrix contains a principal component loading vector. This is the most important measure we should be interested in.

prin_comp$rotation
This returns 44 principal component loadings. Is that correct? Absolutely. In a data set, the maximum number of principal component loadings is min(n - 1, p). Let's look at the first 4 principal components and the first 5 rows.

prin_comp$rotation[1:5,1:4]
                                  PC1          PC2          PC3          PC4
Item_Weight              0.0054429225 -0.001285666  0.011246194  0.011887106
Item_Fat_ContentLF      -0.0021983314  0.003768557 -0.009790094 -0.016789483
Item_Fat_Contentlow fat -0.0019042710  0.001866905 -0.003066415 -0.018396143
Item_Fat_ContentLow Fat  0.0027936467 -0.002234328  0.028309811  0.056822747
Item_Fat_Contentreg      0.0002936319  0.001120931  0.009033254 -0.001026615

3. In order to compute the principal component score vectors, we don't need to multiply the loadings with the data ourselves. Rather, the matrix x already holds the principal component score vectors, in a 14204 × 44 dimension.

dim(prin_comp$x)
[1] 14204    44
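To convince yourself that the score matrix really is the standardized data multiplied by the loadings, you can recompute it by hand (a sketch using the objects created above):

#scores = (centered, scaled data) %*% loadings
manual_scores <- scale(new_my_data) %*% prin_comp$rotation
all.equal(unname(manual_scores), unname(prin_comp$x))   #expect TRUE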
Let's plot the resultant principal components.

biplot(prin_comp, scale = 0)
The parameter scale = 0 ensures that the arrows are scaled to represent the loadings. To make inferences from the image above, focus on the extreme ends (top, bottom, left, right) of this graph.

We infer that the first principal component corresponds to a measure of Outlet_TypeSupermarket and Outlet_Establishment_Year2007. Similarly, it can be said that the second component corresponds to a measure of Outlet_Location_TypeTier1 and Outlet_SizeOther. For the exact contribution of a variable to a component, you should look at the rotation matrix (above) again.
4. The prcomp() function also provides the facility to compute the standard deviation of each principal component. sdev refers to the standard deviations of the principal components.

#compute standard deviation of each principal component
std_dev <- prin_comp$sdev

#compute variance
pr_var <- std_dev^2
We aim to find the components which explain the maximum variance. This is because we want to retain as much information as possible using these components. So, the higher the explained variance, the more information is contained in those components.

To compute the proportion of variance explained by each component, we simply divide each component's variance by the sum of total variance. This results in:

#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]
 [1] 0.10371853 0.07312958 0.06238014 0.05775207 0.04995800 0.04580274
 [7] 0.04391081 0.02856433 0.02735888 0.02654774 0.02559876 0.02556797
[13] 0.02549516 0.02508831 0.02493932 0.02490938 0.02468313 0.02446016
[19] 0.02390367 0.02371118
This shows that the first principal component explains 10.3% of the variance, the second component explains 7.3%, the third 6.2%, and so on. So, how do we decide how many components to select for the modeling stage?
The answer to this question is provided by a scree plot. A scree plot is used to assess the components or factors which explain the most variability in the data. It represents values in descending order.

#scree plot
plot(prop_varex, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b")
The plot above shows that ~30 components explain around 98.4% of the variance in the data set. In other words, using PCA we have reduced 44 predictors to 30 without compromising on explained variance. This is the power of PCA!

Let's do a confirmation check by plotting a cumulative variance plot. This will give us a clear picture of the number of components.

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b")
This plot shows that 30 components result in a cumulative variance close to ~98%. Therefore, in this case, we'll select the number of components as 30 [PC1 to PC30] and proceed to the modeling stage. This completes the steps to implement PCA in R. For modeling, we'll use these 30 components as predictor variables and follow the normal procedures.
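As a sketch of that modeling hand-off (the variable names come from the code above; the learner is illustrative, not prescribed by this post): since combi was built as rbind(train, test), the first nrow(train) rows of prin_comp$x are the training scores.

#response + first 30 principal component scores
n_train <- nrow(train)
train.data <- data.frame(Item_Outlet_Sales = train$Item_Outlet_Sales,
                         prin_comp$x[1:n_train, 1:30])

#test rows expressed in the same 30-component space
test.data <- as.data.frame(prin_comp$x[-(1:n_train), 1:30])

#any regression learner can be fit from here, e.g.
#library(rpart)
#fit <- rpart(Item_Outlet_Sales ~ ., data = train.data, method = "anova")

One caveat worth noting: fitting PCA on train and test combined, as done here, leaks test-set information into the components; a stricter workflow runs prcomp() on the training rows only and projects the test rows with predict(prin_comp, newdata = ...).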
For Python Users: To implement PCA in Python, simply import PCA from the sklearn library. The interpretation remains the same as explained for R users above. Of course, the result is the same as derived after using R. The data set used for Python is a cleaned version where missing values have been imputed and categorical variables converted into numeric.

import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline

#Load data set
data = pd.read_csv('Big_Mart_PCA.csv')

#convert it to numpy arrays
X = data.values

#Scaling the values
X = scale(X)

pca = PCA(n_components=44)
pca.fit(X)

#The amount of variance that each PC explains
var = pca.explained_variance_ratio_

#Cumulative variance explained
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

print(var1)
[ 10.37  17.68  23.92  29.7 ...
For more information on PCA in Python, visit the scikit-learn documentation.
Points to Remember

1. PCA is used to extract important features from a data set.
2. These features are low dimensional in nature.
3. These features, a.k.a. components, are a resultant of normalized linear combinations of the original predictor variables.
4. These components aim to capture as much information as possible with high explained variance.
5. The first component has the highest variance, followed by the second, third, and so on.
6. The components must be uncorrelated (remember orthogonal directions?). See above.
7. Normalizing data becomes extremely important when the predictors are measured in different units.
8. PCA works best on data sets having 3 or higher dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the resultant cloud of data.
9. PCA is applied on a data set with numeric variables.
10. PCA is a tool which helps to produce better visualizations of high dimensional data.