The Rayleigh Quotient
Nuno Vasconcelos, ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it will look very flat when viewed from the full space, e.g. a 1D subspace in 2D, or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data, the equiprobability contours will be highly skewed ellipsoids
The role of the mean
note that the mean of the data is a function of the coordinate system
• if X has mean µ, then X − µ has mean 0
we can always make the data have zero mean by centering
• if

X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}

and

X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T

• then X_c has zero mean
hence, we can assume that X is zero mean without loss of generality
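as a concrete illustration, here is a minimal numpy sketch (ours, not from the slides) that centers a d × n data matrix both by direct mean subtraction and with the centering matrix I − (1/n)11^T:

```python
# minimal centering sketch, assuming X is a d x n numpy array with one
# example per column (the slides' convention)
import numpy as np

def center(X):
    """subtract the sample mean from every column of X"""
    mu = X.mean(axis=1, keepdims=True)    # d x 1 sample mean
    return X - mu                         # X_c: same shape, zero mean

def center_with_matrix(X):
    """equivalent formulation via the centering matrix I - (1/n) 1 1^T"""
    n = X.shape[1]
    C = np.eye(n) - np.ones((n, n)) / n   # symmetric and idempotent
    return X @ C                          # columns are x_i - mu

X = np.random.default_rng(0).standard_normal((3, 100)) + 5.0
Xc = center(X)
assert np.allclose(Xc.mean(axis=1), 0.0)
assert np.allclose(Xc, center_with_matrix(X))
```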
Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours

y^T \Sigma^{-1} y = K

are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths are set by the eigenvalues λ_i of Σ (the semi-axis along φ_i has length \sqrt{K \lambda_i})
[figure: ellipse in the (y_1, y_2) plane with principal directions φ_1, φ_2 and lengths determined by λ_1, λ_2]
by detecting small eigenvalues we can eliminate dimensions that have little variance; this is PCA
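a small numpy sketch of this idea (illustrative, with assumed synthetic data): fit a Gaussian to 2D data and read off the principal components and eigenvalues of its covariance:

```python
# eigendecomposition of a sample covariance: principal components are
# the eigenvectors, principal variances the eigenvalues
import numpy as np

rng = np.random.default_rng(0)
# anisotropic 2D data: most variance along the first axis
X = rng.standard_normal((2, 500)) * np.array([[3.0], [0.5]])

mu = X.mean(axis=1, keepdims=True)
Sigma = (X - mu) @ (X - mu).T / X.shape[1]   # 2 x 2 sample covariance

lam, Phi = np.linalg.eigh(Sigma)             # ascending eigenvalues
print("principal variances:", lam[::-1])     # approx [9, 0.25]
print("principal components (columns):\n", Phi[:, ::-1])
```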
PCA by SVD
computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T

• 2) compute its SVD

X_c^T = M \Pi N^T

• 3) the principal components are the columns of N, and the eigenvalues are

\lambda_i = \pi_i^2 / n
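a minimal sketch of this recipe in numpy (ours, not the slides'); the sanity check compares the λ_i = π_i²/n values against the eigenvalues of the sample covariance:

```python
import numpy as np

def pca_svd(X):
    """PCA by SVD; X has one example per column."""
    n = X.shape[1]
    Xc_T = X.T - X.mean(axis=1)               # centered, one point per row
    M, pi, N_T = np.linalg.svd(Xc_T, full_matrices=False)
    return N_T.T, pi**2 / n                   # components, lambda_i = pi_i^2 / n

X = np.random.default_rng(1).standard_normal((5, 200))
Phi, lam = pca_svd(X)
Sigma = np.cov(X, bias=True)                  # sample covariance (1/n version)
assert np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(Sigma)))
```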
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio r_k as a function of k

r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{n} \lambda_i^2}

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
[figure: plot of r_k as a function of k]
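a short sketch of this criterion (the eigenvalues and the 70% threshold are assumed for illustration):

```python
# pick the smallest k whose cumulative ratio r_k reaches p
import numpy as np

def components_needed(eigenvalues, p=0.7):
    lam = np.sort(eigenvalues)[::-1]          # descending
    r = np.cumsum(lam**2) / np.sum(lam**2)    # r_k as defined above
    return int(np.searchsorted(r, p)) + 1     # smallest k with r_k >= p

lam = np.array([5.0, 3.0, 2.0, 0.5, 0.1])
print(components_needed(lam, p=0.7))          # 2 for this example
```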
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimate, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the worst possible thing we could do
Fisher’s linear discriminant
find the line z = w^T x that best separates the two classes
[figure: the same two-class data projected onto a bad direction (classes overlap) and a good direction (classes separate)]

w^* = \arg\max_w \frac{\left( E[Z \mid Y=1] - E[Z \mid Y=0] \right)^2}{\operatorname{var}[Z \mid Y=1] + \operatorname{var}[Z \mid Y=0]}
Linear discriminant analysis
this can be written as

J(w) = \frac{w^T S_B w}{w^T S_W w}

with the between-class scatter S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T and the within-class scatter S_W = \Sigma_1 + \Sigma_0
the optimal solution is

w^* = S_W^{-1} (\mu_1 - \mu_0) = (\Sigma_1 + \Sigma_0)^{-1} (\mu_1 - \mu_0)

the BDR after projection on z is equivalent to the BDR on x if
• the two classes are Gaussian and have equal covariance
otherwise, LDA leads to a sub-optimal classifier
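a minimal numpy sketch of this solution (the synthetic two-class data and names are ours):

```python
# Fisher/LDA direction for two classes, one example per column
import numpy as np

def lda_direction(X0, X1):
    mu0, mu1 = X0.mean(axis=1), X1.mean(axis=1)
    Sw = np.cov(X0, bias=True) + np.cov(X1, bias=True)  # within-class scatter
    return np.linalg.solve(Sw, mu1 - mu0)               # w* = Sw^{-1}(mu1 - mu0)

rng = np.random.default_rng(2)
C = [[1.0, 0.8], [0.8, 1.0]]
X0 = rng.multivariate_normal([0, 0], C, 500).T
X1 = rng.multivariate_normal([2, 2], C, 500).T
w = lda_direction(X0, X1)
z0, z1 = w @ X0, w @ X1                                 # projections z = w^T x
print("J(w):", (z1.mean() - z0.mean())**2 / (z0.var() + z1.var()))
```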
The Rayleigh quotient
it turns out that the maximization of the Rayleigh quotient

J(w) = \frac{w^T S_B w}{w^T S_W w},    S_B, S_W symmetric positive semidefinite

appears in many problems in engineering and pattern recognition
we have already seen that this is equivalent to

\max_w w^T S_B w    subject to    w^T S_W w = K

and can be solved using Lagrange multipliers
The Rayleigh quotient
define the Lagrangian

\mathcal{L} = w^T S_B w - \lambda \left( w^T S_W w - K \right)

maximize with respect to w

\nabla_w \mathcal{L} = 2 (S_B - \lambda S_W) w = 0

to obtain the solution

S_B w = \lambda S_W w

this is a generalized eigenvalue problem that you can solve using any eigenvalue routine
which eigenvalue?
The Rayleigh quotient
recall that we want

\max_w w^T S_B w    subject to    w^T S_W w = K

and the optimal w satisfies

S_B w = \lambda S_W w

hence

(w^*)^T S_B w^* = \lambda (w^*)^T S_W w^* = \lambda K

which is maximum for the largest eigenvalue
in summary, we need the generalized eigenvector of S_B w = \lambda S_W w with largest eigenvalue
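for instance, scipy's symmetric eigensolver handles the generalized problem directly (a sketch with assumed random matrices; eigh requires the second matrix to be positive definite):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)); S_B = A @ A.T              # symmetric PSD
B = rng.standard_normal((4, 4)); S_W = B @ B.T + np.eye(4)  # symmetric PD

# eigh(a, b) solves S_B w = lambda S_W w, eigenvalues in ascending order
lam, W = eigh(S_B, S_W)
w_star, lam_max = W[:, -1], lam[-1]                         # largest eigenvalue
assert np.allclose(S_B @ w_star, lam_max * (S_W @ w_star))
```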
The Rayleigh quotient
case 1: S_W invertible
• simplifies to a standard eigenvalue problem

S_W^{-1} S_B w = \lambda w

• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
case 2: S_W not invertible
• this case is more problematic
• in fact, the cost can be unbounded
• consider w = w_r + w_n, with w_r in the row space of S_W and w_n in the null space; then

w^T S_W w = (w_r + w_n)^T S_W (w_r + w_n) = (w_r + w_n)^T S_W w_r = w_r^T S_W (w_r + w_n) = w_r^T S_W w_r
The Rayleigh quotient
and

w^T S_B w = (w_r + w_n)^T S_B (w_r + w_n) = w_r^T S_B w_r + 2 w_r^T S_B w_n + \underbrace{w_n^T S_B w_n}_{\geq 0}

hence, if there is a (w_r, w_n) such that w_r^T S_B w_n > 0,
• we can make the cost arbitrarily large
• by simply scaling up the null space component w_n (which leaves w^T S_W w unchanged)
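a tiny numerical illustration of this blow-up (the matrices are assumed for the example):

```python
# with singular S_W, scaling the null-space component makes
# J(w) = w^T S_B w / w^T S_W w arbitrarily large
import numpy as np

S_W = np.diag([1.0, 0.0])               # singular: e2 spans the null space
S_B = np.array([[1.0, 0.5],
                [0.5, 1.0]])
w_r = np.array([1.0, 0.0])              # row-space component
w_n = np.array([0.0, 1.0])              # null-space component

for t in [1.0, 10.0, 100.0]:
    w = w_r + t * w_n                   # w^T S_W w = 1 for every t
    print(t, (w @ S_B @ w) / (w @ S_W @ w))   # grows like t^2
```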
this can also be seen geometrically
The Rayleigh quotient
recall that the contours

w^T S_W w = K,    with    S_W = \Phi \Lambda \Phi^T

are the ellipses whose
• principal components φ_i are the eigenvectors of S_W
• principal lengths are proportional to 1/\sqrt{\lambda_i}
[figure: ellipse in the (w_1, w_2) plane with principal directions φ_1, φ_2 and lengths growing as the λ_i shrink]
when the eigenvalues go to zero, the ellipses blow up
consider the picture of the optimization problem

\max_w w^T S_B w    subject to    w^T S_W w = K
The Rayleigh quotient

\max_w w^T S_B w    subject to    w^T S_W w = K

[figure: cost ellipses w^T S_B w (red) and the constraint ellipse w^T S_W w = K (blue) in the (w_1, w_2) plane, touching at w*]
the optimal solution is where the outer red ellipse (cost) touches the blue ellipse (constraint)
• in this example, as λ_1 goes to 0, ||w*|| and the cost go to infinity
The Rayleigh quotient
how do we avoid this problem?
• we introduce another constraint

\max_w w^T S_B w    subject to    w^T S_W w = K,    w^T w = L

[figure: the constraint ellipse w^T S_W w = K intersected with the circle w^T w = L; the feasible solutions w* are the intersection points]
• this restricts the set of possible solutions to these points (surfaces in the high dimensional case)
The Rayleigh quotient
the Lagrangian is now

\mathcal{L} = w^T S_B w - \lambda \left( w^T S_W w - K \right) - \beta \left( w^T w - L \right)

and the solution satisfies

\nabla_w \mathcal{L} = 2 (S_B - \lambda S_W - \beta I) w = 0

or

\left( S_B - \lambda [S_W + \gamma I] \right) w = 0,    \gamma = \beta / \lambda

but this is exactly the solution of the original problem with S_W + γI instead of S_W

\max_w w^T S_B w    subject to    w^T [S_W + \gamma I] w = K
The Rayleigh quotient
adding the constraint is equivalent to maximizing the regularized Rayleigh quotient

J(w) = \frac{w^T S_B w}{w^T [S_W + \gamma I] w},    S_B, S_W symmetric positive semidefinite

what does this accomplish?
• note that

S_W = \Phi \Lambda \Phi^T \;\Rightarrow\; S_W + \gamma I = \Phi \Lambda \Phi^T + \gamma \Phi \Phi^T = \Phi [\Lambda + \gamma I] \Phi^T

• this makes all eigenvalues strictly positive
• the matrix is now invertible
The Rayleigh quotient
in summary, for

\max_w \frac{w^T S_B w}{w^T S_W w},    S_B, S_W symmetric positive semidefinite

1) S_W invertible
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• the max value is λK, where λ is the largest eigenvalue
2) S_W not invertible
• regularize: S_W → S_W + γI
• w* is the eigenvector of largest eigenvalue of [S_W + γI]^{-1} S_B
• the max value is λK, where λ is the largest eigenvalue
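both cases reduce to one call of a generalized eigensolver; a short sketch of the regularized case (γ is an assumed small constant):

```python
# regularized case: solve S_B w = lambda (S_W + gamma I) w
import numpy as np
from scipy.linalg import eigh

S_W = np.diag([1.0, 0.0])                    # singular within-class scatter
S_B = np.array([[1.0, 0.5], [0.5, 1.0]])
gamma = 1e-2                                 # assumed regularization constant

lam, W = eigh(S_B, S_W + gamma * np.eye(2))  # now well posed
print("largest eigenvalue:", lam[-1])
print("w*:", W[:, -1])                       # eigenvector of largest eigenvalue
```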
Regularized discriminant analysis
back to LDA:
• when the within-class scatter matrix is non-invertible, instead of

J(w) = \frac{w^T S_B w}{w^T S_W w},    S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T,    S_W = \Sigma_1 + \Sigma_0

• we use the same quotient with the regularized within-class scatter

S_W = \Sigma_1 + \Sigma_0 + \gamma I
Regularized discriminant analysis
this is called regularized discriminant analysis (RDA)
noting that

S_W = \Sigma_1 + \Sigma_0 + \gamma I = (\Sigma_1 + \gamma_1 I) + (\Sigma_0 + \gamma_0 I),    \gamma_1 + \gamma_0 = \gamma

this can also be seen as regularizing each covariance matrix individually
the regularization parameters γ_i are determined by cross-validation
• more on this later
• basically, it means we try several possibilities and keep the best
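a sketch of that procedure (illustrative only: a single train/validation split and an assumed grid of γ values, rather than full cross-validation):

```python
# choose gamma for RDA by scoring candidate values on held-out data
import numpy as np

def rda_direction(X0, X1, gamma):
    mu0, mu1 = X0.mean(axis=1), X1.mean(axis=1)
    Sw = (np.cov(X0, bias=True) + np.cov(X1, bias=True)
          + gamma * np.eye(X0.shape[0]))     # regularized within-class scatter
    return np.linalg.solve(Sw, mu1 - mu0)

def accuracy(w, X0, X1):
    t = 0.5 * (w @ X0.mean(axis=1) + w @ X1.mean(axis=1))  # midpoint threshold
    return 0.5 * ((w @ X0 < t).mean() + (w @ X1 >= t).mean())

rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0] * 5, np.eye(5), 200).T
X1 = rng.multivariate_normal([1] * 5, np.eye(5), 200).T
tr0, va0 = X0[:, :100], X0[:, 100:]
tr1, va1 = X1[:, :100], X1[:, 100:]

best = max((accuracy(rda_direction(tr0, tr1, g), va0, va1), g)
           for g in [1e-3, 1e-2, 1e-1, 1.0])
print("best gamma:", best[1], "with validation accuracy:", best[0])
```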
Principal component analysis
back to PCA: given X with one example per column
• 1) create the centered data matrix

X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T

this has one point per row

X_c^T = \begin{bmatrix} x_1^T - \mu^T \\ \vdots \\ x_n^T - \mu^T \end{bmatrix}

• note that the projection of all points on a principal component φ is

z = X_c^T \phi,    i.e.    \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} (x_1 - \mu)^T \\ \vdots \\ (x_n - \mu)^T \end{bmatrix} \phi
Principal component analysis
and, since

\frac{1}{n} \sum_i z_i = \frac{1}{n} \sum_i (x_i - \mu)^T \phi = \left( \frac{1}{n} \sum_i x_i - \mu \right)^T \phi = 0

the sample variance of z is given by its norm

\operatorname{var}(z) = \frac{1}{n} \sum_i z_i^2 = \frac{1}{n} \| z \|^2

recall that PCA looks for the component of largest variance

\max_\phi \| z \|^2 = \max_\phi \| X_c^T \phi \|^2 = \max_\phi \left( X_c^T \phi \right)^T \left( X_c^T \phi \right) = \max_\phi \phi^T X_c X_c^T \phi
Principal component analysis
recall that the sample covariance is

\Sigma = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n} \sum_i x_i^c \left( x_i^c \right)^T

where x_i^c is the ith column of X_c
this can be written as

\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} (x_1^c)^T \\ \vdots \\ (x_n^c)^T \end{bmatrix} = \frac{1}{n} X_c X_c^T
Principal component analysis
hence the PCA problem is

\max_\phi \phi^T X_c X_c^T \phi = \max_\phi \phi^T \Sigma \phi    (up to the factor 1/n, which does not change the maximizer)

as in LDA, this can be made arbitrarily large by simply scaling φ
to normalize, we constrain φ to have unit norm

\max_\phi \phi^T \Sigma \phi    subject to    \| \phi \| = 1

which is equivalent to

\max_\phi \frac{\phi^T \Sigma \phi}{\phi^T \phi}

this shows that PCA is the maximization of a Rayleigh quotient
Principal component analysis
in this case

\max_w \frac{w^T S_B w}{w^T S_W w},    S_B, S_W symmetric positive semidefinite

with S_B = Σ and S_W = I
S_W is clearly invertible
• no regularization problems
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• this is just the largest eigenvector of the covariance Σ
• the max value is λ, where λ is the largest eigenvalue
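a quick numerical check of this claim (a random Σ is assumed):

```python
# the maximum of phi^T Sigma phi / phi^T phi is the largest eigenvalue
# of Sigma, attained at the corresponding eigenvector
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T                                   # symmetric PSD

lam, Phi = np.linalg.eigh(Sigma)                  # ascending order
quotient = lambda v: (v @ Sigma @ v) / (v @ v)

assert np.isclose(quotient(Phi[:, -1]), lam[-1])  # attained at top eigenvector
assert all(quotient(v) <= lam[-1] + 1e-9          # never exceeded elsewhere
           for v in rng.standard_normal((100, 4)))
```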
The Rayleigh quotient dual
let’s assume, for a moment, that the solution is of the form

w = X_c \alpha

• i.e. a linear combination of the centered datapoints
hence the problem is equivalent to

\max_\alpha \frac{\alpha^T X_c^T S_B X_c \alpha}{\alpha^T X_c^T S_W X_c \alpha}

this does not change its form, and the solution is
• α* is the eigenvector of largest eigenvalue of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c
• the max value is λK, where λ is the largest eigenvalue
The Rayleigh quotient dual
for PCA
• S_W = I and S_B = \Sigma = \frac{1}{n} X_c X_c^T
• the solution satisfies

S_B w = \lambda S_W w \;\Leftrightarrow\; \frac{1}{n} X_c X_c^T w = \lambda w \;\Leftrightarrow\; w = X_c \underbrace{\frac{1}{n \lambda} X_c^T w}_{\alpha}

• and, therefore, we have
• w* is the eigenvector of largest eigenvalue of S_B = \frac{1}{n} X_c X_c^T
• α* is the eigenvector of largest eigenvalue of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c = \frac{1}{n} (X_c^T X_c)^{-1} X_c^T X_c X_c^T X_c = \frac{1}{n} X_c^T X_c, i.e. of the Gram matrix X_c^T X_c
• i.e. we have two alternative ways to compute PCA
Principal component analysis
primal:
• assemble the matrix \Sigma = X_c X_c^T
• compute its eigenvectors φ_i
• these are the principal components
dual:
• assemble the matrix K = X_c^T X_c
• compute its eigenvectors α_i
• the principal components are φ_i = X_c α_i
in both cases we have an eigenvalue problem
• primal: on the sum of outer products

\Sigma = \sum_i x_i^c \left( x_i^c \right)^T

• dual: on the matrix of inner products

K_{ij} = \left( x_i^c \right)^T x_j^c
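a minimal sketch checking that the two routes agree (random data assumed; with n < d the n × n dual problem is the cheaper one):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 10))            # d=50 dimensions, n=10 examples
Xc = X - X.mean(axis=1, keepdims=True)

# primal: eigenvectors of the d x d outer-product matrix
lam_p, Phi = np.linalg.eigh(Xc @ Xc.T)
phi_primal = Phi[:, -1]

# dual: eigenvectors of the n x n inner-product (Gram) matrix
lam_d, Alpha = np.linalg.eigh(Xc.T @ Xc)
phi_dual = Xc @ Alpha[:, -1]                 # map back: phi = Xc alpha
phi_dual /= np.linalg.norm(phi_dual)

assert np.isclose(abs(phi_primal @ phi_dual), 1.0)  # same direction (up to sign)
assert np.isclose(lam_p[-1], lam_d[-1])             # same nonzero eigenvalues
```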
The Rayleigh quotient
this is a property that holds for many Rayleigh quotient problems
• the primal solution is a linear combination of datapoints
• the dual solution only depends on dot-products of the datapoints
whenever both of these hold
• the problem can be kernelized
• this has various interesting properties
• we will talk about them
many examples
• kernel PCA, kernel LDA, manifold learning, etc.
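as a preview of that kernelization, here is a brief kernel PCA sketch (our illustration: an assumed RBF kernel, with the standard feature-space centering of the Gram matrix):

```python
import numpy as np

def kernel_pca(X, n_components=2, sigma=1.0):
    """X: d x n data matrix, one example per column."""
    n = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    K = np.exp(-sq / (2 * sigma**2))          # n x n RBF Gram matrix
    C = np.eye(n) - np.ones((n, n)) / n       # center in feature space:
    Kc = C @ K @ C                            # Kc = (I - 1/n 11^T) K (I - 1/n 11^T)
    lam, Alpha = np.linalg.eigh(Kc)           # ascending eigenvalues
    top = Alpha[:, -n_components:][:, ::-1]   # leading eigenvectors alpha_i
    top = top / np.sqrt(lam[-n_components:][::-1])  # standard kPCA normalization
    return Kc @ top                           # n x n_components projections

Z = kernel_pca(np.random.default_rng(7).standard_normal((3, 40)))
print(Z.shape)                                # (40, 2)
```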