The Rayleigh Quotient
Nuno Vasconcelos, ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it will look very flat when viewed from the full space, e.g. a 1D subspace in 2D, or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data, the equiprobability contours will be highly skewed ellipsoids
The role of the mean
note that the mean of the data is a function of the coordinate system
• if X has mean µ, then X − µ has mean 0
we can always make the data have zero mean by centering
• if

X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}

and

X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T

• then X_c has zero mean
hence, we can assume that X is zero mean without loss of generality
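as a concrete illustration, here is a minimal numpy sketch (ours, not from the slides) that centers a d × n data matrix both by direct mean subtraction and with the centering matrix I − (1/n)11^T:

```python
# minimal centering sketch, assuming X is a d x n numpy array with one
# example per column (the slides' convention)
import numpy as np

def center(X):
    """subtract the sample mean from every column of X"""
    mu = X.mean(axis=1, keepdims=True)    # d x 1 sample mean
    return X - mu                         # X_c: same shape, zero mean

def center_with_matrix(X):
    """equivalent formulation via the centering matrix I - (1/n) 1 1^T"""
    n = X.shape[1]
    C = np.eye(n) - np.ones((n, n)) / n   # symmetric and idempotent
    return X @ C                          # columns are x_i - mu

X = np.random.default_rng(0).standard_normal((3, 100)) + 5.0
Xc = center(X)
assert np.allclose(Xc.mean(axis=1), 0.0)
assert np.allclose(Xc, center_with_matrix(X))
```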
Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours

y^T \Sigma^{-1} y = K

are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths are set by the eigenvalues λ_i of Σ (the semi-axis along φ_i has length \sqrt{K \lambda_i})
[figure: ellipse in the (y_1, y_2) plane with principal directions φ_1, φ_2 and lengths determined by λ_1, λ_2]
by detecting small eigenvalues we can eliminate dimensions that have little variance; this is PCA
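a small numpy sketch of this idea (illustrative, with assumed synthetic data): fit a Gaussian to 2D data and read off the principal components and eigenvalues of its covariance:

```python
# eigendecomposition of a sample covariance: principal components are
# the eigenvectors, principal variances the eigenvalues
import numpy as np

rng = np.random.default_rng(0)
# anisotropic 2D data: most variance along the first axis
X = rng.standard_normal((2, 500)) * np.array([[3.0], [0.5]])

mu = X.mean(axis=1, keepdims=True)
Sigma = (X - mu) @ (X - mu).T / X.shape[1]   # 2 x 2 sample covariance

lam, Phi = np.linalg.eigh(Sigma)             # ascending eigenvalues
print("principal variances:", lam[::-1])     # approx [9, 0.25]
print("principal components (columns):\n", Phi[:, ::-1])
```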
PCA by SVD
computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T

• 2) compute its SVD

X_c^T = M \Pi N^T

• 3) the principal components are the columns of N, and the eigenvalues are

\lambda_i = \pi_i^2 / n
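a minimal sketch of this recipe in numpy (ours, not the slides'); the sanity check compares the λ_i = π_i²/n values against the eigenvalues of the sample covariance:

```python
import numpy as np

def pca_svd(X):
    """PCA by SVD; X has one example per column."""
    n = X.shape[1]
    Xc_T = X.T - X.mean(axis=1)               # centered, one point per row
    M, pi, N_T = np.linalg.svd(Xc_T, full_matrices=False)
    return N_T.T, pi**2 / n                   # components, lambda_i = pi_i^2 / n

X = np.random.default_rng(1).standard_normal((5, 200))
Phi, lam = pca_svd(X)
Sigma = np.cov(X, bias=True)                  # sample covariance (1/n version)
assert np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(Sigma)))
```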
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio r_k as a function of k

r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{n} \lambda_i^2}

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
[figure: plot of r_k as a function of k]
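a short sketch of this criterion (the eigenvalues and the 70% threshold are assumed for illustration):

```python
# pick the smallest k whose cumulative ratio r_k reaches p
import numpy as np

def components_needed(eigenvalues, p=0.7):
    lam = np.sort(eigenvalues)[::-1]          # descending
    r = np.cumsum(lam**2) / np.sum(lam**2)    # r_k as defined above
    return int(np.searchsorted(r, p)) + 1     # smallest k with r_k >= p

lam = np.array([5.0, 3.0, 2.0, 0.5, 0.1])
print(components_needed(lam, p=0.7))          # 2 for this example
```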
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimate, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the worst possible thing we could do
Fisher’s linear discriminant
find the line z = w^T x that best separates the two classes
[figure: the same two-class data projected onto a bad direction (classes overlap) and a good direction (classes separate)]

w^* = \arg\max_w \frac{\left( E[Z \mid Y=1] - E[Z \mid Y=0] \right)^2}{\operatorname{var}[Z \mid Y=1] + \operatorname{var}[Z \mid Y=0]}
Linear discriminant analysis
this can be written as

J(w) = \frac{w^T S_B w}{w^T S_W w}

with the between-class scatter S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T and the within-class scatter S_W = \Sigma_1 + \Sigma_0
the optimal solution is

w^* = S_W^{-1} (\mu_1 - \mu_0) = (\Sigma_1 + \Sigma_0)^{-1} (\mu_1 - \mu_0)

the BDR after projection on z is equivalent to the BDR on x if
• the two classes are Gaussian and have equal covariance
otherwise, LDA leads to a sub-optimal classifier
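a minimal numpy sketch of this solution (the synthetic two-class data and names are ours):

```python
# Fisher/LDA direction for two classes, one example per column
import numpy as np

def lda_direction(X0, X1):
    mu0, mu1 = X0.mean(axis=1), X1.mean(axis=1)
    Sw = np.cov(X0, bias=True) + np.cov(X1, bias=True)  # within-class scatter
    return np.linalg.solve(Sw, mu1 - mu0)               # w* = Sw^{-1}(mu1 - mu0)

rng = np.random.default_rng(2)
C = [[1.0, 0.8], [0.8, 1.0]]
X0 = rng.multivariate_normal([0, 0], C, 500).T
X1 = rng.multivariate_normal([2, 2], C, 500).T
w = lda_direction(X0, X1)
z0, z1 = w @ X0, w @ X1                                 # projections z = w^T x
print("J(w):", (z1.mean() - z0.mean())**2 / (z0.var() + z1.var()))
```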
The Rayleigh quotient
it turns out that the maximization of the Rayleigh quotient

J(w) = \frac{w^T S_B w}{w^T S_W w},    S_B, S_W symmetric positive semidefinite

appears in many problems in engineering and pattern recognition
we have already seen that this is equivalent to

\max_w w^T S_B w    subject to    w^T S_W w = K

and can be solved using Lagrange multipliers
The Rayleigh quotient
define the Lagrangian

\mathcal{L} = w^T S_B w - \lambda \left( w^T S_W w - K \right)

maximize with respect to w

\nabla_w \mathcal{L} = 2 (S_B - \lambda S_W) w = 0

to obtain the solution

S_B w = \lambda S_W w

this is a generalized eigenvalue problem that you can solve using any eigenvalue routine
which eigenvalue?
The Rayleigh quotient
recall that we want

\max_w w^T S_B w    subject to    w^T S_W w = K

and the optimal w satisfies

S_B w = \lambda S_W w

hence

(w^*)^T S_B w^* = \lambda (w^*)^T S_W w^* = \lambda K

which is maximum for the largest eigenvalue
in summary, we need the generalized eigenvector of S_B w = \lambda S_W w with largest eigenvalue
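for instance, scipy's symmetric eigensolver handles the generalized problem directly (a sketch with assumed random matrices; eigh requires the second matrix to be positive definite):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)); S_B = A @ A.T              # symmetric PSD
B = rng.standard_normal((4, 4)); S_W = B @ B.T + np.eye(4)  # symmetric PD

# eigh(a, b) solves S_B w = lambda S_W w, eigenvalues in ascending order
lam, W = eigh(S_B, S_W)
w_star, lam_max = W[:, -1], lam[-1]                         # largest eigenvalue
assert np.allclose(S_B @ w_star, lam_max * (S_W @ w_star))
```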
The Rayleigh quotient
case 1: S_W invertible
• simplifies to a standard eigenvalue problem

S_W^{-1} S_B w = \lambda w

• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
case 2: S_W not invertible
• this case is more problematic
• in fact, the cost can be unbounded
• consider w = w_r + w_n, with w_r in the row space of S_W and w_n in the null space; then

w^T S_W w = (w_r + w_n)^T S_W (w_r + w_n) = (w_r + w_n)^T S_W w_r = w_r^T S_W (w_r + w_n) = w_r^T S_W w_r
The Rayleigh quotient
and

w^T S_B w = (w_r + w_n)^T S_B (w_r + w_n) = w_r^T S_B w_r + 2 w_r^T S_B w_n + \underbrace{w_n^T S_B w_n}_{\geq 0}

hence, if there is a (w_r, w_n) such that w_r^T S_B w_n > 0,
• we can make the cost arbitrarily large
• by simply scaling up the null space component w_n (which leaves w^T S_W w unchanged)
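a tiny numerical illustration of this blow-up (the matrices are assumed for the example):

```python
# with singular S_W, scaling the null-space component makes
# J(w) = w^T S_B w / w^T S_W w arbitrarily large
import numpy as np

S_W = np.diag([1.0, 0.0])               # singular: e2 spans the null space
S_B = np.array([[1.0, 0.5],
                [0.5, 1.0]])
w_r = np.array([1.0, 0.0])              # row-space component
w_n = np.array([0.0, 1.0])              # null-space component

for t in [1.0, 10.0, 100.0]:
    w = w_r + t * w_n                   # w^T S_W w = 1 for every t
    print(t, (w @ S_B @ w) / (w @ S_W @ w))   # grows like t^2
```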
this can also be seen geometrically
The Rayleigh quotient
recall that the contours

w^T S_W w = K,    with    S_W = \Phi \Lambda \Phi^T

are the ellipses whose
• principal components φ_i are the eigenvectors of S_W
• principal lengths are proportional to 1/\sqrt{\lambda_i}
[figure: ellipse in the (w_1, w_2) plane with principal directions φ_1, φ_2 and lengths growing as the λ_i shrink]
when the eigenvalues go to zero, the ellipses blow up
consider the picture of the optimization problem

\max_w w^T S_B w    subject to    w^T S_W w = K
The Rayleigh quotient

\max_w w^T S_B w    subject to    w^T S_W w = K

[figure: cost ellipses w^T S_B w (red) and the constraint ellipse w^T S_W w = K (blue) in the (w_1, w_2) plane, touching at w*]
the optimal solution is where the outer red ellipse (cost) touches the blue ellipse (constraint)
• in this example, as λ_1 goes to 0, ||w*|| and the cost go to infinity
The Rayleigh quotient
how do we avoid this problem?
• we introduce another constraint

\max_w w^T S_B w    subject to    w^T S_W w = K,    w^T w = L

[figure: the constraint ellipse w^T S_W w = K intersected with the circle w^T w = L; the feasible solutions w* are the intersection points]
• this restricts the set of possible solutions to these points (surfaces in the high dimensional case)
The Rayleigh quotient
the Lagrangian is now

\mathcal{L} = w^T S_B w - \lambda \left( w^T S_W w - K \right) - \beta \left( w^T w - L \right)

and the solution satisfies

\nabla_w \mathcal{L} = 2 (S_B - \lambda S_W - \beta I) w = 0

or

\left( S_B - \lambda [S_W + \gamma I] \right) w = 0,    \gamma = \beta / \lambda

but this is exactly the solution of the original problem with S_W + γI instead of S_W

\max_w w^T S_B w    subject to    w^T [S_W + \gamma I] w = K
The Rayleigh quotient
adding the constraint is equivalent to maximizing the regularized Rayleigh quotient

J(w) = \frac{w^T S_B w}{w^T [S_W + \gamma I] w},    S_B, S_W symmetric positive semidefinite

what does this accomplish?
• note that

S_W = \Phi \Lambda \Phi^T \;\Rightarrow\; S_W + \gamma I = \Phi \Lambda \Phi^T + \gamma \Phi \Phi^T = \Phi [\Lambda + \gamma I] \Phi^T

• this makes all eigenvalues strictly positive
• the matrix is now invertible
The Rayleigh quotient
in summary, for

\max_w \frac{w^T S_B w}{w^T S_W w},    S_B, S_W symmetric positive semidefinite

1) S_W invertible
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• the max value is λK, where λ is the largest eigenvalue
2) S_W not invertible
• regularize: S_W → S_W + γI
• w* is the eigenvector of largest eigenvalue of [S_W + γI]^{-1} S_B
• the max value is λK, where λ is the largest eigenvalue
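both cases reduce to one call of a generalized eigensolver; a short sketch of the regularized case (γ is an assumed small constant):

```python
# regularized case: solve S_B w = lambda (S_W + gamma I) w
import numpy as np
from scipy.linalg import eigh

S_W = np.diag([1.0, 0.0])                    # singular within-class scatter
S_B = np.array([[1.0, 0.5], [0.5, 1.0]])
gamma = 1e-2                                 # assumed regularization constant

lam, W = eigh(S_B, S_W + gamma * np.eye(2))  # now well posed
print("largest eigenvalue:", lam[-1])
print("w*:", W[:, -1])                       # eigenvector of largest eigenvalue
```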
Regularized discriminant analysis
back to LDA:
• when the within-class scatter matrix is non-invertible, instead of

J(w) = \frac{w^T S_B w}{w^T S_W w},    S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T,    S_W = \Sigma_1 + \Sigma_0

• we use the same quotient with the regularized within-class scatter

S_W = \Sigma_1 + \Sigma_0 + \gamma I
Regularized discriminant analysis
this is called regularized discriminant analysis (RDA)
noting that

S_W = \Sigma_1 + \Sigma_0 + \gamma I = (\Sigma_1 + \gamma_1 I) + (\Sigma_0 + \gamma_0 I),    \gamma_1 + \gamma_0 = \gamma

this can also be seen as regularizing each covariance matrix individually
the regularization parameters γ_i are determined by cross-validation
• more on this later
• basically, it means we try several possibilities and keep the best
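a sketch of that procedure (illustrative only: a single train/validation split and an assumed grid of γ values, rather than full cross-validation):

```python
# choose gamma for RDA by scoring candidate values on held-out data
import numpy as np

def rda_direction(X0, X1, gamma):
    mu0, mu1 = X0.mean(axis=1), X1.mean(axis=1)
    Sw = (np.cov(X0, bias=True) + np.cov(X1, bias=True)
          + gamma * np.eye(X0.shape[0]))     # regularized within-class scatter
    return np.linalg.solve(Sw, mu1 - mu0)

def accuracy(w, X0, X1):
    t = 0.5 * (w @ X0.mean(axis=1) + w @ X1.mean(axis=1))  # midpoint threshold
    return 0.5 * ((w @ X0 < t).mean() + (w @ X1 >= t).mean())

rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0] * 5, np.eye(5), 200).T
X1 = rng.multivariate_normal([1] * 5, np.eye(5), 200).T
tr0, va0 = X0[:, :100], X0[:, 100:]
tr1, va1 = X1[:, :100], X1[:, 100:]

best = max((accuracy(rda_direction(tr0, tr1, g), va0, va1), g)
           for g in [1e-3, 1e-2, 1e-1, 1.0])
print("best gamma:", best[1], "with validation accuracy:", best[0])
```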
Principal component analysis
back to PCA: given X with one example per column
• 1) create the centered data matrix

X_c^T = \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X^T

this has one point per row

X_c^T = \begin{bmatrix} x_1^T - \mu^T \\ \vdots \\ x_n^T - \mu^T \end{bmatrix}

• note that the projection of all points on a principal component φ is

z = X_c^T \phi,    i.e.    \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} (x_1 - \mu)^T \\ \vdots \\ (x_n - \mu)^T \end{bmatrix} \phi
Principal component analysis
and, since

\frac{1}{n} \sum_i z_i = \frac{1}{n} \sum_i (x_i - \mu)^T \phi = \left( \frac{1}{n} \sum_i x_i - \mu \right)^T \phi = 0

the sample variance of z is given by its norm

\operatorname{var}(z) = \frac{1}{n} \sum_i z_i^2 = \frac{1}{n} \| z \|^2

recall that PCA looks for the component of largest variance

\max_\phi \| z \|^2 = \max_\phi \| X_c^T \phi \|^2 = \max_\phi \left( X_c^T \phi \right)^T \left( X_c^T \phi \right) = \max_\phi \phi^T X_c X_c^T \phi
Principal component analysis
recall that the sample covariance is

\Sigma = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n} \sum_i x_i^c \left( x_i^c \right)^T

where x_i^c is the ith column of X_c
this can be written as

\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} (x_1^c)^T \\ \vdots \\ (x_n^c)^T \end{bmatrix} = \frac{1}{n} X_c X_c^T
Principal component analysis
hence the PCA problem is

\max_\phi \phi^T X_c X_c^T \phi = \max_\phi \phi^T \Sigma \phi    (up to the factor 1/n, which does not change the maximizer)

as in LDA, this can be made arbitrarily large by simply scaling φ
to normalize, we constrain φ to have unit norm

\max_\phi \phi^T \Sigma \phi    subject to    \| \phi \| = 1

which is equivalent to

\max_\phi \frac{\phi^T \Sigma \phi}{\phi^T \phi}

this shows that PCA is the maximization of a Rayleigh quotient
Principal component analysis
in this case

\max_w \frac{w^T S_B w}{w^T S_W w},    S_B, S_W symmetric positive semidefinite

with S_B = Σ and S_W = I
S_W is clearly invertible
• no regularization problems
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• this is just the largest eigenvector of the covariance Σ
• the max value is λ, where λ is the largest eigenvalue
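a quick numerical check of this claim (a random Σ is assumed):

```python
# the maximum of phi^T Sigma phi / phi^T phi is the largest eigenvalue
# of Sigma, attained at the corresponding eigenvector
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T                                   # symmetric PSD

lam, Phi = np.linalg.eigh(Sigma)                  # ascending order
quotient = lambda v: (v @ Sigma @ v) / (v @ v)

assert np.isclose(quotient(Phi[:, -1]), lam[-1])  # attained at top eigenvector
assert all(quotient(v) <= lam[-1] + 1e-9          # never exceeded elsewhere
           for v in rng.standard_normal((100, 4)))
```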
The Rayleigh quotient dual
let’s assume, for a moment, that the solution is of the form

w = X_c \alpha

• i.e. a linear combination of the centered datapoints
hence the problem is equivalent to

\max_\alpha \frac{\alpha^T X_c^T S_B X_c \alpha}{\alpha^T X_c^T S_W X_c \alpha}

this does not change its form, and the solution is
• α* is the eigenvector of largest eigenvalue of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c
• the max value is λK, where λ is the largest eigenvalue
The Rayleigh quotient dual
for PCA
• S_W = I and S_B = \Sigma = \frac{1}{n} X_c X_c^T
• the solution satisfies

S_B w = \lambda S_W w \;\Leftrightarrow\; \frac{1}{n} X_c X_c^T w = \lambda w \;\Leftrightarrow\; w = X_c \underbrace{\frac{1}{n \lambda} X_c^T w}_{\alpha}

• and, therefore, we have
• w* is the eigenvector of largest eigenvalue of S_B = \frac{1}{n} X_c X_c^T
• α* is the eigenvector of largest eigenvalue of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c = \frac{1}{n} (X_c^T X_c)^{-1} X_c^T X_c X_c^T X_c = \frac{1}{n} X_c^T X_c, i.e. of the Gram matrix X_c^T X_c
• i.e. we have two alternative ways to compute PCA
Principal component analysis
primal:
• assemble the matrix \Sigma = X_c X_c^T
• compute its eigenvectors φ_i
• these are the principal components
dual:
• assemble the matrix K = X_c^T X_c
• compute its eigenvectors α_i
• the principal components are φ_i = X_c α_i
in both cases we have an eigenvalue problem
• primal: on the sum of outer products

\Sigma = \sum_i x_i^c \left( x_i^c \right)^T

• dual: on the matrix of inner products

K_{ij} = \left( x_i^c \right)^T x_j^c
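a minimal sketch checking that the two routes agree (random data assumed; with n < d the n × n dual problem is the cheaper one):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 10))            # d=50 dimensions, n=10 examples
Xc = X - X.mean(axis=1, keepdims=True)

# primal: eigenvectors of the d x d outer-product matrix
lam_p, Phi = np.linalg.eigh(Xc @ Xc.T)
phi_primal = Phi[:, -1]

# dual: eigenvectors of the n x n inner-product (Gram) matrix
lam_d, Alpha = np.linalg.eigh(Xc.T @ Xc)
phi_dual = Xc @ Alpha[:, -1]                 # map back: phi = Xc alpha
phi_dual /= np.linalg.norm(phi_dual)

assert np.isclose(abs(phi_primal @ phi_dual), 1.0)  # same direction (up to sign)
assert np.isclose(lam_p[-1], lam_d[-1])             # same nonzero eigenvalues
```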
The Rayleigh quotient
this is a property that holds for many Rayleigh quotient problems
• the primal solution is a linear combination of datapoints
• the dual solution only depends on dot-products of the datapoints
whenever both of these hold
• the problem can be kernelized
• this has various interesting properties
• we will talk about them
many examples
• kernel PCA, kernel LDA, manifold learning, etc.
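as a preview of that kernelization, here is a brief kernel PCA sketch (our illustration: an assumed RBF kernel, with the standard feature-space centering of the Gram matrix):

```python
import numpy as np

def kernel_pca(X, n_components=2, sigma=1.0):
    """X: d x n data matrix, one example per column."""
    n = X.shape[1]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    K = np.exp(-sq / (2 * sigma**2))          # n x n RBF Gram matrix
    C = np.eye(n) - np.ones((n, n)) / n       # center in feature space:
    Kc = C @ K @ C                            # Kc = (I - 1/n 11^T) K (I - 1/n 11^T)
    lam, Alpha = np.linalg.eigh(Kc)           # ascending eigenvalues
    top = Alpha[:, -n_components:][:, ::-1]   # leading eigenvectors alpha_i
    top = top / np.sqrt(lam[-n_components:][::-1])  # standard kPCA normalization
    return Kc @ top                           # n x n_components projections

Z = kernel_pca(np.random.default_rng(7).standard_normal((3, 40)))
print(Z.shape)                                # (40, 2)
```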