Solutions to Steven Kay's Statistical Estimation Book
Satish Bysany
Aalto University School of Electrical Engineering
March 1, 2011
1 Introduction

This is a set of notes describing solutions to Steven Kay's book Fundamentals of Statistical Signal Processing: Estimation Theory. A brief review of notation is in order.

1.1 Notation
• I is the identity matrix.

• 0 represents a matrix or vector of all zeros.

• e is a column vector of all ones.

• e_j is a column vector whose jth element is 1 and whose remaining elements are 0.

• J is the exchange matrix, with 1s on the anti-diagonal and 0s elsewhere.

• a · b = a^H b is the dot product of a and b.

• ∂f(t)/∂t, the derivative of a scalar function f(t) depending on an M × 1 real vector parameter t, is defined by

∂f(t)/∂t = [∂f(t)/∂t_1, ∂f(t)/∂t_2, . . . , ∂f(t)/∂t_M]^T

• ∂h(t)/∂t is the derivative of an M × 1 real vector function h(t) depending on a scalar value t:

∂h(t)/∂t = [∂h_1(t)/∂t, ∂h_2(t)/∂t, . . . , ∂h_M(t)/∂t]^T
2 Solutions to Problems in Chapter 2

2.1 Problem 2.1
The data x = {x[0], x[1], . . . , x[N − 1]} are observed, where the x[n]'s are i.i.d. as N(0, σ²). We wish to estimate the variance σ² as

σ̂² = (1/N) Σ_{n=0}^{N−1} x²[n]    (1)
Solution. From the problem definition, it follows that, for all n,

μ = E(x[n]) = 0,  σ² = E((x[n] − μ)²) = E(x²[n])
Now apply the E(·) operator to both sides of Eq. (1) and use the fact that, for any two random variables X and Y, E(X + Y) = E(X) + E(Y):

E(σ̂²) = (1/N) Σ_{n=0}^{N−1} E(x²[n]) = (1/N) Σ_{n=0}^{N−1} σ² = Nσ²/N = σ²    (2)

Hence the estimator (1) is unbiased. Note that this result holds even if the x[n]'s are not independent! Next, apply the variance operator var(·) to both sides of Eq. (1) and use the fact that, for independent random variables X and Y,

var(aX + bY) = a² var(X) + b² var(Y)

which gives

var(σ̂²) = (1/N²) Σ_{n=0}^{N−1} var(x²[n])    (3)
Let X ∼ N(0, 1) be a normal random variable with zero mean and unit variance. Then, by definition, Y = X² ∼ χ²₁ is chi-square distributed with 1 degree of freedom. We know that E(χ²_n) = n and var(χ²_n) = 2n, so var(Y) = var(X²) = 2 · 1 = 2. Introducing Z = σX implies that var(Z) = σ² var(X) = σ². Since E(Z) = σE(X) = 0, we conclude Z ∼ N(0, σ²). Now consider var(Z²) = var(σ²X²) = σ⁴ var(X²) = 2σ⁴. Since each x[n] ∼ N(0, σ²), we have

var(x²[0]) = var(x²[1]) = ··· = var(x²[N − 1]) = 2σ⁴

Hence, Eq. (3) simplifies to

var(σ̂²) = (1/N²) Σ_{n=0}^{N−1} 2σ⁴ = N · 2σ⁴/N² = 2σ⁴/N    (4)

As N → ∞, var(σ̂²) → 0.
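As a quick numerical sanity check (an addition to these notes, not part of the original solution; all constants are arbitrary choices of mine), the following NumPy sketch estimates the mean and variance of σ̂² by Monte Carlo and compares them against σ² and 2σ⁴/N:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, trials = 100, 3.0, 200_000

# Draw `trials` independent data sets of N i.i.d. N(0, sigma2) samples.
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))

# The estimator of Eq. (1): sample average of x^2[n].
sigma2_hat = np.mean(x**2, axis=1)

print("mean of estimates:", sigma2_hat.mean(), "(theory:", sigma2, ")")
print("var  of estimates:", sigma2_hat.var(), "(theory:", 2 * sigma2**2 / N, ")")
```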
2.2 Problem 2.5
Two samples {x[0], x[1]} are independently observed from a N(0, σ²) distribution. The estimator

σ̂² = (1/2)(x²[0] + x²[1])    (5)

is unbiased. Find the PDF of σ̂² to determine if it is symmetric about σ².

Solution. Consider two standard normal random variables X₀ and X₁, that is, Xᵢ ∼ N(0, 1), i = 0, 1. Then, by definition, X = X₀² + X₁² is χ²(n)-distributed with n = 2 degrees of freedom. Its PDF is

f_X(x) = (1/2) e^{−x/2},  x > 0
Let x[0] = σX₀ and x[1] = σX₁. Then x²[0] + x²[1] = σ²(X₀² + X₁²) = σ²X

⟹ σ̂² = σ²X/2  from Eq. (5)

We know that, for two continuous random variables X and Y related as Y = aX + b,

f_Y(y) = (1/|a|) f_X((y − b)/a)

Taking a = σ²/2, b = 0, and θ = σ², the PDF of σ̂² is

f_{σ̂²}(y; θ) = (1/a) f_X(y/a) = (2/σ²) · (1/2) e^{−y/(2a)} = (1/σ²) e^{−y/σ²} = (1/θ) e^{−y/θ},  y > 0
It is obvious that f_{σ̂²}(θ + y; θ) ≠ f_{σ̂²}(θ − y; θ), so the PDF is not symmetric about θ = σ². In fact, this exponential PDF is monotonically decreasing in y, so it is not symmetric about any point.
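A small simulation (added here as an illustration, with constants of my own choosing) makes the asymmetry visible: the empirical distribution of σ̂² matches the exponential density (1/θ)e^{−y/θ}, and the estimate falls below σ² far more often than above it:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, trials = 2.0, 500_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, 2))
s2_hat = 0.5 * (x[:, 0]**2 + x[:, 1]**2)   # the estimator of Eq. (5)

# Empirical density on a grid vs. the derived PDF (1/theta) * exp(-y/theta).
edges = np.linspace(0.0, 10.0, 51)
hist, _ = np.histogram(s2_hat, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
pdf = np.exp(-centers / sigma2) / sigma2

print("max abs deviation from exponential PDF:", np.abs(hist - pdf).max())
print("P(sigma2_hat < sigma2):", np.mean(s2_hat < sigma2))   # ~0.63, not 0.5
```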
3 Chapter 3: CRLB

3.1 Formulas
Let a random variable X depend on some parameter t. We write the PDF of X as f_X(x; t); it represents a family of PDFs, each one with a different value of t. When the PDF is viewed as a function of t for a given, fixed value of x, it is termed the likelihood function. We define the log-likelihood function as

L(t) = L_X(t | x) = ln f_X(x; t)    (6)

Note that t is a deterministic but unknown parameter. We simply write L(t) when the random variable X is known from context. For the sake of notation, we define

L̇ = (∂/∂t) L(t) = (∂/∂t) ln f_X(x; t) = (1/f_X(x; t)) (∂/∂t) f_X(x; t)    (7)

L̈ = (∂²/∂t²) L(t) = (∂²/∂t²) ln f_X(x; t)    (8)
Taking the expectation w.r.t. X, if the regularity condition

E(L̇) = 0    (9)

is satisfied, then there exists a lower bound on the variance of an unbiased estimator t̂,

var(t̂) ≥ −1/E(L̈)    (10)

Furthermore, equality holds, for all t, exactly when

var(t̂) = −1/E(L̈) ⟺ L̇ = g(t)(h(x) − t) ⟺ t̂ = h(x)    (11)

where g(·) and h(·) are some functions. Note that the above applies only to unbiased estimates, so E(t̂) = t = E[h(x)]. The minimum variance is also given by

var(t̂) = 1/g(t) = −1/E(L̈) ⟹ g(t) = −E(L̈)    (12)
Note: t̂ is an estimate of t. Hence, t̂ cannot depend on t itself (if it did, such an estimate would be useless!). So the result t̂ = h(x) makes intuitive sense, because t̂ depends only on the observed, given data x and not at all on t. But the mean and variance of t̂ generally do depend on t, and that is OK! For the MVUE case, the mean E(t̂) = t and the variance var(t̂) = 1/g(t) are both purely functions of t alone. Replacing the scalar random variable X by a vector of random variables x, the results still hold.

Facts
• Identity: if the regularity condition is satisfied, then

E(L̇²) = −E(L̈)

• The Fisher information I(t) for data x is defined by

I(t) = −E(L̈)

so the minimum variance is the reciprocal of the Fisher information. The "more the information", the lower the CRLB.
• For a deterministic signal s[n; t] with an unknown parameter t in zero-mean AWGN w[n] ∼ N(0, σ²),

x[n] = s[n; t] + w[n],  n = 0, 1, . . . , N − 1,

the minimum variance (the CRLB, if it exists) is given by

var(t̂) ≥ σ² / Σ_{n=0}^{N−1} (∂s[n; t]/∂t)² = σ² / ‖∂s/∂t‖²

(a numerical sketch of this formula follows right after this list).
• For an estimate t̂ of t, if the CRLB is known, then any transformation τ = g(t), for some function g(·), has the new CRLB

CRLB_τ = CRLB_t · (∂g(t)/∂t)²
• The CRLB always increases as we estimate more parameters from the same given data.
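Here is the numerical sketch of the signal-in-AWGN bound promised above (an addition to these notes; the example signal s[n; t] = cos(2πtn) and all constants are arbitrary choices of mine). It approximates ∂s/∂t by central differences:

```python
import numpy as np

def crlb_awgn(s, t, sigma2, N, eps=1e-6):
    """CRLB = sigma^2 / sum_n (ds[n;t]/dt)^2 for x[n] = s[n;t] + w[n]."""
    n = np.arange(N)
    ds_dt = (s(n, t + eps) - s(n, t - eps)) / (2 * eps)  # central difference
    return sigma2 / np.sum(ds_dt**2)

# Example signal: s[n; t] = cos(2*pi*t*n), with t a frequency-like parameter.
s = lambda n, t: np.cos(2 * np.pi * t * n)
print(crlb_awgn(s, t=0.12, sigma2=1.0, N=50))
```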
Let θ = [θ₁, θ₂, . . . , θ_M]^T be a vector parameter. Assume that an estimator θ̂ = [θ̂₁, θ̂₂, . . . , θ̂_M]^T is unbiased, that is,

E(θ̂) = θ ⟺ E(θ̂ᵢ) = θᵢ,  i = 1, 2, . . . , M

The M × M Fisher information matrix I(θ) is the matrix whose (i, j)th element is given by

[I(θ)]_{i,j} = −E[∂² ln p(x; θ) / ∂θᵢ ∂θⱼ]
Note that p(x; θ) is a scalar function, depending on the data vector x and the parameter vector θ. For example, if w[n] is i.i.d. N(0, σ²) and x[n] = θ₁ + nθ₂ + w[n] for n = 0, 1, . . . , N − 1, then

p(x; θ) = (1/(2πσ²)^{N/2}) exp(−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − θ₁ − nθ₂)²)

Say x = [1, 2, 5, 3]^T, θ = [1, 2]^T, and σ = 2; then p(x; θ) ≈ 1.89 × 10⁻⁴.
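Because constants like this are easy to garble, a direct evaluation is a useful check. The snippet below (an added aid, using the n = 0, . . . , N − 1 indexing above) computes the likelihood:

```python
import numpy as np

def likelihood(x, theta1, theta2, sigma):
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))                      # n = 0, 1, ..., N-1
    resid = x - theta1 - n * theta2
    norm = (2 * np.pi * sigma**2) ** (len(x) / 2)
    return np.exp(-np.sum(resid**2) / (2 * sigma**2)) / norm

print(likelihood([1, 2, 5, 3], theta1=1.0, theta2=2.0, sigma=2.0))  # ~1.89e-4
```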
Note: The Fisher matrix is symmetric, because the partial derivatives do not depend on the order of evaluation. If the regularity condition

E[∂ ln p(x; θ)/∂θ] = 0  for all θ

is satisfied (where the expectation is taken w.r.t. p(x; θ)), then the covariance matrix of any unbiased estimator θ̂ satisfies

C_θ̂ − I⁻¹(θ) ≥ 0  (the difference is positive semidefinite); in particular, var(θ̂ᵢ) ≥ [I⁻¹(θ)]_{i,i}
Note: [I⁻¹(θ)]_{i,i} means that you first calculate the whole matrix inverse and then take its (i, i)th element. The covariance matrix of any random vector y is given by

μ_y = E(y),  C_y = E[(y − μ_y)(y − μ_y)^T]

Furthermore, an estimator attains the lower bound,

C_θ̂ = I⁻¹(θ) ⟺ ∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)

for some M-dimensional function g and some M × M matrix I. That estimator, which is the MVUE, is θ̂ = g(x), and its covariance matrix is I⁻¹(θ).
3.2 Problem 3.1
If x[n] for n = 0, 1, . . . , N − 1 are i.i.d. according to U(0, θ), show that the regularity condition does not hold. That is,

E[∂ ln p(x; θ)/∂θ] ≠ 0  for θ > 0
Solution. By definition of the expectation operator,

E[∂ ln p(x; θ)/∂θ] = ∫ (∂ ln p(x; θ)/∂θ) p(x; θ) dx = ∫ (∂p(x; θ)/∂θ) dx    (13)

where the last step follows from Eq. (7). Denote the N random variables as xᵢ = x[i − 1] for i = 1, 2, . . . , N. It is given in the problem that their PDFs are identical,

p(xᵢ; θ) = 1/θ for 0 < xᵢ ≤ θ, and 0 otherwise,

and

∫₀^θ p(xᵢ; θ) dxᵢ = 1

The multiple integral in Eq. (13) simplifies to a product of integrals,

∫ (∂p(x; θ)/∂θ) dx = [∫₀^θ (∂p(x₁; θ)/∂θ) dx₁] ··· [∫₀^θ (∂p(x_N; θ)/∂θ) dx_N]

because the xᵢ's are independent. Note that the limits of the integrals depend on θ, so we cannot interchange the order of differentiation and integration:

∫₀^θ (∂p(xᵢ; θ)/∂θ) dxᵢ ≠ (∂/∂θ) ∫₀^θ p(xᵢ; θ) dxᵢ

Hence, the regularity condition fails to hold. In fact, the LHS equals −1/θ, but the RHS equals 0!
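For contrast, here is an added illustration (my own construction, with arbitrary numbers): for a regular family like N(θ, 1) the mean of the score is zero, whereas for U(0, θ) the score equals −N/θ for every data set drawn from the model, so its expectation cannot be zero:

```python
import numpy as np

rng = np.random.default_rng(2)
N, theta, trials = 10, 2.0, 100_000

# Regular family: x ~ N(theta, 1); score = sum(x - theta). Its mean is 0.
xg = rng.normal(theta, 1.0, size=(trials, N))
score_gauss = np.sum(xg - theta, axis=1)

# Uniform family: p(x; theta) = theta**(-N) on the support, so the score
# d/dtheta ln p = -N/theta for every data set drawn from U(0, theta).
xu = rng.uniform(0.0, theta, size=(trials, N))
score_unif = np.where(xu.max(axis=1) < theta, -N / theta, np.nan)

print("E[score], Gaussian:", score_gauss.mean(), "(regularity holds: ~0)")
print("E[score], uniform: ", np.nanmean(score_unif), "(= -N/theta =", -N / theta, ")")
```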
3.3 Problem 3.3

The data x[n] = Ar^n + w[n] for n = 0, 1, . . . , N − 1 are observed, where w[n] is WGN with variance σ² and r > 0 is known. Find the CRLB for A. Show that an efficient estimator exists and find its variance.
Solution. Assuming that the x[n]'s are statistically independent, the joint PDF is

p(x; A) = Π_{n=0}^{N−1} (1/(2πσ²)^{1/2}) exp(−(1/(2σ²))(x[n] − Ar^n)²)
        = (1/(2πσ²)^{N/2}) exp(−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − Ar^n)²)

⟹ ln p(x; A) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − Ar^n)²

⟹ ∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} r^n (x[n] − Ar^n)

Since the sum

S = Σ_{n=0}^{N−1} r^{2n} = (r^{2N} − 1)/(r² − 1) for r ≠ 1, and S = N for r = 1,
is deterministic and known (because both r and N are known), the above equation simplifies to

∂ ln p(x; A)/∂A = (1/σ²)(Σ_{n=0}^{N−1} r^n x[n] − AS)    (14)

L̇ = (S/σ²)((1/S) Σ_{n=0}^{N−1} r^n x[n] − A)    (15)
  = g(A)(h(x) − A)    (16)

where g(A) = S/σ² is a constant (it doesn't even depend on A!) and

h(x) = (1/S) Σ_{n=0}^{N−1} r^n x[n]

depends on x but not on A. Hence, from Theorem 3.1, the MVUE estimate is

Â = h(x) = (1/S) Σ_{n=0}^{N−1} r^n x[n]

and the variance of Â satisfies

var(Â) ≥ σ²/S  and  CRLB = 1/g(A) = σ²/S

We can also find the second derivative from Eq. (14):

L̈ = ∂² ln p(x; A)/∂A² = (S/σ²)(0 − 1) = −S/σ²

and, as required, CRLB = −1/E(L̈); in our case E(L̈) = L̈ because it is a constant (it depends on neither x nor A).
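As an added sanity check (not part of the original solution; all constants are my own choices), this sketch computes Â = (1/S) Σ rⁿ x[n] over many noise realizations and compares its empirical variance with the CRLB σ²/S:

```python
import numpy as np

rng = np.random.default_rng(3)
N, A, r, sigma2, trials = 20, 1.5, 0.9, 0.5, 200_000

n = np.arange(N)
S = np.sum(r ** (2 * n))                      # S = sum of r^{2n}
x = A * r**n + rng.normal(0, np.sqrt(sigma2), size=(trials, N))

A_hat = (x @ r**n) / S                        # MVUE: (1/S) * sum r^n x[n]

print("mean(A_hat):", A_hat.mean(), "(true A:", A, ")")
print("var(A_hat): ", A_hat.var(), "(CRLB sigma^2/S:", sigma2 / S, ")")
```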
3.4 Problem 3.5
If x[n] = A + w[n] for n = 1, 2, . . . , N are observed, where w = [w[1], w[2], . . . , w[N]]^T ∼ N(0, C), find the CRLB for A. Does an efficient estimator exist and, if so, what is its variance?

Solution. The joint p.d.f. of x is given by

p(x; A) = (1/√(det(2πC))) exp(−(1/2)(x − Ae)^T C⁻¹ (x − Ae))

⟹ ln p(x; A) = ln(1/√(det(2πC))) − (1/2)(x − Ae)^T C⁻¹ (x − Ae)

⟹ ∂ ln p(x; A)/∂A = −(1/2)(∂/∂A)[(x − Ae)^T C⁻¹ (x − Ae)]
Using the result that

(∂/∂θ)(m^T Q m) = 2 (∂m^T/∂θ) Q m  for symmetric Q,

and setting Q = C⁻¹ and m = (x − Ae),

∂m^T/∂A = (∂/∂A)(x − Ae)^T = 0 − (∂/∂A)(A e^T) = −e^T

So

∂ ln p(x; A)/∂A = e^T C⁻¹ (x − Ae) = e^T C⁻¹ x − A e^T C⁻¹ e
The scalar e^T Q e is nothing but the sum of all the elements of Q, for any Q. Consider, for example,

[1 1 1] [a b c; d e f; g h i] = [a + d + g, b + e + h, c + f + i]    (17)

[a + d + g, b + e + h, c + f + i] [1 1 1]^T = a + b + c + d + e + f + g + h + i    (18)

So, denoting α = e^T C⁻¹ e,

∂ ln p(x; A)/∂A = e^T C⁻¹ x − A e^T C⁻¹ e = α ((e^T C⁻¹ x)/α − A)    (19)
The above expression is clearly of the form

∂ ln p(x; A)/∂A = g(A)(h(x) − A)

Hence, there exists an MVUE (the efficient estimator) given by

Â_MVUE = h(x) = (e^T C⁻¹ x)/α = (e^T C⁻¹ x)/(e^T C⁻¹ e)

and its variance is

var(Â) = 1/α = 1 / (Σ_{i=1}^N Σ_{j=1}^N [C⁻¹]_{i,j})
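The estimator is easy to exercise numerically. The following added sketch (with an arbitrary positive-definite C of my own choosing) checks that the empirical variance of Â = eᵀC⁻¹x / eᵀC⁻¹e matches 1/(eᵀC⁻¹e):

```python
import numpy as np

rng = np.random.default_rng(4)
N, A, trials = 5, 2.0, 200_000

# An arbitrary covariance matrix (identity + rank-one, guaranteed PD).
v = rng.normal(size=N)
C = np.eye(N) + np.outer(v, v)

e = np.ones(N)
Cinv_e = np.linalg.solve(C, e)
alpha = e @ Cinv_e                       # alpha = e^T C^{-1} e

w = rng.multivariate_normal(np.zeros(N), C, size=trials)
x = A + w
A_hat = (x @ Cinv_e) / alpha             # A_hat = e^T C^{-1} x / alpha

print("var(A_hat):", A_hat.var(), " vs. CRLB 1/alpha:", 1 / alpha)
```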
3.5 Problem 3.9

We observe two samples of a DC level in correlated Gaussian noise,

x[0] = A + w[0]
x[1] = A + w[1]

where w = [w[0], w[1]]^T is zero mean with covariance matrix

C = σ² [1 ρ; ρ 1]

The parameter ρ is the cross-correlation coefficient between w[0] and w[1]. Compute the CRLB for A and compare it to the case when ρ = 0 (WGN). Also explain what happens when ρ → ±1.
Solution: This is a special case of Problem 3.5 (see above) for N = 2. Since

C⁻¹ = (1/(σ²(1 − ρ²))) [1 −ρ; −ρ 1]

the CRLB is

var(Â) ≥ 1/(e^T C⁻¹ e) = σ²(ρ² − 1)/(2(ρ − 1)) = σ²(1 + ρ)/2

When ρ = 0, var(Â) ≥ σ²/2, as expected. But when ρ → ±1, the matrix C becomes singular, so its inverse does not exist. For ρ → +1 the samples w[0] and w[1] become perfectly correlated, so the second sample carries no additional information and the bound tends to σ²; for ρ → −1 the bound tends to 0, because the noise can then be cancelled exactly by averaging the two samples.
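A short added sketch tabulates the closed form σ²(1 + ρ)/2 against the general expression 1/(eᵀC⁻¹e) from Problem 3.5 for a few values of ρ:

```python
import numpy as np

sigma2 = 1.0
for rho in [-0.99, -0.5, 0.0, 0.5, 0.99]:
    C = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    e = np.ones(2)
    crlb = 1.0 / (e @ np.linalg.solve(C, e))   # 1 / (e^T C^{-1} e)
    print(f"rho={rho:+.2f}: CRLB={crlb:.4f}  closed form={sigma2 * (1 + rho) / 2:.4f}")
```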
3.6 Problem 3.13

Consider polynomial curve fitting,

x[n] = Σ_{k=0}^{p−1} A_k n^k + w[n]

for n = 0, 1, . . . , N − 1, where w[n] is i.i.d. WGN with variance σ². It is desired to estimate {A₀, A₁, . . . , A_{p−1}}. Find the Fisher information matrix for this problem.
Solution: The joint p.d.f. is

p(x; A) = Π_{n=0}^{N−1} (1/√(2πσ²)) exp(−(1/(2σ²))(x[n] − Σ_{k=0}^{p−1} A_k n^k)²)
        = (1/(2πσ²)^{N/2}) exp(−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − Σ_{k=0}^{p−1} A_k n^k)²)

⟹ ln p(x; A) = −ln((2πσ²)^{N/2}) − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − Σ_{k=0}^{p−1} A_k n^k)²

⟹ ∂ ln p(x; A)/∂Aᵢ = (1/σ²) Σ_{n=0}^{N−1} (x[n] − Σ_{k=0}^{p−1} A_k n^k) nⁱ

because

(∂/∂Aᵢ) Σ_{k=0}^{p−1} A_k n^k = (∂/∂Aᵢ)(A₀n⁰ + A₁n¹ + ··· + Aᵢnⁱ + ··· + A_{p−1}n^{p−1}) = 0 + ··· + 0 + nⁱ + 0 + ··· + 0 = nⁱ
Hence the simplification:

∂ ln p(x; A)/∂Aᵢ = (1/σ²) Σ_{n=0}^{N−1} (x[n] − Σ_{k=0}^{p−1} A_k n^k) nⁱ

⟹ ∂² ln p(x; A)/∂Aⱼ ∂Aᵢ = (1/σ²) Σ_{n=0}^{N−1} (0 − nʲ) nⁱ = −(1/σ²) Σ_{n=0}^{N−1} n^{i+j}

Hence, by definition, the (i, j)th entry of the p × p Fisher information matrix I(A) is given by

[I(A)]_{i,j} = −E[∂² ln p(x; A)/∂Aᵢ ∂Aⱼ] = (1/σ²) Σ_{n=0}^{N−1} n^{i+j}

for i, j = 0, 1, . . . , p − 1. Note that the Fisher information matrix is symmetric, so the order of evaluation of the partial derivatives can be interchanged.
See pg. 42, Eq. (3.22) in the textbook for a special case of the above with p = 2. Note that for the (0, 0)th entry of the matrix, the above expression gives

Σ_{n=0}^{N−1} n^{0+0} = 0⁰ + 1⁰ + ··· + (N − 1)⁰ = N

where 0⁰ must be taken as 1 (even though some authors disagree).
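The matrix is simple to form in code. This added sketch builds I(A) for given p, N, σ²; as a spot check, its (0, 0) entry equals N/σ²:

```python
import numpy as np

def fisher_poly(p, N, sigma2):
    """[I(A)]_{ij} = (1/sigma2) * sum_{n=0}^{N-1} n^(i+j), for i, j = 0..p-1."""
    n = np.arange(N, dtype=float)
    I = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            I[i, j] = np.sum(n ** (i + j)) / sigma2  # NumPy gives 0.0**0 == 1.0
    return I

I = fisher_poly(p=3, N=10, sigma2=2.0)
print(I)
print("(0,0) entry:", I[0, 0], " = N/sigma2 =", 10 / 2.0)
```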
4 Chapter 5: Sufficient Statistics

Neyman-Fisher Factorization Theorem. If we can factor the p.d.f. p(x; θ) as

p(x; θ) = g(T(x), θ) h(x)

where g(·) is a function depending on x only through T(x) and h(·) is a function depending only on x, then T(x) is a sufficient statistic for θ. The converse is also true.

4.1 Problem 5.2
The IID observations x_n for n = 1, 2, . . . , N have the exponential p.d.f.

p(x_n; σ²) = (x_n/σ²) exp(−x_n²/(2σ²)) for x_n > 0, and 0 otherwise.

Find a sufficient statistic for σ².

Solution. Let u(t) be the unit step function. Because the observations are independent, the joint PDF of x₁, x₂, . . . , x_N is given by

p(x; σ²) = Π_{n=1}^N p(x_n; σ²)
= Π_{n=1}^N (x_n/σ²) exp(−x_n²/(2σ²)) u(x_n)

= [Π_{n=1}^N x_n u(x_n)] · (1/σ^{2N}) exp(−(1/(2σ²)) Σ_{n=1}^N x_n²)

= h(x) · g(T(x), σ²)

whence the sufficient statistic for σ² is

T(x) = Σ_{n=1}^N x_n²
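To illustrate what sufficiency buys (an added sketch, not from the original text): the likelihood depends on the data only through T(x), so maximizing g(T, σ²) over σ² needs just that one number. Setting ∂ ln g/∂σ² = −N/σ² + T/(2σ⁴) = 0 gives σ̂² = T/(2N), which the snippet below checks numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, N = 1.5, 100_000

# Samples with the PDF above, via inverse-CDF: x = sigma * sqrt(-2 ln U).
u = rng.uniform(size=N)
x = np.sqrt(sigma2) * np.sqrt(-2.0 * np.log(1.0 - u))

T = np.sum(x**2)                 # the sufficient statistic
print("sigma2_hat = T/(2N) =", T / (2 * N), " (true sigma2:", sigma2, ")")
```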
4.2 Problem 5.5

The IID observations x_n for n = 1, 2, . . . , N are distributed according to U[−θ, θ], where θ > 0. Find a sufficient statistic for θ.

Solution. The individual sample p.d.f. is given by

p(x_n; θ) = 1/(2θ) for −θ < x_n < θ, and 0 otherwise.
The joint p.d.f. is given by

p(x; θ) = Π_{n=1}^N p(x_n; θ) = 1/(2θ)^N if −θ < x_n < θ for all n, and 0 otherwise.

Define a function bool(S) for any mathematical statement S such that bool(S) = 1 if S is true and 0 if S is false. (This is also called the indicator function; see Wikipedia.) Then

p(x; θ) = (1/(2θ)^N) bool(−θ < x_n < θ ∀n)
But

x_n < θ ∀n ⟹ θ > x₁ and θ > x₂ and ··· and θ > x_N
⟹ (θ > x₁) ∩ (θ > x₂) ∩ ··· ∩ (θ > x_N)
⟹ θ > max{x₁, x₂, . . . , x_N}

Similarly,

−θ < x_n ⟹ θ > −x_n ⟹ θ > max{−x₁, −x₂, . . . , −x_N}

Combining both of the above,

−θ < x_n < θ ⟹ (−θ < x_n) ∩ (θ > x_n) ⟹ (θ > max(−x)) ∩ (θ > max(x)) ⟹ θ > max{|x₁|, |x₂|, . . . , |x_N|}
So the joint p.d.f. becomes

p(x; θ) = (1/(2θ)^N) bool(max{|x₁|, |x₂|, . . . , |x_N|} < θ) = g(T(x), θ) h(x)

where h(x) = 1 and

T(x) = max{|x₁|, |x₂|, . . . , |x_N|}

g(T(x), θ) = (1/(2θ)^N) bool(T(x) < θ)

Hence, by the Neyman-Fisher factorization theorem, T(x) as given above is the sufficient statistic. Note: The sample mean is not a sufficient statistic for the uniform distribution!
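An added numerical illustration (my own, with arbitrary numbers): the factorization says the likelihood as a function of θ is determined by T(x) = max|x_n| alone. The snippet evaluates both the full product and g(T, θ)h(x) and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true, N = 3.0, 8
x = rng.uniform(-theta_true, theta_true, size=N)

T = np.max(np.abs(x))                       # the sufficient statistic

def joint_pdf(x, theta):
    return np.prod(np.where(np.abs(x) < theta, 1.0 / (2 * theta), 0.0))

def g_times_h(T, theta):                    # h(x) = 1
    return (T < theta) / (2 * theta) ** N

for theta in [2.0, T + 0.01, 4.0, 6.0]:
    print(f"theta={theta:.3f}: joint={joint_pdf(x, theta):.3e}, g*h={g_times_h(T, theta):.3e}")
```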
5 Chapter 7: MLE

The MLE for a scalar parameter is defined as the value of the parameter t that maximizes p(x; t) for a given, fixed x, i.e., the value that maximizes the likelihood function. The maximization is performed over the allowable range of t. To find the MLE, solve the equation

∂ ln p(x; t)/∂t = 0

for t. This equation may have multiple solutions, and you should choose the appropriate one.

Theorem. If an efficient estimator (an estimator that attains the CRLB) exists, then the MLE procedure will find it.

The MLE is

• asymptotically unbiased, i.e., E(t̂) → t as N → ∞.
• asymptotically efficient, i.e., var(t̂) → CRLB as N → ∞.

• asymptotically optimal, i.e., both of the above hold.

Theorem. If the pdf p(x; t) is twice differentiable and the Fisher information I(t) is nonzero, then the MLE of the unknown parameter t is asymptotically distributed (for large N) according to

t̂ ∼ N(t, I⁻¹(t))

i.e., Gaussian distributed with mean equal to the true value t and variance equal to the CRLB (= the inverse of the Fisher information).

Theorem. Assume that the MLE t̂ of an unknown parameter t is known. Consider a transformation of t,

τ = f(t)

for any function f(·). Then the MLE τ̂ of τ is simply

τ̂ = f(t̂)
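As a closing illustration (my own sketch, not part of the original notes; the DC-level model and all constants are arbitrary choices), the following finds the MLE of a DC level A in WGN by a direct grid search over the log-likelihood, then demonstrates the invariance property for τ = A²:

```python
import numpy as np

rng = np.random.default_rng(7)
A_true, sigma2, N = 1.3, 1.0, 1000

x = A_true + rng.normal(0.0, np.sqrt(sigma2), size=N)

def loglike(A):
    return -np.sum((x - A) ** 2) / (2 * sigma2)   # up to an additive constant

grid = np.linspace(0.0, 3.0, 30_001)
A_mle = grid[np.argmax([loglike(A) for A in grid])]

print("A_mle:", A_mle, " (closed form: the sample mean =", x.mean(), ")")
print("MLE of tau = A^2 by invariance:", A_mle**2)
```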