updated: 11/23/00, 1/12/03 (answer to Q7 of Section 1.3 added)
Hayashi Econometrics : Answers Answers to Selected Review Questions Questions
Chapter 1 Section 1.1
1. The intercept is increased by log(100). 2. Since (ε (εi , xi ) is independent of (ε (εj , x1 , . . . , xi X, εj ) = E(ε E(εi | xi ). So
−1
E(ε E(εi εj | X) = E[E(ε E[E(εj ε i | X, εj ) | X] = E[ε E[εj E(ε E(εi | X, εj ) | X] = E[ε E[εj E(ε E(εi | xi ) | X] = E(ε E(εi | xi ) E(ε E(εj | xj ).
, xi+1 , . . . , xn ) for i = j , we hav have: E(ε E(εi |
(by (by La Law w of Iter Iterat ated ed Expec Expecta tati tion ons) s) (by (by linear linearit ity y of condit condition ional al expecta expectatio tions) ns)
The last equality follows from the linearity of conditional expectations because E(ε E( εi | x i ) is a function of xi . 3.
E(y E(yi | X) = E(x E(xi β + ε + εi | X) (by (by Assu Assump mpti tion on 1.1) 1.1) = x i β + E(ε E( εi | X) (since xi is included in X in X)) = x i β (by Assumption 1.2). 1.2).
Conversely, suppose E(y E(yi | X) = x i β (i = 1, 2, . . . , n) n). Define ε Define ε i ≡ yi − E(y E(yi | X). Then by construction Assumption 1.1 is satisfied: εi = y = y i − xi β. Assumption Assumption 1.2 is satisfied satisfied becau b ecause se
E(ε E(εi | X) = E(y E(yi | X) − E[E(y E[E(yi | X) | X] (by defin definit itio ion n of ε ε i here) =0 (since E[E(yi | X) | X] = E(y E(yi | X)). )). 4. Because of the result in the previous review question, what needs to be verified is Assumption 1.4 and that E( CON i | YD 1 , . . . , YD n ) = β 1 + β + β 2 YD i . That the latter latter holds holds is clear from the i.i.d. assumption assumption and the hint. From the discussion discussion in the text on random samples, samples, Assumption 1.4 is equivalent to the condition that E(ε E(εi2 | YD i ) is a constant, where εi ≡ CON i − β 1 − β 2 YD i . E(ε E(εi2 | YD i ) = Var(ε Var(εi |
YD i )
= Var(CON i |
(since E(εi |
YD i ).
This is a constant since ( CON i , YD i ) is jointly normal. 5. If x x i2 = x j 2 for all i, all i, j , then the rank of X X would be one. 1
YD i )
= 0)
6. By the Law of Total Expectations, Assumption 1.4 implies E(ε E(εi2 ) = E[E(ε E[E(εi2 | X)] = E[σ E[σ 2 ] = σ 2 . Similarly for E(ε E(εi εj ). Section 1.2
5. (b)
e e = (Mε) (Mε) = ε M Mε (recall (recall from matrix matrix algebra algebra that (AB) AB) = B A ) = ε MMε (since M (since M is symmetric) = ε Mε (since M (since M is itempotent). itempotent).
6. A change in the unit of measurement for y means that y that y i gets multiplied by some factor, say λ, for all i all i.. The OLS formula shows that b that b gets gets multiplied by λ by λ.. So y So y gets gets multiplied by the 2 same factor λ, leaving R unaffected. unaffected. A change change in the unit of measureme measurement nt for regressors regressors 2 leaves x leaves x i b, and hence R hence R , unaffected.
Section 1.3
4(a). Let d ≡ β − E(β | X), a ≡ β − E(β ), and c ≡ E(β | X) − E(β ). Then Then d = a − c and dd = aa − ca − ac + cc . By taking unconditional expectations of both sides, we obtain
E(dd E(dd ) = E(aa E(aa ) − E(ca E(ca ) − E(ac E(ac ) + E(cc E(cc ).
Now,
E(dd E(dd ) = E[E(dd E[E(dd | X)]
(by (by La Law w of Total otal Expec Expecta tati tion ons) s)
= E E[(β − E(β | X))(β − E(β | X)) | X] = E Var(β | X)
(by the first equation equation in the the hint) hint).
By definition of variance, E(aa E(aa ) = Var(β ). By the second equation equation in the hint, E(cc E(cc ) = Var[E(β | X)]. For E(ca E(ca ), we have:
E(ca E(ca ) = E[E(ca E[E(ca | X)]
= E E[(E(β | X) − E(β))(β − E(β)) | X]
= E (E(β | X) − E(β))E[(β − E(β)) | X]
= E (E(β | X) − E(β))(E(β | X) − E(β ))
= E(cc E( cc ) = Var[E(β | X)]. )].
Similarly, E(ac E(ac ) = Var[E(β | X)] .
2
4(b). Since by assumption E(β | X) = β , we have Var[E(β | X)] = 0. 0 . So the equality in (a) for the unbiased estimator β becomes Var(β) = E[Var(β | X)]. Similarly for the OLS estimator b, we have: Var(b Var(b) = E[Var(b E[Var( b | X)]. As noted in the hint, E[Var( β | X)] ≥ E[Var(b E[Var(b | X)].
7. pi is the i-th diagonal element of the projection matrix P. Since Since P is positive semi-definite, n its diagonal elements elements are all non-negative non-negative.. Hence pi ≥ 0. K because this sum i=1 pi = K because equals the trace of P which P which equals K equals K . To show that p that p i ≤ 1, first note that p that p i can be written as: ei Pei where e where e i is an n an n-dimensional -dimensional i i-th -th unit vector (so its i its i-th -th element is unity and the other elements elements are all zero). zero). Now, Now, recall recall that for the annihilator annihilator M, we have M = I − P and M and M is positive semi-definite. So
ei Pei = e i ei − ei Mei = 1 − ei Mei (since e (since e i ei = 1) sincee M is positive semi-definite). semi-definite). ≤ 1 (sinc
Section 1.4
6. As explained in the text, the overall significance increases with the number of restrictions to be tested if the t test is applied to each restriction without adjusting the critical value. Section 1.5
2. Since ∂ 2 log L(ζ )/(∂ θ ∂ ψ ) = 0, the information matrix I(ζ ) is block diagonal, with its first block corresponding to θ and the second corresponding to ψ . The inverse is block diagonal, with its first block being the inverse of
−E
∂ 2 log L(ζ )
∂ θ ∂ θ
.
So the Cramer-Rao bound for θ is the negative of the inverse of the expected value of (1.5.2). (1.5.2). The expectation, expectation, howeve however, r, is over over y and X because here the density is a joint density. Therefore, the Cramer-Rao bound for β is σ is σ 2 E[(X E[(X X)] 1 .
−
Section 1.6
3. Var(b Var(b | X) = (X ( X X)
−1
−1
X Var(ε | X)X(X X)
.
Section 1.7
2. It just changes the intercept by b 2 times log(1000). log(1000). 5. The restricted regression is log
TC i
pi2
pi1 = β = β 1 + β + β 2 log(Q log(Qi ) + β + β 3 log pi2
+ β 5 log
pi3 pi2
+ εi .
(1)
The OLS estimate of (β (β 1 , . . . , β5 ) from (1.7.8) is (− ( −4.7, 0.72 72,, 0.59 59,, −0.007 007,, 0.42). 42). The OLS OLS estimate from the above restricted regression should yield the same point estimate and standard errors. The S SR should be the same, but R but R 2 should be different. 3
6. That’s because the dependent variable in the restricted regression is different from that in the unrestricted regression. If the dependent variable were the same, then indeed the R2 should be higher for the unrestricted model. 7(b) No, because when the price of capital is constant across firms we are forced to use the adding-up restriction β 1 + β 2 + β 3 = 1 to calculate β 2 (capital’s contribution) from the OLS estimate of β 1 and β 3 . 8. Because input choices can depend on εi , the regressors would not be orthogonal to the error term. Under the Cobb-Douglas technology, input shares do not depend on factor prices. Labor share, for example, should be equal to α1 /(α1 +α2 +α3 ) for all firms. Under constant returns to scale, this share equals α 1 . So we can estimate α’s without sampling error.
4
updated: 11/23/00
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 2 Section 2.1
1. For n sufficiently large, zn
| − α| < ε, which means Prob(|z − α| > ε) = 0. 2. The equality in the hint implies that lim E[(z −z) (z −z)] = 0 if and only if lim n
n→∞
n
n
n→∞
E[(znk
2
zk ) ] = 0 for all k.
Section 2.2
6. Because there is a one-to-one mapping between (gi−1 , . . . , g1 ) and (zi−1 , . . . , z1 ) (i.e., the value of (gi−1 , . . . , g1 ) can be calculated from the value of (zi−1 , . . . , z1 ) and vice versa), E(gi gi−1 , . . . , g1 ) = E(gi zi−1 , . . . , z1 ) = E(zi zi−1 zi−1 , . . . , z1 ) = 0.
|
|
−
|
7. E(gi gi−1 , . . . , g2 ) = E[E(εi εi−1 εi−1 , . . . , ε1 ) gi−1 , . . . , g 2 ]
|
(by Law of Iterated Expectations) | | = E[ε E(ε | ε , . . . , ε ) | g , . . . , g ] (by linearity of conditional expectations) = 0 (since {ε } is independent white noise). 8. Let x ≡ r . Since (x , . . . , x ) (i ≥ 3) has less information than (y , . . . , y ), we have E(x | x , . . . , x ) = E[E(x | y , . . . , y ) | x , . . . , x ] for i ≥ 3. It is easy to show by the Law of Iterated Expectations that E( x | y , . . . , y ) = 0.
·
i−1
i
i−1
1
i−1
2
i−1
2
i−1
2
i
i
i1
i
i−2
i−2
i
1
i−1
1
2
i−2
i
1
Section 2.3
1. We have shown on several occasions that “E(εi xi ) = 0” is stronger than “E(xi εi ) = 0”.
|
·
2(a) No, E(ε2i ) does not need to exist or to be finite. 3. S = E(ε2i xi xi ) = E[E(ε2i xi xi xi )] = E[E(ε2i xi )xi xi ]. The second equality is by Law of Total Expectations. The third equality is by the linearity of conditional expectations.
|
|
4. You can use Lemma 2.3(a) to claim, for example, plim(b β ) S (b (b β ) S (b β ) is a continuous function of b β and S .
−
xx
−
−
−
xx
− β) = 0 because
xx
5. When you use a consistent estimator β to calculate the estimated residual εi , (2.3.1) becomes 1 n
n
n
− − 2
εi i
i=1
1 = n
ε2i
2(β
i=1
− −
β) g + ( β
β) S
xx
(β
β).
You can use exactly the same argument to claim that the second and the third terms on the RHS converge to zero in probability. 1
−
Section 2.4
1. Yes, SE ∗(bk ) converges to zero in probability. Consequently, the confidence interval shrinks to a point.
2. By the delta method, Avar( λ) = (1/β )2 Avar(b). The standard error of λ is by definition the square root of 1/n times the estimated asymptotic variance. 3. Inspection of the formula (2.4.2) for W reveals that the numerical value of W is invariant to F. So, a fortiori , the finite-sample distribution and the asymptotic distribution are not affected. Section 2.5
1. No, because (2.5.1) cannot be calculated by those sample means alone. 2. First, (2.5.4’) involves multiplication by n, which is required because it is an asymptotic variance (the variance of the limiting distribution of n times a sampling error). Second, the middle matrix B houses estimated errors, rather than error variances.
√
Section 2.6
5. From the equaion in the hint, we can derive: nR2 =
n−K n
1 + n (K 1
− 1)F (K − 1)F.
Since (K 1)F converges in distribution to a random variable, n1 (K 1)F p 0 by Lemma 2.4(b). So the factor multiplying (K 1)F on the RHS converges to 1 in probability. Then by Lemma 2.4(c), the asymptotic distribution of the RHS is the same as that of (K 1)F , which is chi-squared.
−
− →
−
−
Section 2.8
1. The proof is a routine use of the Law of Total Expectations. E(zi ηi ) = E[E(zi ηi xi )] (by Law of Total Expectations) = E[zi E(ηi xi )] (by linearity of conditional expectations) = 0.
·
· | · |
2. The error may be conditionally heteroskedastic, but that doesn’t matter asymptotically because all we need from this regression is a consistent estimate of α . Section 2.9
1. E[ηφ(x)] = E E[ηφ(x) x] (by Law of Total Expectations) = E φ(x) E[η x] (by linearity of conditional expectations) = 0 (since E(η x) = E[y E(y x) x] = E(y x) E(y x) = 0).
{ {
| } | } |
−
| |
2
| −
|
2. Use (2.9.6) to calculate E * (εi εi−1 , . . . , εi−m ). (I am using E* for the least squares projection operator.) It is zero. For E* (εi 1, εi−1 , . . . , εi−m ), use (2.9.7). For white noise processes, µ = 0 and γ = 0. So E* (εi 1, εi−1 , . . . , εi−m ) = 0. The conditional expectation, as opposed to the least squares projection, may not be zero. Example 2.4 provides an example.
| | |
3. If E(y x) = µ + γ x, then y can be written as: y = µ + γ x + η, E(η x) = 0. So Cov(x, y) = Cov(x, µ + γ x + η) = Var(x)γ . Also, E(y) γ E(x) = µ. Combine these results with (2.9.7).
|
−
|
4(b) β. 4(c) The answer is uncertain. For the sake of concreteness, assume yi , xi , zi is i.i.d. Then the asymptotic variance of the estimate of β from part (a) is Σ−1 E(ε2i xi xi )Σ−1 . The asymptotic variance of the estimate of β from part (b) is Σ−1 E[(zi δ + εi )2 xi xi ]Σ−1 . For concreteness, strengthen the orthogonality of xi to (εi , zi ) by the condition that xi is independent of (εi , zi ). Then these two expressions for the asymptotic variance becomes: E(ε2i )Σ−1 and E[(zi δ + ε i )2 ]Σ−1 . Since εi is not necessarily orthogonal to zi , E(ε2i ) may or may not be greater than E[(zi δ + εi )2 ].
{
}
xx
xx
xx
xx
xx
xx
Section 2.10
1. The last three terms on the RHS of the equation in the hint all converges in probability to µ 2 . 2. Let c be the p-dimensional vector of ones, and dn =
n + 2 n + 2 n + 2 , ,..., n 1 n 2 n p
−
−
−
.
Then the Box-Pierce /Q can be written as c xn and the modified Q as dn xn . Clearly, an c dn converges to zero as n .
≡ −
→∞
Section 2.11
2. You have proved this for general cases in review question 3 of Section 2.9. 3. We can now drop the assumption of constant real rates from the hypothesis of efficient markets. Testing market efficiency then is equivalent to testing whether the inflation forecast error is an m.d.s. 4. If inflation and interest rates are in fractions, then the OLS estimate of the intercept gets divided by 100. The OLS estimate of the interest rate coefficient remains the same. If the inflation rate is in percent per month and the interest rate is in percent per year, then both the intercept and the interest rate coefficient is deflated by a factor of about 12. 5. For the third element of gt , you can’t use the linearity of conditional expectations as in (2.11.7).
3
updated: February 17, 2001; January 16, 2002 (minor correction on 3.3.8); February 4, 2003 (correction on 3.1.4); February 23, 2004 (correction on 3.3.8)
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 3 Section 3.1
1. By (3.1.3a),
Cov( pi , ui ) =
Cov(vi , ui ) Var(ui ) . α1 β 1
− −
The numerator can be positive. 2. The plim of the OLS estimator equals
α0 + α1 4. By (3.1.10a), Cov( pi , ui ) =
−
Cov( pi , ui ) Var( pi )
E( pi ).
− Var(u )/(α − β ) = 0 and Cov( p , ζ ) = Var(ζ )/(α − β ) = 0. i
1
1
i
i
i
1
1
xi remains a valid instrument without the assumption that demand and supply shifters are uncorrelated. Section 3.2
2. After the substitution inidcated in the hint, you should find that the log labor coefficient is
unity in the output equation. 3. The demand for labor is now
Li =
w p
exp
vi 1 φ1
.
exp
.
1
φ1 −1
(Ai )
1 1−φ1
(φ1 )
1 1−φ1
−
Substitute this into the production function to obtain Qi =
w p
φ1 φ1 −1
(Ai )
1 1−φ1
(φ1 )
φ1 1−φ1
vi 1 φ1
−
So the ratio of Q i to L i doesn’t depend on A i or v i . Section 3.3
1. The demand equation in Working’s model without observable supply shifter cannot be iden-
tified because the order condition is not satisfied. With the observable supply shifter, the demand equation is exactly identified because the rank condition is satisfied, as explained in the text, and the order condition holds with equality. 1
2. Yes. 3. The orthogonality condition is
E[log(Qi )]
− φ − φ E[log(L )] = 0. 0
1
i
4. In Haavelmo’s example, yi = C i , zi = (1, Y i ) , xi = (1, I i ) . In Friedman’s PIH, yi = C i , zi = Y i , xi = 1. In the production function example, y i = log(Qi ), zi = (1, log(Li )) , xi = 1. 5.
is a linear combination of the L columns of Σxz (see (3.3.4)). So adding columns of Σ xz doesn’t change the rank.
σ xy
σ xy
to the
6. Adding extra rows to Σxz doesn’t reduce the rank of Σxz . So the rank condition is still
satisfied. 7. The linear dependence between AGE i , EXPRi , and S i means that the number of instruments is effectively four, instead of five. The rank of Σ xz could still be four. However, the full-rank
(non-singularity) condition in Assumption 3.5 no longer holds. For
α
So
α
α
= (0, 1, 1, 1, 0) ,
− −
gi gi = ε 2i (α xi )xi = 0 .
E(gi gi ) = 0 , which means E(gi gi ) is singular.
, which is of full column rank. E(ε2i xx ) = A E(gi gi )A . This is nonsingular because A is of full row rank and E( gi gi ) is positive definite.
8. Σxz
≡ E(xz ) = AΣ
xz
Section 3.4
2. 0. Section 3.5
3. The expression in brackets in the hint converges in probability to zero.
√ ng converges in
distribution to a random variable. So by Lemma 2.4(b), the product converges to zero in probability. 4. The three-step GMM estimator is consistent and asymptotically normal by Proposition 3.1. Since the two-step GMM estimator is consistent, the recomputed S is consistent for S . So
by Proposition 3.5 the three-step estimator is asymptotically efficient. Section 3.6
1. Yes. 3. The rank condition for x 1i implies that K 1
≥ L.
4. No, because J 1 = 0. Section 3.7
2. They are asymptotically chi-squared under the null because S is consistent. They are, however,
no longer numerically the same. 2
Section 3.8
1. Yes. 2. Without conditional homoskedasticity, 2SLS is still consistent and asymptotically normal, if
not asymptotically efficient, because it is a GMM estimator. Its Avar is given by (3.5.1) with W = (σ 2 Σxx ) 1 . −
5. Sxz is square. 7. No.
3
updated: December 10, 2000; January 17, 2002 (minor correction on 4.6.3)
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 4 Section 4.5 2. Even
without conditional homoskedasticity, FIVE is consistent and asymptotically normal because it is a GMM estimator. It is not efficient because its choice of W is not efficient without conditional homoskedasticity.
3. They 4. The
are numerically the same.
hint is the answer.
5. This
is so because
xi is
the union of all the regressors.
6. The
SUR estimator with this expanded xi is numerically the same as the SUR estimator without MED in x i . Sargan’s statistic will be numerically different. The degrees of freedom of its chi-square asymptotic distribution increase by two.
Section 4.6 1. The
rank condition is violated if zim1 = z im2 = 1.
2. Not
necessarily.
3. The
efficient GMM estimator is (4.6.6) with xim = zim, Wmh = (m, h) block of S given in (4.3.2) (or (4.5.3) under conditional homoskedasticity) with xim = z im. It is not the same as pooled OLS unless the estimated error covariance Σ happens to be spherical. It is not the same as the RE estimator because the orthogonality conditions used here are different from those used by RE.
1
updated: 12/10/00
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 5 Section 5.1
2. bi = (1, IQ i ) , β = (φ2 − φ1 , φ3 − φ1 , β ) , and γ = (φ1 , γ ) . . . 3. Let s i be (S69 , S80 , S82 ) . Then QF i = [Q..Qsi ]. So QFi ⊗ xi = [Q ⊗ xi ..Qsi ⊗ xi ] and . E(QFi ⊗ xi ) = [E(Q ⊗ xi )..E(Qsi ⊗ xi )] (3K ×4)
(3K ×3)
2/3 E(x ) x ) = 1/3 E(x ) i
E(Q ⊗
i
(3K ×1)
−1/3 E(xi )
2/3 E(xi ) −1/3 E(xi ) −1/3 E(xi ) −
i
1/3 E(x ) .
−1/3 E(xi ) −
i
2/3 E(xi )
The columns of this matrix are not linearly independent because they add up to a zero vector. Therefore, E(QFi ⊗ xi ) cannot be of full column rank. (3K ×4)
Section 5.2
1. No. 4. Since ηi = Qεi , E(ηi ηi ) = QΣQ, where Σ because Q is singular.
≡
E(εi εi ). This matrix cannot be nonsingular,
Section 5.3
1.
1/2 Q = 0
0 1 −1/2 0
−1/2
0 1/2
.
Section 5.4
2(b) If Cov(sim , yim − yi,m−1 ) = 0 for all m, then Σ becomes xz
Σ
1 E(s ) = 0 i1
xz
0
0 E(yi1 − yi0 ) 0 E(si1 ) E(yi1 − yi0 ) . 1 E(yi2 − yi1 ) E(si2 ) E(si2 ) E(yi2 − yi1 )
This is not of full column rank because multiplication of Σ from the right by (E(yi1 − yi0 ), E(yi2 − yi1 ), 1) produces a zero vector. xz
1
updated: 12/15/00
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 6 Section 6.1
≡ |γ |. Then s − s = |γ | for m > n. sequence {s } is Cauchy, and hence is convergent. 3. Proof that “β (L) = α(L)− δ (L) ⇒ α(L)β (L) = δ (L)”: n j =1
1. Let sn
j
m
m j =n+1
n
j
Since sm
n
| − s | → 0, the n
1
α(L)β (L) = α(L)α(L)−1 δ (L) = δ (L). Proof that “α(L)β (L) = δ (L)
⇒ α(L) = δ (L)β (L)− ”: 1
δ (L)β (L)−1 = α(L)β (L)β (L)−1 = α(L). Proof that “α(L) = δ (L)β (L)−1
⇒ α(L)β (L) = δ (L)”:
α(L)β (L) = δ (L)β (L)−1 β (L) = δ (L)β (L)β (L)−1 = δ (L). 4. The absolute value of the roots is 4/3, which is greater than unity. So the stability condition
is met. Section 6.2 ∗
|
1. By the projection formula (2.9.7), E (yt 1, yt−1 ) = c + φyt−1 . The projection coefficients does
∗
not depend on t. The projection is not necessarily equal to E( yt yt−1 ). E (yt 1, yt−1 , yt−2 ) = c + φy t−1 . If φ > 1, then yt−1 is no longer orthogonal to εt . So we no longer have ∗ E (yt 1, yt−1 ) = c + φyt−1 .
|
| |
|
|
3. If φ(1) were equal to 0, then φ(z) = 0 has a unit root, which violates the stationarity condition.
To prove (b) of Proposition 6.4, take the expected value of both sides of (6.2.6) to obtain E(yt )
− φ E(y − ) − · · · φ E(y − ) = c. Since {y } is covariance-stationary, E(y ) = ···− E(y − ) = µ. So (1 − φ −···− φ )µ = c. t
1
t
1
t
p
t p
t p
1
p
Section 6.3
4. The proof is the same as in the answer to Review Question 3 of Section 6.1, because for inverses we can still use the commutatibity that A (L)A(L)−1 = A (L)−1 A(L). 5. Multiplying both sides of the equation in the hint from the left by A(L)−1 , we obtain B(L)[A(L)B(L)]−1 = A(L)−1 . Multiplying both sides of this equation from the left by B(L)−1 , we obtain [ A(L)B(L)]−1 = B (L)−1 A(L)−1 .
1
Section 6.5
√
(yn , . . . , y1 ) . Then Var( ny) = Var(1 y/n = 1 Var(y)1/n). By covariancestationarity, Var(y) = Var(yt , . . . , yt−n+1 ).
1. Let y
≡
3. lim γ j = 0. So by Proposition 6.8, y
→
m.s.
µ, which means that y
→ µ. p
Section 6.6
1. When z t = x t , the choice of S doesn’t matter. The efficient GMM etimator reduces to OLS. 2. The etimator is consistent because it is a GMM etimator. It is not efficient, though. Section 6.7
X(X ΩX) X , where is the vector of estimated residuals. . The truncated kernel-based estimator with a bandwidth ω be the (i, j) element of Ω 4. Let of q can be written as (6.7.5) with ω = ε ε for (i, j) such that |i − j | ≤ q and ω =0 otherwise. The Bartlett kernel based estimator obtains if we set ω = ε ε for (i, j) such that |i − j | < q and ω = 0 otherwise. β ) > Avar( β ) when, for example, ρ = φ . This is consistent with the fact that 5. Avar( 2. J =
ε
−1
ε
ε
ij
ij
i j
ij
ij
q−|i−j | i j q
ij
OLS
GLS
j
j
OLS is efficient, because the orthogonality conditions exploited by GLS are different from those exploited by OLS.
2
updated: 11/23/00
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 7 Section 7.1
1. m(wt ; θ ) = − [yt − Φ(xt θ )]2 . 2. Since E(yt | x t ) = Φ (xt θ 0 ), we have:
E[xt · (yt − Φ(xt θ 0 )) | x t ] = x t E[yt − Φ(xt θ 0 ) | x t ] = 0. Use the Law of Total Expectations. g (wt ; θ ) = x t · (yt − Φ(xt θ )).
5. Qn is (7.1.3) with g (wt ; θ ) = x t · (yt − θ0 zt ). Qn is (7.1.3) with g (wt ; θ ) = x t · (zt − λ0 yt ). Section 7.2
2. Sufficiency is proved in the text. To show necessity, suppose (7.2.10) were false. Then there exists a θ1 in Θ such that φ(xt ; θ 1 ) = φ (xt ; θ 0 ). Then from (7.2.9), E[ {yt − φ(xt ; θ 1 )}2 ] = E[{yt − φ(xt ; θ0 )}2 ]. This is a contradiction because θ0 is the only maximizer. 3. What needs to be proved is: “E(xt xt ) nonsingular” ⇒ “xt θ = xt θ0 for
θ
=
θ 0 ”.
Use the
argument developed in Example 7.8. = Φ (xt θ 0 ) for θ = θ 0 ”. It was 4. What needs to be proved is: “E( xt xt ) nonsingular” ⇒ “Φ(xt θ ) shown in the previous review question that the nonsingularity condition implies x t θ = x t θ 0 for θ = θ 0 . 7. The Hessian matrix for linear GMM is negative definite. So the objective function is strictly
concave. = 0 for 8. So the identification condition is E[ g(wt ; θ 0 )] = 0 and W E[g(wt ; θ )]
= θ 0 . θ
Section 7.3
1. A better question would be as follows.
Consider a random sample ( w1 , . . . , wn ). Let f (wt ; θ0 ) be the density of wt , where θ 0 is the p-dimensional true parameter vector. The log likelihood of the sample is n
L(w1 , . . . , wn ; θ
log f (w ; )= t
θ ).
t=1
Let r n (θ ) be the score vector of this log likelihood function. That is, r n (θ ) is the p-dimensional gradient of L . In Chapter 1, we defined the Cramer-Rao bound 1
to be the inverse of E[ rn (θ 0 )rn (θ 0 ) ]. Define the asymptotic Cramer-Rao bound as the inverse of J ≡ lim
1
n→∞
n
E[rn (θ0 )rn (θ 0 ) ].
Assume that all the conditions for the consistency and asymptotic normality of the (unconditional) maximum likelihood estimator are satisfied. Show that the asymptotic variance matrix of the ML estimator equals the asymptotic CramerRao bound. The answer is as follows. Define s (wt ; θ ) as the gradient of log f (wt ; θ ). Then n
s(w ; r ( )= n
θ
t
θ ).
t=1
Since E[s(wt ; θ 0 )] = 0 and { s(wt ; θ 0 )} is i.i.d., we have n
E[rn (θ 0 )rn (θ 0 ) ] = Var(rn (θ 0
Var(s(w ; )) = t
θ 0 )) = n
· E[s(wt ; θ0 )s(wt ; θ0 ) ].
t=1
By the information matrix equality, it follows that 1 n
E[ rn (θ 0 )rn (θ 0 ) ] = − E[H(wt ; θ 0 )],
where H(wt ; θ ) is the hessian of the log likelihood for observation t. Therefore, trivially, the limit as n → ∞ of n1 E[rn (θ 0 )rn (θ 0 ) ] is − E[H(wt ; θ 0 )], which is the inverse of the asymptotic variance matrix.
2
updated: 11/23/00
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 8 Section 8.1
1(a) Deriving the score should be easy. Differentiating the score with respect to θ and rearranging, you should obtain yt − 2yt F t + F t2 2 yt − F t − x x + f f t xt xt . t t t 2 [F t · (1 − F t )] F t · (1 − F t )
Since yt is either 0 or 1, we have yt = yt2 . So yt − 2yt F t + F t2 , which is the numerator in the first term, equals yt2 − 2yt F t + F t2 = (yt − F t )2 . Section 8.3
2. Since λ (−v ) + v ≥ 0 for all − v , the coefficients of the two matrices in (8.3.12) are nonpositive. So the claim is proved if the two matrices are both positive semi-definite. The hint makes clear that they are. 3. Yes, because even if the data are not i.i.d., the conditional ML estimator is still an M-estimator. Section 8.5
2. Since |Γ0 | = 0, the reduced form (8.5.9) exists. Since xtK does not appear in any of the structural-form equations, the last column of B0 is a zero vector, and so for any m the m-th reduced form is that ytm is a linear function of x t1 , . . . , xt,K 1 and v tm . Since x tK is predetermined, it is orthogonal to any element of the reduced-form disturvance vector vt . Therefore, in the least square projection of y tm on xt , the coefficient of x tK is zero. −
1
May 30, 2004
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 9 Section 9.1
√
1 the hint, the long-run variance equals Var((uT − u0 )/ T ) = T Var(uT − u0 ). Var(uT − u0 ) = Var(uT )+Var(u0 ) − 2ρ(uT , u0 ) Var(uT ) Var(u0 ). Since the correlation coefficient ρ(uT , u0 ) is less than or equal to 1 in absolute value and since Var( uT ) and Var(u0 ) are finite, Var(uT − u0 is finite.
1. By
Section 9.2 3.
α0 = 1, α 1 =
−1, and α j = 0 for j = 2, 3,....
So η t = ε t − εt
1
−
.
Section 9.3 1.
T 1 η (ρ − 1) = T 1 T (ρ − 1). T (ρ − 1) converges in distribution to a random variable. Use Lemma 2.4(b). −
2. This
η
follows immediately from Proposition 9.2(a),(b), and (9.3.3).
T
∆yt is ergodic stationary (actually, iid here), T 1 1 t=1 (∆yt )2 →p E(∆yt ). By Proposition 9.3, [T · (ρ − 1)] converges in distribution to a random variable, and by Proposition T 1 9.2(b), T t=1 ∆yt yt 1 converges in distribution to a random variable. So the second term converges in probability to zero. Use a similar argument to show that the third term vanishes.
3. Since
−
−
4.
∆yt is stationary, so for the t-value from the first regression you should use the standard normal. The t-value from the second regression is numerically equal to (9.3.7). So use DF t .
5.
(a) As remarked on page 564, an I(0) process is ergodic stationary. So by the ergodic theorem
1 T
ρ =
where γ 0 = E(yt2 ) and γ 1 = E(yt yt
T t=1 yt yt−1 T 1 2 t=1 yt T
1
−
→p γ γ 1 , 0
). By assumption, γ 0 > γ 1 .
(b) It should be easy to show that 2
s So
ρ
√
T · t =
s
→p
− ÷ 1 T
2(γ 02 − γ 12 ) > 0. γ 0 1
T 2 t=1 (yt−1 )
1
=−
γ 0 − γ 1 < 0. 2(γ 0 − γ 1 )
7.
(a)
SB times T is
the reciprocal of DW with y t interpreted as the regression residual.
(b) The denominator of SB converges in distribution to E[(∆yt )2 ] = γ 0 . By Proposition 1 9.2(a), the numerator converges in distribution to λ 2 0 [W (r)]2 dr. Here, λ2 = γ 0 .
(c) If y t is I(0), T · SB → p
E(yt2 ) . E[(∆yt )2 ]
Section 9.4 1.
a1 = φ1 + φ2 + φ 3 , a2 = − φ2 , a3 = − φ3 . If yt is driftless I(1) following (9.4.1), then yt 1 is driftless I(1) while yt 1 − yt 2 and yt 1 − yt 3 is zero-mean I(0). a ≡ (a1 , a2 , a3 ) is a linear and non-singular transformation of φ ≡ (φ1 , φ2 , φ3 ) (that is, a = Fφ for some non-singular matrix F). So if φ is the OLS estimate of φ, then Fφ is the OLS estimate of a. (ρ, ζ 1 , ζ 2 ) from (9.4.3) with p = 2 is also a linear and non-singular transformation of φ. ρ = a 1 = φ 1 + φ2 + φ3 . −
−
−
−
−
2. Just
apply the mean value theorem to φ(z).
3. The
hint is almost the answer. In the final step, use the fact that
4. (a)
1 T
T 2 t=1 (∆yt )
→p γ 0 .
The hint is the answer. (b) Use Billingsley’s CLT. ∆yt 1 is a function of (εt 1 , εt 2 ,...). So ∆yt 1 and ε t are independently distributed, and E[(∆yt 1 εt )2 ] = E[(∆yt 1 )2 ] E(ε2t ) = γ 0 σ 2 . −
−
5. The
−
−
−
−
hint is the answer.
6. The
hint is almost the answer. We have shown in Review Question 3 to Section 9.3 that s →p σ 2 . It has been shown on page 588 that the (2,2) element of AT 1 converges in probability to γ 0 1 . 2
−
−
2
Nov. 22, 2003, revised Dec. 27, 2003
Hayashi Econometrics
Solution to Chapter 1 Analytical Exercises 1. (Reproducing the answer on p. 84 of the book)
(y−Xβ ) (y − Xβ ) = [(y − Xb) + X(b − β )] [(y − Xb) + X(b − β)] (by the add-and-subtract strategy)
= [( y − Xb) + (b − β ) X ][(y − Xb) + X(b − β )]
= ( y − Xb) (y − Xb) + ( b − β ) X (y − Xb)
+ (y − Xb) X(b − β) + ( b − β) X X(b − β)
= ( y − Xb) (y − Xb) + 2(b − β) X (y − Xb) + ( b − β) X X(b − β) (since (b − β ) X (y − Xb) = ( y − Xb) X(b − β)) = ( y − Xb) (y − Xb) + ( b − β ) X X(b − β )
(since X (y − Xb) = 0 by the normal equations) ≥ (y − Xb) (y − Xb) n
(since (b − β ) X X(b − β ) = z z =
zi2 ≥ 0 where z ≡ X(b − β)).
i=1
2. (a), (b). If X is an n × K matrix of full column rank, then X X is symmetric and invertible. It is very straightforward to show (and indeed you’ve been asked to show in the text) that MX ≡ In − X(X X)−1 X is symmetric and idempotent and that M XX = 0 . In this question, set X = 1 (vector of ones). (c) M1 y = [In − 1(1 1)−1 1 ]y
1 11 y (since 1 1 = n) n n 1 = y − 1 yi = y − 1· y n = y −
i=1
(d) Replace “ y” by “X” in (c). 3. Special case of the solution to the next exercise. 4. From the normal equations (1.2.3) of the text, we obtain (a)
X1 X2
. [X1 .. X 2 ]
b1 b2
=
X1 X2
y.
Using the rules of multiplication of partitioned matrices, it is straightforward to derive ( ∗) and (∗∗) from the above. 1
(b) By premultiplying both sides of (∗) in the question by X1 (X1 X1 )−1 , we obtain X1 (X1 X1 )−1 X1 X1 b1 = −X1 (X1 X1 )−1 X1 X2 b2 + X1 (X1 X1 )−1 X1 y
⇔
X1 b1 = −P1 X2 b2 + P1 y
Substitution of this into ( ∗∗) yields X2 (−P1 X2 b2 + P1 y) + X2 X2 b2 = X 2 y
⇔
X2 (I − P1 )X2 b2 = X 2 (I − P1 )y
⇔ ⇔
X2 M1 X2 b2 = X 2 M1 y
X2 M1 M1 X2 b2 = X 2 M1 M1 y (since M 1 is symmetric & idempotent)
X2 X2 b2 = X2 y.
⇔
Therefore,
b2 = (X2 X2 )−1 X2 y
(The matrix X2 X2 is invertible because X2 is of full column rank. To see that X2 is of full column rank, suppose not. Then there exists a non-zero vector c such that X2 c = 0 . But
X2 c = X 2 c − X1 d where d ≡ (X1 X1 )−1 X1 X2 c. That is, Xπ = 0 for π ≡
−d . This is c
. a contradiction because X = [X1 .. X2 ] is of full column rank and π = 0 .) (c) By premultiplying both sides of y = X 1 b1 + X2 b2 + e by M 1 , we obtain M1 y = M 1 X1 b1 + M1 X2 b2 + M1 e.
Since M1 X1 = 0 and y ≡ M1 y, the above equation can be rewritten as
y = M 1 X2 b2 + M1 e
= X2 b2 + M1 e.
M1 e = e because
M1 e = (I − P1 )e
= e − P1 e = e − X1 (X1 X1 )−1 X1 e = e (since X1 e = 0 by normal equations). (d) From (b), we have
b2 = (X2 X2 )−1 X2 y
= (X2 X2 )−1 X2 M1 M1 y = (X2 X2 )−1 X2 y.
Therefore, b2 is the OLS coefficient estimator for the regression y on X2 . The residual vector from the regression is
y − X2 b2 = (y − y) + ( y − X2 b2 )
= (y − M1 y) + ( y − X2 b2 ) = (y − M1 y) + e (by (c)) = P 1 y + e. 2
This does not equal e because P1 y is not necessarily zero. The SSR from the regression of y on X2 can be written as
(y − X2 b2 ) (y − X2 b2 ) = ( P1 y + e) (P1 y + e) = ( P1 y) (P1 y) + e e (since P 1 e = X 1 (X1 X1 )−1 X1 e = 0 ).
This does not equal e e if P1 y is not zero.
(e) From (c), y = X2 b2 + e. So
y y = (X2 b2 + e) (X2 b2 + e)
= b 2 X2 X2 b2 + e e (since X2 e = 0 ).
Since b2 = (X2 X2 )−1 X2 y, we have b 2 X2 X2 b2 = y X2 (X2 M1 X2 )−1 X2 y. (f)
(i) Let b1 be the OLS coefficient estimator for the regression of y on X 1 . Then
b1 = (X1 X1 )−1 X1 y
= ( X1 X1 )−1 X1 M1 y
= ( X1 X1 )−1 (M1 X1 ) y = 0 (since M1 X1 = 0 ).
So S SR 1 = (y − X1 b1 ) (y − X1 b1 ) = y y. (ii) Since the residual vector from the regression of y on X2 equals e by (c), SSR 2 = e e. (iii) From the Frisch-Waugh Theorem, the residuals from the regression of y on X1 and X2 equal those from the regression of M 1 y (= y) on M 1 X2 (= X2 ). So SSR 3 = e e.
5. (a) The hint is as good as the answer.
(b) Let ε ≡ y −Xβ, the residuals from the restricted regression. By using the add-and-subtract strategy, we obtain
ε ≡ y − Xβ = (y − Xb) + X(b − β).
So
SSRR = [(y − Xb) + X(b − β)] [(y − Xb) + X(b − β )]
= ( y − Xb) (y − Xb) + ( b − β ) X X(b − β) But S SR U = ( y − Xb) (y − Xb), so
(since X (y − Xb) = 0 ).
SS RR − SSRU = ( b − β ) X X(b − β )
= (Rb − r) [R(X X)−1 R ]−1 (Rb − r)
(using the expresion for β from (a))
= λ R(X X)−1 R λ
(using the expresion for λ from (a))
= ε X(X X)−1 X ε
(by the first order conditions that X (y − Xβ) = R λ)
= ε Pε. (c) The F -ratio is defined as F ≡
(Rb − r) [R(X X)−1 R ]−1 (Rb − r)/r s2 3
(where r = #r)
(1.4.9)
Since (Rb − r) [R(X X)−1 R ]−1 (Rb − r) = SSR R − SS RU as shown above, the F -ratio can be rewritten as (SS RR − SSRU )/r s2 (SS RR − SSRU )/r = e e/(n − K ) (SS RR − SSRU )/r = SSRU /(n − K )
F =
Therefore, (1.4.9)=(1.4.11). 6. (a) Unrestricted model : y = Xβ + ε, where
y (N ×1)
=
y1 .. .
,
X
(N ×K )
yn
=
1 .. .
x12 . . . x1K .. .. ... . . 1 xn2 . . . xnK
,
=
β (K ×1)
β 1 .. .
.
β n
Restricted model: y = Xβ + ε, Rβ = r , where
R
((K −1)×K )
=
0 1 0 0 0 1 .. .. . . 0 0
... 0 ... 0 .. . 1
,
r
((K −1)×1)
=
0 .. .
.
0
Obviously, the restricted OLS estimator of β is
β
(K ×1)
=
y 0 .. . 0
.
So Xβ =
y y .. . y
= 1 · y.
(You can use the formula for the unrestricted OLS derived in the previous exercise, β = b − (X X)−1 R [R(X X)−1 R ]−1 (Rb − r), to verify this.) If SSRU and SS RR are the minimized sums of squared residuals from the unrestricted and restricted models, they are calculated as n
(yi − y)2
SSRR = (y − Xβ) (y − Xβ) =
i=1
n
SS RU = ( y − Xb) (y − Xb) = e e =
e2i
i=1
Therefore, n
SS RR − SSRU =
i=1
4
n
2
(yi − y) −
i=1
e2i .
(A)
On the other hand,
(b − β) (X X)(b − β ) = (Xb − Xβ) (Xb − Xβ) n
(yi − y)2 .
=
i=1
Since S SR R − SSRU = (b − β) (X X)(b − β) (as shown in Exercise 5(b)), n
n
2
n
e2i
(yi − y) −
i=1
(yi − y)2 .
=
i=1
(B)
i=1
(b) F = = =
(SSRR − SSRU )/(K − 1) n 2 i=1 ei /(n − K )
(
n i=1 (yi
− y)2 −
n 2 i=1 ei )/(K −
n 2 i=1 ei /(n
(yi −y)2 /(K −1) n (yi −y)2 i=1 n e 2i /(n−K ) i=1 n (yi −y)2 i=1
1)
(by equation (A) above)
− K )
n 2 i=1 (yi − y) /(K − 1) n 2 i=1 ei /(n − K ) n
(by Exercise 5(c))
(by equation (B) above) n
i=1
=
=
R2 /(K − 1) (1 − R2 )/(n − K )
(by dividing both numerator & denominator by
(yi − y)2 )
i=1
(by the definition or R2 ).
7. (Reproducing the answer on pp. 84-85 of the book)
(a) βGLS − β = Aε where A ≡ (X V−1 X)−1 X V−1 and b − β GLS = Bε where B ≡ (X X)−1 X − (X V−1 X)−1 X V−1 . So
Cov(β GLS − β, b − β GLS) = Cov(Aε, Bε) = A Var(ε)B = σ 2 AVB . It is straightforward to show that AVB = 0 . (b) For the choice of H indicated in the hint,
Var(β ) − Var(β GLS) = −CVq−1 C . If C = 0 , then there exists a nonzero vector z such that C z ≡ v = 0 . For such z ,
z [Var(β) − Var(βGLS)]z = −v Vq−1 v < 0
which is a contradiction because β GLS is efficient.
5
(since Vq is positive definite),
Nov. 25, 2003, Revised February 23, 2010
Hayashi Econometrics
Solution to Chapter 2 Analytical Exercises 1. For any ε > 0,
1 n
Prob( zn > ε) =
| |
→ 0
as n
→ ∞.
So, plim zn = 0. On the other hand, E(zn ) = which means that limn→∞ E(zn ) =
n
− 1 · 0 + 1 · n2 = n, n
n
∞.
2. As shown in the hint, (z n
− µ)2 = (z − E(z n
2
n ))
+ 2(z n
− E(z
n ))(E(z n )
− µ) + (E(z ) − µ)2. n
Take the expectation of both sides to obtain E[(z n
− µ)2] = E[(z − E(z n
2
n ))
] + 2E[z n
= Var(z n ) + (E(z n )
− µ)2
− E(z
− µ) + (E(z ) − µ)2 − E(z )] = E(z ) − E(z
n )](E(z n )
(because E[z n
n
n
n
n)
= 0).
Take the limit as n
→ ∞ of both sides to obtain lim E[(z − µ)2 ] = lim Var(z ) + lim (E(z ) − µ)2
n→∞
n
n→∞
n
n
n→∞
= 0 (because lim E(z n ) = µ, lim Var(z n ) = 0). n→∞
n→∞
Therefore, zn
→m.s. µ. By Lemma 2.2(a), this implies z →p µ. n
3. (a) Since an i.i.d. process is ergodic stationary, Assumption 2.2 is implied by Assumption 2.2 . Assumptions 2.1 and 2.2 imply that gi xi εi is i.i.d. Since an i.i.d. process with mean zero is mds (martingale differences), Assumption 2.5 is implied by Assumptions 2.2 and 2.5 .
≡ ·
(b) Rewrite the OLS estimator as
− β = (X X) 1X ε = S 1 g . (A) {x } is i.i.d., {x x } is i.i.d. So by Kolmogorov’s Second Strong
b
Since by Assumption 2.2 LLN, we obtain
−
−
xx
i
i i
→p Σ
S
xx
xx
The convergence is actually almost surely, but almost sure convergence implies convergence in probability. Since Σ is invertible by Assumption 2.4, by Lemma 2.3(a) we get xx
S−1 xx
1
→p Σ
1
−
xx
.
Similarly, under Assumption 2.1 and 2.2 gi is i.i.d. By Kolmogorov’s Second Strong LLN, we obtain
{ }
→p E(g ),
g
i
which is zero by Assumption 2.3. So by Lemma 2.3(a), S−1 g xx
→p Σ 1· 0 = 0. −
xx
Therefore, plimn→∞(b β) = 0 which implies that the OLS estimator b is consistent. Next, we prove that the OLS estimator b is asymptotically normal. Rewrite equation(A) above as
−
√ n(b − β) = S 1 √ ng. −
xx
As already observed, gi is i.i.d. with E(gi ) = 0 . The variance of gi equals E(gi gi ) = S since E(gi ) = 0 by Assumption 2.3. So by the Lindeberg-Levy CLT,
{ }
√ ng → N (0, S). d
→p Σ 1. Thus by Lemma 2.4(c), √ n(b − β) → N (0, Σ 1 S Σ 1). d
Furthermore, as already noted, S −1
−
xx
xx
−
−
xx
xx
4. The hint is as good as the answer. 5. As shown in the solution to Chapter 1 Analytical Exercise 5, SSR R SSRR
− SSR
U
= (Rb
−
− r) [R(X X)
1
U can
− SSR 1 (Rb − r).
be written as
R ]−
Using the restrictions of the null hypothesis, Rb
− r = R(b − β)
= R (X X)−1 X ε (since b = RS
−
1
xx
g (where g
1 n
≡
−
− β = (X X) x · ε .).
1
X ε)
n
i
i
i=1
Also [R(X X)−1 R]−1 = n [RS−1 R]−1 . So
·
SSRR
xx
− SSR
U
√
√
= ( n g) S−1 R (R S−1 R )−1 R S−1 ( n g). xx
xx
xx
Thus SSRR
− SSR
U
s2
√
√
= ( n g) S−1 R (s2 R S−1 R )−1 R S−1 ( n g) xx
xx
xx
1 = z n A− n zn ,
where
≡ R S 1(√ n g),
≡ s2 R S
−
zn
An
xx
By Assumption 2.2, plim S Lemma 2.4(c), we have:
= Σ
xx
xx
1
−
xx
R .
. By Assumption 2.5, −
→d N (0, RΣ
zn
1
xx
2
SΣ −1 R ). xx
√ ng →
d
N (0, S). So by
But, as shown in (2.6.4), S = σ 2 Σ under conditional homoekedasticity (Assumption 2.7). So the expression for the variance of the limiting distribution above becomes xx
RΣ−1 SΣ −1 R = σ 2 RΣ−1 R xx
xx
xx
≡ A.
Thus we have shown:
→d z, z ∼ N (0, A).
zn
2 2 As already observed, S p Σ . By Assumption 2.7, σ = E(εi ). So by Proposition 2.2, s2 p σ2 . Thus by Lemma 2.3(a) (the “Continuous Mapping Theorem”), An p A. Therefore, by Lemma 2.4(d), 1 zn A− z A−1 z. n zn xx
→
→
xx
→
→d
But since Var(z) = A , the distribution of z A−1 z is chi-squared with #z degrees of freedom. 6. For simplicity, we assumed in Section 2.8 that yi , xi is i.i.d. Collecting all the assumptions made in Section 2.8,
{
}
(i) (linearity) y i = x i β + εi . (ii) (random sample) yi , xi is i.i.d.
{
}
(iii) (rank condition) E( xi xi ) is non-singular. (iv) E(ε2i xi xi ) is non-singular. (v) (stronger version of orthogonality) E(εi xi ) = 0 (see (2.8.5)).
|
(vi) (parameterized conditional heteroskedasticity) E(ε2i xi ) = z i α.
|
These conditions together are stronger than Assumptions 2.1-2.5. (a) We wish to verify Assumptions 2.1-2.3 for the regression equation (2.8.8). Clearly, Assumption 2.1 about the regression equation (2.8.8) is satisfied by (i) about the original regression. Assumption 2.2 about (2.8.8) (that ε2i , xi is ergodic stationary) is satisfied by (i) and (ii). To see that Assumption 2.3 about (2.8.8) (that E(zi ηi ) = 0) is satisfied, note first that E(ηi xi ) = 0 by construction. Since zi is a function of xi , we have E(ηi zi ) = 0 by the Law of Iterated Expectation. Therefore, Assumption 2.3 is satisfied. The additional assumption needed for (2.8.8) is Assumption 2.4 that E( zi zi ) be nonsingular. With Assumptions 2.1-2.4 satisfied for (2.8.8), the OLS estimator α is consistent by Proposition 2.1(a) applied to (2.8.8).
{
}
|
|
(b) Note that α
α = (α
α)
(α
− − −∗∗ −
α) and use the hint.
(c) Regarding the first term of ( ), by Kolmogorov’s LLN, the sample mean in that term converges in probability to E(xi εi zi ) provided this population mean exists. But E(xi εi zi ) = E[zi xi E(εi zi )].
· ·
|
By (v) (that E(εi xi ) = 0) and the Law of Iterated Expectations, E( εi zi ) = 0. Thus E(xi εi zi ) = 0. Furthermore, plim(b β ) = 0 since b is consistent when Assumptions 2.1-2.4 (which are implied by Assumptions (i)-(vi) above) are satisfied for the original regression. Therefore, the first term of ( ) converges in probability to zero. Regarding the second term of ( ), the sample mean in that term converges in probability to E(x2i zi ) provided this population mean exists. Then the second term converges in probability to zero because plim(b β ) = 0.
|
|
−
∗∗ −
∗∗
3
√ n,
(d) Multiplying both sides of ( ) by
∗
√ · − √ −
√ n(α − α) = 1 = n
1 n
n
n
1
−
zi zi
i=1
1
1 n
−
zi zi
2 n(b
i=1
n
zi vi
i=1
1 β ) n
n
i=1
√ 1 x ε z + n(b − β )· (b − β ) i i i
n
n
x2i zi .
i=1
Under Assumptions 2.1-2.5 for the original regression (which are implied by Assumptions (i)-(vi) above), n(b β ) converges in distribution to a random variable. As shown in (c), n1 n p 0. So by Lemma 2.4(b) the first term in the brackets vanishes i=1 xi εi zi 2 (converges to zero in probability). As shown in (c), (b β ) n1 n i=1 xi zi vanishes provided E(x2i zi ) exists and is finite. So by Lemma 2.4(b) the second term, too, vanishes. Therefore, n(α α) vanishes, provided that E(zi zi ) is non-singular.
√ − →
√ −
−
7. This exercise is about the model in Section 2.8, so we continue to maintain Assumptions (i)(vi) listed in the solution to the previous exercise. Given the hint, the only thing to show is that the LHS of ( ) equals Σ −1 S Σ−1 , or more specifically, that plim n1 X VX = S . Write S as
∗∗
xx
xx
S = E(ε2i xi xi )
= E[E(ε2i xi )xi xi ]
|
= E(zi α xi xi )
(since E(ε2i xi ) = z i α by (vi)).
|
Since xi is i.i.d. by (ii) and since zi is a function of xi , zi αxi xi is i.i.d. So its sample mean converges in probability to its population mean E( zi α xi xi ), which equals S. The sample mean can be written as 1 n
n
zi αxi xi
i=1
1 = n =
n
vi xi xi
(by the definition of v i , where vi is the i-th diagonal element of V )
i=1
1 X VX. n
8. See the hint. 9. (a) E(gt gt−1 , gt−2 , . . . , g2 ) = E[E(gt εt−1 , εt−2 , . . . , ε1 ) gt−1 , gt−2 , . . . , g2 ] (by the Law of Iterated Expectations) = E[E(εt εt−1 εt−1 , εt−2 , . . . , ε1 ) gt−1 , gt−2 , . . . , g2 ] = E[εt−1 E(εt εt−1 , εt−2 , . . . , ε1 ) gt−1 , gt−2 , . . . , g2 ] (by the linearity of conditional expectations) =0 (since E(εt εt−1 , εt−2 , . . . , ε1 ) = 0).
|
| ·
| |
|
|
| |
4
(b) E(gt2 ) = E(ε2t ε2t−1 )
·
= E[E(ε2t ε2t−1 εt−1 , εt−2 , . . . , ε 1 )]
· | = E[E(ε |ε 1 , ε 2 t
(by the Law of Total Expectations)
2
t−2 , . . . , ε1 )εt−1 ]
t−
= E(σ 2 ε2t−1 )
(by the linearity of conditional expectations)
(since E(ε2t εt−1 , εt−2 , . . . , ε1 ) = σ 2 )
|
= σ 2 E(ε2t−1 ). But
E(ε2t−1 ) = E[E(ε2t−1 εt−2 , εt−3 , . . . , ε1 )] = E(σ 2 ) = σ 2 .
|
(c) If εt is ergodic stationary, then εt εt−1 is ergodic stationary (see, e.g., Remark 5.3 on p. 488 of S. Karlin and H. Taylor, A First Course in Stochastic Processes , 2nd. ed., Academic Press, 1975, which states that “For any function φ, the sequence Y n = φ(X n , X n+1 , . . . ) generates an ergodic stationary process whenever X n is ergodic stationary”.) Thus the Billingsley CLT (see p. 106 of the text) is applicable to nγ1 = n n1 n t=j +1 gt .
{ }
{ ·
}
{ } √
√
(d) Since ε 2t is ergodic stationary, γ0 converges in probability to E(ε2t ) = σ 2 . As shown in (c), 4 nγ1 n γ d N (0, σ ). So by Lemma 2.4(c) d N (0, 1). γ
√ →
√ → 1
0
10. (a) Clearly, E(yt ) = 0 for all t = 1, 2, . . . .
Cov(yt , yt−j ) =
(1 + θ12 + θ22 )σε2 (θ1 + θ1 θ2 )σε2 θ2 σε2 0
for j = 0 for j = 1, for j = 2, for j > 2,
So neither E(yt ) nor Cov(yt , yt−j ) depends on t. (b) E(yt yt−j , yt−j −1 , . . . , y0 , y−1 ) = E(yt εt−j , εt−j −1 , . . . , ε0 , ε−1 ) (as noted in the hint) = E(εt + θ1 εt−1 + θ2 εt−2 εt−j , εt−j −1 , . . . , ε0 , ε−1 )
|
=
|
|
εt + θ1 εt−1 + θ2 εt−2 θ1 εt−1 + θ2 εt−2 θ2 εt−2 0
for for for for
which gives the desired result.
5
j = 0, j = 1, j = 2, j > 2,
(c)
√
Var( n y) = =
1 [Cov(y1 , y1 + n
··· + y
n)
··· + Cov(y
n , y1 +
··· + y
1 [(γ 0 + γ 1 + + γ n−2 + γ n−1 ) + (γ 1 + γ 0 + γ 1 + n + + (γ n−1 + γ n−2 + + γ 1 + γ 0 )]
···
···
=
+
1 [nγ 0 + 2(n n
··· + γ
n−2 )
···
− 1)γ 1 + ··· + 2(n − j)γ + ··· + 2γ
n−1 ]
j
−
n−1
= γ 0 + 2
n )]
j γ j . n
1
j =1
(This is just reproducing (6.5.2) of the book.) Since γ j = 0 for j > 2, one obtains the desired result. (d) To use Lemma 2.1, one sets z n = ny. However, Lemma 2.1, as stated in the book, inadvertently misses the required condition that there exist an M > 0 such that E( zn s+δ ) < M for all n for some δ > 0. Provided this technical condition is satisfied, the variance of the limiting distribution of ny is the limit of Var( ny), which is γ 0 + 2(γ 1 + γ 2 ).
√
√
| |
√
11. (a) In the auxiliary regression, the vector of the dependent variable is e and the matrix of . regressors is [ X .. E]. Using the OLS formula,
1
α = B
1
−
n
X e
1
E e n
.
X e = 0 by the normal equations for the original regression. The j -th element of n1 E e is
1 (ej +1 e1 + n
···
1 + en en−j ) = nt
n
et et−j .
=j +1
which equals γj defined in (2.10.9). (b) The j-th column of n1 X E is n1 n t=j +1 xt et−j (which, incidentally, equals µj defined on p. 147 of the book). Rewrite it as follows.
1 nt
n
·
· − − · − xt et−j
=j +1
1 nt
n
1 = nt
n
=
xt (εt−j
xt−j (b
xt εt−j
1 nt
β ))
=j +1
=j +1
n
xt xt−j
(b
=j +1
The last term vanishes because b is consistent for β . Thus n1 probability to E(xt εt−j ). The (i, j) element of the symmetric matrix n1 E E is, for i
·
1 (e1+i−j e1 + n
···
1 + en−j en−i ) = nt
n t=j +1
xt et−j converges in
≥ j,
n−j
=1+i−j
6
− β)
et et−(i−j ) .
·
Using the relation et = ε t
− x (b − β), this can be rewritten as
t
n−j
1 nt
εt εt−(i−j )
=1+i−j
−
− (b −
n−j
1 nt
(xt εt−(i−j ) + xt−(i−j ) εt ) (b
=1+i−j
1 β) nt
n−j
xt xt−(i−j ) (b
=1+i−j
− β)
− β).
The type of argument that is by now routine (similar to the one used on p. 145 for (2.10.10)) shows that this expression converges in probability to γ i−j , which is σ 2 for i = j and zero for i = j.
(c) As shown in (b), plim B = B. Since Σ is non-singular, B is non-singular. So B−1 converges in probability to B−1 . Also, using an argument similar to the one used in (b) for showing that plim n1 E E = I p , we can show that plim γ = 0 . Thus the formula in (a) shows that α converges in probability to zero. xx
1
(d) (The hint should have been: “ n E e = γ . Show that
SSR
n
=
1 n
ee
the auxiliary regression can be written as
−
α
0 .” The SSR from γ
. . 1 1 SSR = (e [X .. E]α) (e [X .. E ]α) n n . 1 = (e [X .. E]α) e (by the normal equation for the auxiliary regression) n . 1 1 = e e α [X .. E] e n n
− − −
1 = e e n =
1 ee n
−
− − − − ≡ ≡ − − 1
α
n
X e
1
n
α
E e
0
γ
(since X e = 0 and
1 E e = γ ). n
As shown in (c), plim α = 0 and plim γ = 0 . By Proposition 2.2, we have plim n1 e e = σ 2 . Hence SSR/n (and therefore SSR /(n K p)) converges to σ 2 in probability. (e) Let
R
0
( p×K )
.. . I p
. [X .. E ].
, V
The F -ratio is for the hypothesis that R α = 0 . The F -ratio can be written as 1
−
(Rα) R(V V)−1 R (Rα)/p F = . SSR/(n K p)
7
( )
∗
Using the expression for α in (a) above, R α can be written as
Rα =
.. . I p
0
( p×K )
=
0
(K ×1)
1
−
B
γ
( p×1)
.. . I p
0
( p×K )
= B22 γ .
B11 (K ×K ) B21 ( p×K )
B12 (K × p) B22 ( p× p)
0
(K ×1)
γ
( p×1)
( )
∗∗
Also, R (V V)−1 R in the expression for F can be written as R(V V)−1 R =
1 R B−1 R n
1 = n
0
( p×K )
(since .. . I p
1 V V = B) n
B11 (K ×K ) B21 ( p×K )
B12 (K × p) B22 ( p× p)
0
(K × p)
I p
1 22 B . n Substitution of ( ) and ( ) into ( ) produces the desired result. (f) Just apply the formula for partitioned inverses. =
∗∗∗
√ − √ · −
∗∗
(
∗ ∗ ∗)
∗
→ − − → → − −
(g) Since nρ nγ /σ2 p 0 and Φ p Φ, it should be clear that the modified Box-Pierce Q (= n ρ (I p Φ)−1 ρ) is asymptotically equivalent to n γ (I p Φ)−1 γ /σ 4 . Regarding the pF statistic given in (e) above, consider the expression for B22 given in (f) above. Since the j -th element of n1 X E is µj defined right below (2.10.19) on p. 147, we have
→
1 1 E X S−1 XE , n n
s2 Φ =
so
xx
B22 =
1 EE n
As shown in (b), n1 E E p σ 2 I p . Therefore, B22 cally equivalent to n γ (I p Φ)−1 γ /σ 4 .
1
−
s2 Φ
.
1 p σ2 (I p
Φ)−1 , and pF is asymptoti-
12. The hints are almost as good as the answer. Here, we give solutions to (b) and (c) only. (b) We only prove the first convergence result. 1 n
r
xt xt =
t=1
r n
1 r
r
xt xt = λ
t=1
The term in parentheses converges in probability to Σ (c) We only prove the first convergence result. r
√ 1 n
t=1
xt εt =
·
xx
1 r
r
xt xt
.
t=1
as n (and hence r) goes to infinity.
√ · √ √ · r n
1 r
r
xt εt =
t=1
λ
1 r
r
xt εt
.
t=1
The term in parentheses converges in distribution to N (0, σ 2 Σ ) as n (and hence r) goes to infinity. So the whole expression converges in distribution to N (0, λ σ 2 Σ ). xx
xx
8
December 27, 2003
Hayashi Econometrics
Solution to Chapter 3 Analytical Exercises 1. If A is symmetric and idempotent, then A = A and AA = A. So x Ax = x AAx = x A Ax = z z 0 where z Ax.
≥
≡
2. (a) By assumption, xi , εi is jointly stationary and ergodic, so by ergodic theorem the first term of ( ) converges almost surely to E( x2i ε2i ) which exists and is finite by Assumption 3.5. (b) zi x2i εi is the product of x i εi and x i zi . By using the Cauchy-Schwarts inequality, we obtain
{
∗
}
|≤
E(x2i ε2i ) E(x2i zi2 ).
E( xi εi xi zi )
|
·
E(x2i ε2i ) exists and is finite by Assumption 3.5 and E( x2i zi2 ) exists and is finite by Assumption 3.6. Therefore, E( xi zi xi εi ) is finite. Hence, E(xi zi xi εi ) exists and is finite. (c) By ergodic stationarity the sample average of z i x2i εi converges in probability to some finite number. Because δ is consistent for δ by Proposition 3.1, δ δ converges to 0 in probability. Therefore, the second term of ( ) converges to zero in probability. (d) By ergodic stationarity and Assumption 3.6 the sample average of z i2 x2i converges in probability to some finite number. As mentioned in (c) δ δ converges to 0 in probability. Therefore, the last term of ( ) vanishes.
|
·
|
·
−
∗
−
∗
3. (a) Q
≡
= =
−1 Σxz WΣxz xz WΣxz (Σxz WSWΣxz ) −1 −1 Σxz C CΣxz C WΣxz )−1 Σxz WΣxz xz WΣxz (Σxz WC H H Σxz WΣxz (G G)−1 Σxz WΣxz −1
1
−
Σxz S
Σxz
−Σ −Σ
− H H − H G(G G) G H H [I − G(G G) G ]H
= = K = H MG H.
1
−
(b) First, we show that M G is symmetric and idempotent. MG
MG MG
= = = =
−
−
−
−
1
IK
IK
−
1
−
MG .
= IK IK G(G G) 1 G IK = IK G(G G) 1 G = MG .
−
1
− G(G(G G) ) − G((G G) G ) − G(G G) G
IK
− I
K
G(G G)
1
−
1
−
G + G(G G)
Thus, MG is symmetric and idempotent. For any L-dimensional vector x ,
x Qx
= x H MG Hx = z MG z (where z Hx) 0 (since M G is positive semidefinite) .
≡
≥
Therefore, Q is positive semidefinite. 1
1
−
G G(G G)
G
4. (the answer on p. 254 of the book simplified) If W is as defined in the hint, then
WSW = W
1
−
and Σxz WΣxz = Σzz A
Σzz .
So (3.5.1) reduces to the asymptotic variance of the OLS estimator. By (3.5.11), it is no smaller than (Σxz S 1 Σxz ) 1 , which is the asymptotic variance of the efficient GMM estimator.
−
−
− − − − − − −
5. (a) From the expression for δ (S 1 ) (given in (3.5.12)) and the expression for gn (δ ) (given in (3.4.2)), it is easy to show that g n (δ (S 1 )) = Bsxy . But Bsxy = Bg because −
−
1
Sxz )
−
1
Sxz )
−
−
−
Bsxy = (IK
Sxz (Sxz S
= (IK
Sxz (Sxz S
Sxz S
1
Sxz S
Sxz )
1
= C C, we obtain B S
1
1
1
−
= (Sxz
Sxz (Sxz S
= (Sxz
Sxz )δ + Bg
−
1
)sxy
1
)(Sxz δ + g)
(since yi = zi δ + εi )
1
Sxz (Sxz S
−
−
−
Sxz S
Sxz )δ + ( IK
1
−
Sxz )
1
−
= Bg.
(b) Since S
1
−
CB
−
1
−
)g
B = B C CB = (CB) (CB). But
= C(IK Sxz (Sxz S 1 Sxz ) 1 Sxz S 1 ) = C CSxz (Sxz C CSxz ) 1 Sxz C C = C A(A A) 1 A C (where A CSxz ) = [IK A(A A) 1 A ]C
Sxz S
− −
≡
−
−
−
−
−
−
≡
MC.
So B S 1 B = (MC) (MC) = C M MC. It should be routine to show that M is symmetric and idempotent. Thus B S 1 B = C MC. The rank of M equals its trace, which is
−
−
trace(M) = trace(IK A(A A) 1 A ) = trace(IK ) trace(A(A A)
−
− − A) ) − trace(A A(A A) )
= trace(IK = K trace(IL ) = K L.
− −
1
−
1
−
(c) As defined in (b), C C = S 1 . Let D be such that D D = S 1 . The choice of C and D is not unique, but it would be possible to choose C so that plim C = D. Now,
−
v
−
≡ √ n(Cg) = C(√ n g).
By using the Ergodic Stationary Martingale Differences CLT, we obtain So
√ → N (0, Avar(v))
v = C( n g)
d
where Avar(v) = = = = 2
DSD
D(D D) 1
−
DD IK .
1
−
1
−
D
D
D
√ n g →
d
N (0, S).
(d)
J (δ (S
1
−
1
−
), S
1
1
−
) = n gn (δ (S
−
1
−
· )) S g (δ (S )) n · (Bg) S (Bg) (by (a)) n · g B S Bg n · g C MCg (by (b)) √ v Mv (since v ≡ nCg). 1
= = = =
n
−
1
−
Since v d N (0, IK ) and M is idempotent, v Mv is asymptotically chi-squared with degrees of freedom equaling the rank of M = K L.
→
1
6. From Exercise 5, J = n g B S
·
−
−
Bg. Also from Exercise 5, Bg = Bsxy .
7. For the most parts, the hints are nearly the answer. Here, we provide answers to (d), (f), (g), (i), and (j). (d) As shown in (c), J 1 = v1 M1 v1 . It suffices to prove that v1 = C1 F C
≡ √ nC g √ = nC F g √ = nC F C Cg √ nCg =C FC
v1
1
v.
1
1
1
1
1
1
−
−
1
−
−
1
= C1 F C
v (since v
≡ √ nCg).
(f) Use the hint to show that A D = 0 if A1 M1 = 0. It should be easy to show that A1 M1 = 0 from the definition of M1 .
(g) By the definition of M in Exercise 5, MD = D A(A A) 1 A D. So MD = D since A D = 0 as shown in the previous part. Since both M and D are symmetric, DM = D M = (MD) = D = D. As shown in part (e), D is idempotent. Also, M is idempotent as shown in Exercise 5. So (M D)2 = M 2 DM MD + D2 = M D. As shown in Exercise 5, the trace of M is K L. As shown in (e), the trace of D is K 1 L. So the trace of M D is K K 1 . The rank of a symmetric and idempotent matrix is its trace.
−
−
−
− −
−
−
−
−
−
g C DCg = g FC1 M1 C1 F g
1
−
1
−
= g1 B1 (S11 )
1
−
B1 F g (since C1 M1 C1 = B1 (S11 )
B1 g1 (since g1 = F g).
From the definition of B1 and the fact that sx B1 sx y . So
1
y
B1 from (a))
= Sx z δ + g1 , it follows that B1 g1 = 1
1
1
−
g1 B1 (S11 )
1
−
B1 g1 = sx y B1 (S11 ) 1
B1 sx
1
−
= sxy FB1 (S11 )
1
y
B1 F sxy
(since s x y = F sxy ) 1
= sxy C DCsxy .
3
1
−
= sxy FC1 M1 C1 F sxy (since B1 (S11 )
B.
(C DC = FC1 M1 C1 F by the definition of D in (d))
= g FB1 (S11 )
1
−
(i) It has been shown in Exercise 6 that g C MCg = sxy C MCsxy since C MC = B S Here, we show that g C DCg = sxy C DCsxy .
B1 = C1 M1 C1 from (a))
(j) M
− D is positive semi-definite because it is symmetric and idempotent.
8. (a) Solve the first-order conditions in the hint for δ to obtain 1 (S WSxz ) 2n xz
−
δ = δ (W)
1
−
R λ.
Substitute this into the constraint Rδ = r to obtain the expression for λ in the question. Then substitute this expression for λ into the above equation to obtain the expression for δ in the question. (b) The hint is almost the answer.
· − − √ √ − √ −
(c) What needs to be shown is that n (δ (W) δ ) (Sxz WSxz )(δ (W) δ ) equals the Wald statistic. But this is immediate from substitution of the expression for δ in (a). 9. (a) By applying (3.4.11), we obtain n(δ 1
δ)
n(δ 1
δ)
(Sxz W1 Sxz )
=
1
Sxz W1
1
Sxz W2
−
−
(Sxz W2 Sxz )
ng.
By using Billingsley CLT, we have
√ ng → N (0, S). d
Also, we have
→
(Sxz W1 Sxz )
√ √ − → − n(δ 1 n(δ 1
δ)
d
δ)
=
(b)
Sxz W1
1
Sxz W2
−
(Sxz W2 Sxz )
Therefore, by Lemma 2.4(c),
1
−
p
Q1 1 Σxz W1 . Q2 1 Σxz W2 −
−
N 0, N 0,
A11 A21
−
−
A12 A22
−
−
.
√ nq can be rewritten as √ nq = √ n(δ − δ ) = √ n(δ − δ) − √ n(δ − δ) =
1
2
1
Therefore, we obtain
2
− √ √ −− 1
1
n(δ 1
δ)
n(δ 2
δ)
√ nq → N (0, Avar(q)). d
where
−
Avar(q) = 1
1
. Q1 1 Σxz W1 1 . ( . W2 Σxz Q2 1 ) S W Σ Q xz 1 1 Q2 1 Σxz W2
A11 A21
A12 A22
1 = A11 + A22 1
−
4
−A −A 12
21 .
.
1
(c) Since W2 = S⁻¹, Q2, A12, A21, and A22 can be rewritten as follows:
Q2 = Σxz′W2Σxz = Σxz′S⁻¹Σxz,
A12 = Q1⁻¹Σxz′W1 S S⁻¹Σxz Q2⁻¹ = Q1⁻¹(Σxz′W1Σxz)Q2⁻¹ = Q1⁻¹Q1Q2⁻¹ = Q2⁻¹,
A21 = Q2⁻¹Σxz′S⁻¹ S W1Σxz Q1⁻¹ = Q2⁻¹(Σxz′W1Σxz)Q1⁻¹ = Q2⁻¹Q1Q1⁻¹ = Q2⁻¹,
A22 = (Σxz′S⁻¹Σxz)⁻¹ Σxz′S⁻¹ S S⁻¹Σxz (Σxz′S⁻¹Σxz)⁻¹ = (Σxz′S⁻¹Σxz)⁻¹ = Q2⁻¹.
Substituting these into the expression for Avar(q) in (b), we obtain
Avar(q) = A11 − Q2⁻¹ = A11 − (Σxz′S⁻¹Σxz)⁻¹ = Avar(δ̂(W1)) − Avar(δ̂(S⁻¹)).
10. (a) σxz ≡ E(xi zi) = E(xi(xi β + vi)) = β E(xi²) + E(xi vi) = β σx² ≠ 0 (by assumptions (2), (3), and (4)).
(b) From the definition of δ̂,
δ̂ − δ = ((1/n) Σ_{i=1}^n xi zi)⁻¹ · (1/n) Σ_{i=1}^n xi εi = sxz⁻¹ · (1/n) Σ_{i=1}^n xi εi.
We have xi zi = xi(xi β + vi) = xi²β + xi vi, which, being a function of (xi, ηi), is ergodic stationary by assumption (1). So by the Ergodic Theorem, sxz →p σxz. Since σxz ≠ 0 by (a), we have sxz⁻¹ →p σxz⁻¹. By assumption (2), E(xi εi) = 0. So by assumption (1), we have (1/n) Σ_{i=1}^n xi εi →p 0. Thus δ̂ − δ →p 0.
(c)
sxz ≡ (1/n) Σ_{i=1}^n xi zi = (1/n) Σ_{i=1}^n (xi²β + xi vi) = (1/√n)·(1/n) Σ_{i=1}^n xi² + (1/n) Σ_{i=1}^n xi vi   (since β = 1/√n)
→p 0·E(xi²) + E(xi vi) = 0.
(d)
√n sxz = (1/n) Σ_{i=1}^n xi² + (1/√n) Σ_{i=1}^n xi vi.
By assumption (1) and the Ergodic Theorem, the first term on the RHS converges in probability to E(xi²) = σx² > 0. Assumption (2) and the Martingale Difference Sequence CLT imply that
(1/√n) Σ_{i=1}^n xi vi →d a ~ N(0, s22).
Therefore, by Lemma 2.4(a), we obtain √n sxz →d σx² + a.
(e) δ̂ − δ can be rewritten as
δ̂ − δ = (√n sxz)⁻¹ · √n g1.
From assumption (2) and the Martingale Difference Sequence CLT, we obtain
√n g1 →d b ~ N(0, s11),
where s11 is the (1, 1) element of S. By using the result of (d) and Lemma 2.3(b),
δ̂ − δ →d (σx² + a)⁻¹ b.
(a, b) are jointly normal because the joint distribution is the limiting distribution of
√n g = [√n g1; √n·((1/n) Σ_{i=1}^n xi vi)].
(f) Because δ̂ − δ converges in distribution to (σx² + a)⁻¹ b, which is not zero, the answer is No.
January 8, 2004, answer to 3(c)(i) simplified, February 23, 2004
Hayashi Econometrics
Solution to Chapter 4 Analytical Exercises
1. It should be easy to show that Amh = (1/n)Zm′PZh and that cmh = (1/n)Zm′Pyh. Going back to the formula (4.5.12) on p. 278 of the book, the first matrix on the RHS (the matrix to be inverted) is a partitioned matrix whose (m, h) block is Amh. It should be easy to see that it equals (1/n)[Z′(Σ̂⁻¹ ⊗ P)Z]. Similarly, the second matrix on the RHS of (4.5.12) equals (1/n)Z′(Σ̂⁻¹ ⊗ P)y.
2. The sprinkled hints are as good as the answer.
3. (b) (amplification of the answer given on p. 320) In this part only, for notational brevity, let zi be a (Σm Lm) × 1 stacked vector collecting (zi1, . . . , ziM).
E(εim | Z) = E(εim | z1, z2, . . . , zn)   (since Z collects the zi's)
= E(εim | zi)   (since (εim, zi) is independent of zj (j ≠ i))
= 0   (by the strengthened orthogonality conditions).
The (i, j) element of the n × n matrix E(εm εh′ | Z) is E(εim εjh | Z).
E(εim εjh | Z) = E(εim εjh | z1, z2, . . . , zn)
= E(εim εjh | zi, zj)   (since (εim, zi, εjh, zj) is independent of zk (k ≠ i, j)).
For j ≠ i, this becomes
E(εim εjh | zi, zj) = E[E(εim εjh | zi, zj, εjh) | zi, zj]   (by the Law of Iterated Expectations)
= E[εjh E(εim | zi, zj, εjh) | zi, zj]   (by linearity of conditional expectations)
= E[εjh E(εim | zi) | zi, zj]   (since (εim, zi) is independent of (εjh, zj))
= 0   (since E(εim | zi) = 0).
For j = i, E(εim εjh | Z) = E(εim εih | Z) = E(εim εih | zi). Since xim = xi and xi is the union of (zi1, . . . , ziM) in the SUR model, the conditional homoskedasticity assumption, Assumption 4.7, states that E(εim εih | zi) = E(εim εih | xi) = σmh.
(c)
(i) We need to show that Assumptions 4.1–4.5, 4.7 and (4.5.18) together imply Assumptions 1.1–1.3 and (1.6.1). Assumption 1.1 (linearity) is obviously satisfied. Assumption 1.2 (strict exogeneity) and (1.6.1) have been verified in 3(b). That leaves Assumption 1.3 (the rank condition that Z (defined in Analytical Exercise 1) be of full column rank). Since Z is block diagonal, it suffices to show that Zm is of full column rank for m = 1, 2, . . . , M. The proof goes as follows. By Assumption 4.5, S is non-singular. By Assumption 4.7 and the condition (implied by (4.5.18)) that the set of instruments be common across equations, we have S = Σ ⊗ E(xi xi′) (as in (4.5.9)). So the square matrix E(xi xi′) is non-singular. Since (1/n)X′X (where X is the n × K data matrix, as defined in Analytical Exercise 1) converges almost surely to E(xi xi′), the n × K data matrix X is of full column rank for sufficiently large n. Since Zm consists of columns selected from the columns of X, Zm is of full column rank as well.
(ii) The hint is the answer.
(iii) The unbiasedness of δ̂SUR follows from (i), (ii), and Proposition 1.7(a).
(iv) Avar(δ̂SUR) is (4.5.15), where Amh is given by (4.5.16′) on p. 280. The hint shows that it equals the plim of n·Var(δ̂SUR | Z).
(d) For the most part, the answer is a straightforward modification of the answer to (c). The only part that is not so straightforward is to show in part (i) that the Mn × L matrix Z is of full column rank. Let Dm be the Dm matrix introduced in the answer to (c), so zim = Dm′ xi and Zm = X Dm. Since the dimension of xi is K and that of zim is L, the matrix Dm is K × L. The (Σ_{m=1}^M Km) × L matrix Σxz in Assumption 4.4 can be written as
Σxz ((KM)×L) = [IM ⊗ E(xi xi′)] D,  where D ((KM)×L) ≡ [D1; D2; . . . ; DM] (the Dm stacked vertically).
Since Σxz is of full column rank by Assumption 4.4 and since E(xi xi′) is non-singular, D is of full column rank. So Z = (IM ⊗ X)D is of full column rank if X is of full column rank. X is of full column rank for sufficiently large n if E(xi xi′) is non-singular.
4. (a) Assumptions 4.1–4.5 imply that the Avar of the efficient multiple-equation GMM estimator is (Σxz′S⁻¹Σxz)⁻¹. Assumption 4.2 implies that the plim of Sxz is Σxz. Under Assumptions 4.1, 4.2, and 4.6, the plim of Ŝ is S.
(b) The claim to be shown is just a restatement of Propositions 3.4 and 3.5.
(c) Use (A9) and (A6) of the book's Appendix A. Sxz and W are block diagonal, so W Sxz(Sxz′W Sxz)⁻¹ is block diagonal. (d) If the same residuals are used in both the efficient equation-by-equation GMM and the efficient multiple-equation GMM, then the Ŝ in (∗∗) and the Ŝ in (Sxz′Ŝ⁻¹Sxz)⁻¹ are numerically the same. The rest follows from the inequality in the question and the hint.
(e) Yes. (f) The hint is the answer. 5. (a) For the LW 69 equation, the instruments (1 , MED ) are 2 in number while the number of the regressors is 3. So the order condition is not satisfied for the equation. (b) (reproducing the answer on pp. 320-321)
[ 1        E(S69)        E(IQ)
  1        E(S80)        E(IQ)
  E(MED)   E(S69·MED)    E(IQ·MED)
  E(MED)   E(S80·MED)    E(IQ·MED) ]  [β0; β1; β2]  =  [E(LW69); E(LW80); E(LW69·MED); E(LW80·MED)].
The condition for the system to be identified is that the 4 × 3 coefficient matrix is of full column rank.
(c) (reproducing the answer on p. 321) If IQ and MED are uncorrelated, then E( IQ · MED ) = E(IQ ) · E(MED ) and the third column of the coefficient matrix is E( IQ ) times the first column. So the matrix cannot be of full column rank.
6. (reproducing the answer on p. 321) ε̂im = yim − zim′δ̂m = εim − zim′(δ̂m − δm). So
(1/n) Σ_{i=1}^n [εim − zim′(δ̂m − δm)][εih − zih′(δ̂h − δh)] = (1) + (2) + (3) + (4),
where
(1) = (1/n) Σ_{i=1}^n εim εih,
(2) = −(δ̂m − δm)′ (1/n) Σ_{i=1}^n zim·εih,
(3) = −(δ̂h − δh)′ (1/n) Σ_{i=1}^n zih·εim,
(4) = (δ̂m − δm)′ [(1/n) Σ_{i=1}^n zim zih′] (δ̂h − δh).
As usual, under Assumptions 4.1 and 4.2, (1) →p σmh (≡ E(εim εih)). For (4), by Assumption 4.2 and the assumption that E(zim zih′) is finite, (1/n) Σi zim zih′ converges in probability to a (finite) matrix. So (4) →p 0. Regarding (2), by the Cauchy-Schwarz inequality,
E(|zimj·εih|) ≤ √(E(zimj²)·E(εih²)),
where zimj is the j-th element of zim. So E(zim·εih) is finite and (2) →p 0 because δ̂m − δm →p 0. Similarly, (3) →p 0.
7. (a) Let B, Sxz, and W be as defined in the hint. Also let
sxy ((MK)×1) ≡ [ (1/n) Σ_{i=1}^n xi·yi1 ; . . . ; (1/n) Σ_{i=1}^n xi·yiM ].
Then
δ̂3SLS = (Sxz′W Sxz)⁻¹ Sxz′W sxy
= [(I ⊗ B)′(Σ̂⁻¹ ⊗ Sxx⁻¹)(I ⊗ B)]⁻¹ (I ⊗ B)′(Σ̂⁻¹ ⊗ Sxx⁻¹) sxy
= [Σ̂⁻¹ ⊗ B′Sxx⁻¹B]⁻¹ (Σ̂⁻¹ ⊗ B′Sxx⁻¹) sxy
= [Σ̂ ⊗ (B′Sxx⁻¹B)⁻¹] (Σ̂⁻¹ ⊗ B′Sxx⁻¹) sxy
= [IM ⊗ (B′Sxx⁻¹B)⁻¹B′Sxx⁻¹] sxy
= [ (B′Sxx⁻¹B)⁻¹B′Sxx⁻¹ (1/n) Σ_{i=1}^n xi·yi1 ; . . . ; (B′Sxx⁻¹B)⁻¹B′Sxx⁻¹ (1/n) Σ_{i=1}^n xi·yiM ],
which is a stacked vector of 2SLS estimators.
(b) The hint is the answer.
8. (a) The efficient multiple-equation GMM estimator is
(Sxz′Ŝ⁻¹Sxz)⁻¹ Sxz′Ŝ⁻¹ sxy,
where Sxz and sxy are as defined in (4.2.2) on p. 266 and Ŝ⁻¹ is a consistent estimator of S⁻¹. Since xim = zim here, Sxz is square. So the above formula becomes
Sxz⁻¹ Ŝ (Sxz′)⁻¹ Sxz′ Ŝ⁻¹ sxy = Sxz⁻¹ sxy,
which is a stacked vector of OLS estimators.
(b) The SUR is efficient multiple-equation GMM under conditional homoskedasticity when the set of orthogonality conditions is E(zim · εih ) = 0 for all m, h. The OLS estimator derived above is (trivially) efficient multiple-equation GMM under conditional homoskedasticity when the set of orthogonality conditions is E(zim · εim ) = 0 for all m. Since the sets of orthogonality conditions differ, the efficient GMM estimators differ. 9. The hint is the answer (to derive the formula in (b) of the hint, use the SUR formula you derived in Analytical Exercise 2(b)).
10. (a) Avar(δ̂1,2SLS) = σ11 A11⁻¹.
(b) Avar(δ̂1,3SLS) equals G⁻¹. The hint shows that G = (1/σ11) A11.
11. Because there are as many orthogonality conditions as there are coefficients to be estimated, it is possible to choose δ̂ so that gn(δ̂) defined in the hint is a zero vector. Solving
(1/n) Σ_{i=1}^n zi1·yi1 + · · · + (1/n) Σ_{i=1}^n ziM·yiM − [(1/n) Σ_{i=1}^n zi1 zi1′ + · · · + (1/n) Σ_{i=1}^n ziM ziM′] δ̂ = 0
for δ̂, we obtain
δ̂ = [(1/n) Σ_{i=1}^n zi1 zi1′ + · · · + (1/n) Σ_{i=1}^n ziM ziM′]⁻¹ [(1/n) Σ_{i=1}^n zi1·yi1 + · · · + (1/n) Σ_{i=1}^n ziM·yiM],
which is none other than the pooled OLS estimator.
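A small numerical check, not part of the original answer, that the δ̂ solving the summed orthogonality conditions coincides with OLS on the data pooled across the M equations. The simulated data and parameter values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, L = 200, 3, 2
delta_true = np.array([1.0, -0.5])

Z = rng.standard_normal((M, n, L))                       # z_im, common coefficient vector
Y = Z @ delta_true + 0.3 * rng.standard_normal((M, n))   # y_im

# GMM with the summed moment conditions
A = sum(Z[m].T @ Z[m] for m in range(M)) / n
b = sum(Z[m].T @ Y[m] for m in range(M)) / n
delta_gmm = np.linalg.solve(A, b)

# Pooled OLS: stack the M equations
delta_pooled, *_ = np.linalg.lstsq(Z.reshape(M * n, L), Y.reshape(M * n), rcond=None)

print(np.allclose(delta_gmm, delta_pooled))              # True
```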
updated: 11/23/00
Hayashi Econometrics : Answers to Selected Review Questions
Chapter 7 Section 7.1
1. m(wt; θ) = −[yt − Φ(xt′θ)]².
2. Since E(yt | xt) = Φ(xt′θ0), we have
E[xt·(yt − Φ(xt′θ0)) | xt] = xt E[yt − Φ(xt′θ0) | xt] = 0.
Use the Law of Total Expectations. g(wt; θ) = xt·(yt − Φ(xt′θ)).
5. Qn is (7.1.3) with g (wt ; θ ) = x t · (yt − θ0 zt ). Qn is (7.1.3) with g (wt ; θ ) = x t · (zt − λ0 yt ). Section 7.2
2. Sufficiency is proved in the text. To show necessity, suppose (7.2.10) were false. Then there exists a θ1 ≠ θ0 in Θ such that φ(xt; θ1) = φ(xt; θ0). Then from (7.2.9), E[{yt − φ(xt; θ1)}²] = E[{yt − φ(xt; θ0)}²]. This is a contradiction because θ0 is the only maximizer.
3. What needs to be proved is: “E(xt xt′) nonsingular” ⇒ “xt′θ ≠ xt′θ0 for θ ≠ θ0”. Use the argument developed in Example 7.8.
4. What needs to be proved is: “E(xt xt′) nonsingular” ⇒ “Φ(xt′θ) ≠ Φ(xt′θ0) for θ ≠ θ0”. It was shown in the previous review question that the nonsingularity condition implies xt′θ ≠ xt′θ0 for θ ≠ θ0.
7. The Hessian matrix for linear GMM is negative definite. So the objective function is strictly concave.
8. So the identification condition is that E[g(wt; θ0)] = 0 and W E[g(wt; θ)] ≠ 0 for θ ≠ θ0.
Section 7.3
1. A better question would be as follows.
Consider a random sample (w1, . . . , wn). Let f(wt; θ0) be the density of wt, where θ0 is the p-dimensional true parameter vector. The log likelihood of the sample is
L(w1, . . . , wn; θ) = Σ_{t=1}^n log f(wt; θ).
Let rn(θ) be the score vector of this log likelihood function. That is, rn(θ) is the p-dimensional gradient of L. In Chapter 1, we defined the Cramer-Rao bound to be the inverse of E[rn(θ0) rn(θ0)′]. Define the asymptotic Cramer-Rao bound as the inverse of
J ≡ lim_{n→∞} (1/n) E[rn(θ0) rn(θ0)′].
Assume that all the conditions for the consistency and asymptotic normality of the (unconditional) maximum likelihood estimator are satisfied. Show that the asymptotic variance matrix of the ML estimator equals the asymptotic Cramer-Rao bound.
The answer is as follows. Define s(wt; θ) as the gradient of log f(wt; θ). Then
rn(θ) = Σ_{t=1}^n s(wt; θ).
Since E[s(wt; θ0)] = 0 and {s(wt; θ0)} is i.i.d., we have
E[rn(θ0) rn(θ0)′] = Var(rn(θ0)) = Σ_{t=1}^n Var(s(wt; θ0)) = n·E[s(wt; θ0) s(wt; θ0)′].
By the information matrix equality, it follows that
(1/n) E[rn(θ0) rn(θ0)′] = −E[H(wt; θ0)],
where H(wt; θ) is the Hessian of the log likelihood for observation t. Therefore, trivially, the limit as n → ∞ of (1/n) E[rn(θ0) rn(θ0)′] is −E[H(wt; θ0)], which is the inverse of the asymptotic variance matrix.
2.1 Review of Limit Theorems for Sequences of Random Variables

Four Modes of Convergence
Let (z1, z2, . . .) (written as {zn}) be a sequence of scalar random variables.
1. Convergence in probability. {zn} converges in probability to a constant (non-random) α if, for any ε > 0,
lim_{n→∞} Prob(|zn − α| > ε) = 0  or  lim_{n→∞} Prob(|zn − α| < ε) = 1.   (2.1.1)
The constant α is called the probability limit of zn and is written as “plim_{n→∞} zn = α” or “zn →p α”.
2. Almost sure convergence. {zn} converges almost surely to a constant α if
Prob(lim_{n→∞} zn = α) = 1.   (2.1.3)
We write this as “zn →a.s. α”.
3. Mean square convergence. {zn} converges in mean square (or in quadratic mean) to α (written as “zn →m.s. α”) if
lim_{n→∞} E[(zn − α)²] = 0.   (2.1.4)
4. Convergence in distribution. Let Fn be the cumulative distribution function (c.d.f.) of zn. {zn} converges in distribution to a random scalar z if the c.d.f. Fn of zn converges to the c.d.f. F of z at every continuity point of F. We write “zn →d z” or “zn →L z” and call F the asymptotic (or limit or limiting) distribution of zn. Sometimes we write “zn →d F,” when the distribution F is well-known. For example, “zn →d N(0, 1)” should read “zn →d z and the distribution of z is N(0, 1) (normal distribution with mean 0 and variance 1).”
• It can be shown from the definition that
“zn − zn′ →p 0” ⇒ “zn →d z whenever zn′ →d z.”   (2.1.6)
• Lemma 2.1 (convergence in distribution and in moments): Suppose that, for some δ > 0, E(|zn|^(s+δ)) < M < ∞ for all n. Let αsn be the s-th moment of zn and lim_{n→∞} αsn = αs, where αs is finite (i.e., a real number). Then:
“zn →d z” ⇒ “αs is the s-th moment of z.”
Extension to random vectors
• For “→p”, “→a.s.”, “→m.s.”, just require element-by-element convergence. For example,
zn →p α ⇔ znk →p αk for all k.
• For “→d”, the definition is: zn →d z if the joint c.d.f. Fn of the random vector zn converges to the joint c.d.f. F of z at every continuity point of F. Element-by-element convergence does not necessarily mean convergence for the vector sequence. That is, “each element of zn →d the corresponding element of z” does not necessarily imply “zn →d z.”
• Multivariate Convergence in Distribution Theorem:
“zn →d z” ⇔ “λ′zn →d λ′z for any K-dimensional vector of real numbers λ.”
Relation among the Four Modes of Convergence
Lemma 2.2 (relationship among the four modes of convergence):
(a) “zn →m.s. α” ⇒ “zn →p α.”
(b) “zn →a.s. α” ⇒ “zn →p α.”
(c) “zn →d α” ⇔ “zn →p α.” (That is, if the limiting random variable is a constant [a trivial random variable], convergence in distribution is the same as convergence in probability.)
Three Useful Results
Lemma 2.3 (preservation of convergence for continuous transformation): Suppose a(·) is a vector-valued continuous function that does not depend on n.
(a) (“Continuous Mapping Theorem”) “zn →p α” ⇒ “a(zn) →p a(α).” Stated differently, plim_{n→∞} a(zn) = a(plim_{n→∞} zn), provided the plim exists.
(b) “zn →d z” ⇒ “a(zn) →d a(z).”
Lemma 2.4:
(a) “xn →d x, yn →p α” ⇒ “xn + yn →d x + α.”
(b) “xn →d x, yn →p 0” ⇒ “yn′xn →p 0.”
(c) “xn →d x, An →p A” ⇒ “An xn →d A x,” provided that An and xn are conformable. In particular, if x ~ N(0, Σ), then An xn →d N(0, A Σ A′).
(d) “xn →d x, An →p A” ⇒ “xn′An⁻¹xn →d x′A⁻¹x,” provided that An and xn are conformable and A is nonsingular.
• Parts (a) and (b) are sometimes called Slutzky’s Theorem.
• By setting α = 0 and zn ≡ xn + yn, part (a) implies:
“xn →d x, zn − xn →p 0” ⇒ “zn →d x.”
If zn − xn →p 0, then we say that the two sequences {zn} and {xn} are asymptotically equivalent. Thus, if {zn} and {xn} are asymptotically equivalent and if xn →d x, then zn converges to the same random variable x.
Lemma 2.5 (the “delta method”): Suppose {xn} is a sequence of K-dimensional random vectors such that xn →p β and
√n(xn − β) →d z,
and suppose a(·): R^K → R^r has continuous first derivatives with A(β) denoting the r × K matrix of first derivatives evaluated at β:
A(β) (r×K) ≡ ∂a(β)/∂β′.
Then
√n[a(xn) − a(β)] →d A(β) z.
In particular:
“√n(xn − β) →d N(0, Σ)” ⇒ “√n[a(xn) − a(β)] →d N(0, A(β) Σ A(β)′).”
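A quick Monte Carlo sketch, not from the notes, illustrating Lemma 2.5 with the scalar map a(x) = log x. The data-generating process (an exponential distribution with assumed mean β = 2) is chosen only for illustration; with A(β) = 1/β the limit variance is Var(x)/β².

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1_000, 5_000
beta = 2.0                           # mean of Exponential(scale=2); its variance is beta**2

xbar = rng.exponential(scale=beta, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (np.log(xbar) - np.log(beta))

# Delta method: A(beta) = 1/beta, so the limit variance is beta**2 / beta**2 = 1
print(round(stat.var(), 2))          # close to 1
```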
Viewing Estimators as Sequences of Random Variables
• Let θ̂n be an estimator of a parameter vector θ based on a sample of size n. Thus {θ̂n} is a sequence of random vectors.
• θ̂n is consistent for θ if plim_{n→∞} θ̂n = θ, or θ̂n →p θ.
• θ̂n is asymptotically normal if √n(θ̂n − θ) →d N(0, Σ). The variance matrix Σ is called the asymptotic variance and is denoted Avar(θ̂n).
Laws of Large Numbers and Central Limit Theorems
For a sequence of random scalars {zi}, the sample mean z̄n is defined as
z̄n ≡ (1/n) Σ_{i=1}^n zi.
This creates a different sequence {z̄n}.
A Version of Chebychev’s Weak LLN:
“lim_{n→∞} E(z̄n) = µ, lim_{n→∞} Var(z̄n) = 0” ⇒ “z̄n →p µ.”
Kolmogorov’s Second Strong Law of Large Numbers: Let {zi} be i.i.d. with E(zi) = µ. Then z̄n →a.s. µ.
These LLNs extend readily to random vectors by requiring element-by-element convergence.
Lindeberg-Levy CLT: Let {zi} be i.i.d. with E(zi) = µ and Var(zi) = Σ. Then
√n(z̄n − µ) = (1/√n) Σ_{i=1}^n (zi − µ) →d N(0, Σ).
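A minimal Monte Carlo sketch, not part of the notes, of the Lindeberg-Levy CLT: sample means of i.i.d. Uniform(0, 1) draws (an assumed choice of distribution), scaled by √n, are approximately normal with variance Var(zi) = 1/12.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 10_000
mu = 0.5                                  # mean of Uniform(0, 1); its variance is 1/12

zbar = rng.uniform(size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (zbar - mu)

print(round(stat.var(), 4))               # close to 1/12 = 0.0833
print(round(np.mean(stat <= 0), 3))       # close to 0.5, as for the normal limit
```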
2.2 Fundamental Concepts in Time-Series Analysis
Define stochastic processes, time series, realizations, sample paths.

Need for Ergodic Stationarity
Why do we want to impose stationarity and ergodicity?

Stationary Processes
• Definition: A stochastic process {zi} (i = 1, 2, . . .) is (strictly) stationary if, for any given finite integer r and for any set of subscripts, i1, i2, . . . , ir, the joint distribution of (zi, zi1, zi2, . . . , zir) depends only on i1 − i, i2 − i, i3 − i, . . . , ir − i but not on i.
• The definition implies that any transformation (function) of a stationary process is itself stationary, that is, if {zi} is stationary, then {f(zi)} is.
• Example 2.1: i.i.d. sequences
• Example 2.2: constant series
• Example 2.3: element-wise vs. joint stationarity

Covariance Stationary Processes
• Definition. A stochastic process {zi} is weakly (or covariance) stationary if:
(i) E(zi) does not depend on i, and
(ii) Cov(zi, zi−j) exists, is finite, and depends only on j but not on i (for example, Cov(z1, z5) equals Cov(z12, z16)).
• Definition. The j-th order autocovariance, denoted Γj, is defined as
Γj ≡ Cov(zi, zi−j)  (j = 0, 1, 2, . . .).
Γj satisfies Γj = Γ−j′. The 0-th order autocovariance is the variance Γ0 = Var(zi).
• For a scalar covariance-stationary process, define the autocovariance matrix, the j-th order autocorrelation coefficient, and the correlogram.

White Noise Processes
• Definition. A covariance-stationary process {zi} is white noise if E(zi) = 0 and Cov(zi, zi−j) = 0 for j ≠ 0.
• Definition. An independent white noise process is an i.i.d. sequence with mean zero and finite variance.
• Example 2.4: A white noise process that is not an independent white noise process.
Ergodicity
• Loose definition. Heuristically, a stationary process is ergodic if it is asymptotically independent, that is, if any two random variables positioned far apart in the sequence are almost independently distributed.
• If {zi} is ergodic, then so is {f(zi)}.
• Definition. A stationary process that is ergodic will be called ergodic stationary.
• If {zi} is ergodic stationary, then so is {f(zi)}.
• Example: an independent white noise process.
• Ergodic Theorem: Let {zi} be a stationary and ergodic process with E(zi) = µ. Then
z̄n ≡ (1/n) Σ_{i=1}^n zi →a.s. µ.
• An important implication: If {zi} is ergodic stationary, then
(1/n) Σ_{i=1}^n f(zi) →a.s. E[f(zi)]  (provided the expectation exists).
Martingales
• Definition. A vector process {zi} is called a martingale if
E(zi | zi−1, zi−2, . . . , z1) = zi−1 for i ≥ 2.   (2.2.4)
• If zi includes xi, then {xi} is a (univariate) martingale.

Random Walks
• Definition. Let {gi} be a vector independent white noise process. A random walk, {zi}, is a sequence of cumulative sums:
z1 = g1, z2 = g1 + g2, . . . , zi = g1 + g2 + · · · + gi, · · · .   (2.2.6)
• A random walk is a martingale.

Martingale Difference Sequences
• Definition. A vector process {gi} with E(gi) = 0 is called a martingale difference sequence (m.d.s.) or martingale differences if the expectation conditional on its past values, too, is zero:
E(gi | gi−1, gi−2, . . . , g1) = 0 for i ≥ 2.   (2.2.9)
• An independent white noise process is an m.d.s.
• The cumulative sum {zi} created from a martingale difference sequence {gi} is a martingale. The converse is also true.
• Cov(gi, gi−j) = 0 for all i and j ≠ 0 if {gi} is an m.d.s.
Different Formulation of Lack of Serial Dependence
“independent white noise” ⇒ “stationary m.d.s. with finite variance” ⇒ “white noise.”

The CLT for Ergodic Stationary Martingale Difference Sequences
Ergodic Stationary Martingale Differences CLT (Billingsley (1961)): Let {gi} be a vector martingale difference sequence that is stationary and ergodic with E(gi gi′) = Σ, and let ḡ ≡ (1/n) Σ_{i=1}^n gi. Then
√n ḡ = (1/√n) Σ_{i=1}^n gi →d N(0, Σ).
Executive Summary of Sections 2.3–2.6

Large-Sample Distribution of the OLS Estimator
yt = xt′β + εt   (t = 1, 2, . . . , n),  xt (K×1), β (K×1).
Assumptions:
• (ergodic stationarity) {yt, xt} is jointly stationary and ergodic.
• (orthogonality conditions/predetermined regressors) E[xt·(yt − xt′β)] = 0, or equivalently E(gt) = 0, gt ≡ xt·εt.
• (rank condition) The K × K matrix E(xt xt′) is nonsingular (and hence finite).
• (mds) {gt} is a martingale difference sequence.
Comments on the assumptions
• mds is stronger than predetermined regressors.
• A trivial but important special case of ergodic stationarity is that the sample is a random sample ({yt, xt} is i.i.d.).
• (A sufficient condition for {gt} to be an m.d.s.) E(εt | εt−1, εt−2, . . . , ε1, xt, xt−1, . . . , x1) = 0. [Prove on the board.]
• (When the regressors include a constant)
– The orthogonality conditions can be stated in more familiar terms: the mean of the error term is zero (which is implied by E(xtk εt) = 0 for k = 1), and the contemporaneous correlation between the error term and the regressors is zero (which is implied by E(xtk εt) = 0 for k ≠ 1 and E(εt) = 0).
– {εt} is an m.d.s.: E(εt | εt−1, εt−2, . . . , ε1) = 0. Hence there is no serial correlation in the error term. [Explain on the board.]
Proposition 2.1 (asymptotic distribution of the OLS Estimator)
(a) (Consistency of b for β) Under ergodic stationarity, the orthogonality conditions (predetermined regressors), and the rank condition, plim_{n→∞} b = β.
(b) (Asymptotic Normality of b) If the orthogonality conditions are strengthened to the mds condition, then
√n(b − β) →d N(0, Avar(b))  as n → ∞,   (1)
where Avar(b) = Σxx⁻¹ S Σxx⁻¹, Σxx ≡ E(xt xt′), S = E(gt gt′), gt ≡ xt·εt.
[Prove on the board.]
Hypothesis Testing
Under some additional condition, Avar(b) is consistently estimated by
Avar̂(b) = Sxx⁻¹ ( (1/n) Σ_{t=1}^n et² xt xt′ ) Sxx⁻¹,  et ≡ yt − xt′b,  Sxx ≡ (1/n) Σ_{t=1}^n xt xt′,
where the matrix in the middle is Ŝ.
Therefore:
(a) Under the null hypothesis H0: βk = β̄k,
tk ≡ √n(bk − β̄k)/√(Avar̂(bk)) = (bk − β̄k)/SE*(bk) →d N(0, 1),  with SE*(bk) ≡ √((1/n)·Avar̂(bk)).
(b) Under the null hypothesis H0: Rβ = r, where R is an #r × K matrix (where #r, the dimension of r, is the number of restrictions) of full row rank,
W ≡ n·(Rb − r)′{R[Avar̂(b)]R′}⁻¹(Rb − r) →d χ²(#r).
[Explain on the board.]
Note: The denominator in the t-ratio is called the heteroskedasticity-consistent standard error, robust standard error, etc. The t-ratio is called the robust t-ratio.
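A compact sketch, not from the notes, computing b, the heteroskedasticity-robust Avar̂(b), the robust standard errors, and the Wald statistic W for H0: Rβ = r exactly as in the formulas above. The simulated data, the heteroskedasticity pattern, and the particular restriction are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 500, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, K - 1))])
beta = np.array([1.0, 2.0, 0.0])
eps = (1 + 0.5 * np.abs(X[:, 1])) * rng.standard_normal(n)   # conditionally heteroskedastic
y = X @ beta + eps

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
Sxx = X.T @ X / n
S_hat = (X * e[:, None] ** 2).T @ X / n                      # (1/n) sum e_t^2 x_t x_t'
Avar = np.linalg.solve(Sxx, S_hat) @ np.linalg.inv(Sxx)      # Sxx^{-1} S_hat Sxx^{-1}
se = np.sqrt(np.diag(Avar) / n)                              # robust standard errors
t = b / se                                                   # robust t-ratios for beta_k = 0

R, r = np.array([[0.0, 0.0, 1.0]]), np.array([0.0])          # H0: beta_3 = 0
diff = R @ b - r
W = n * diff @ np.linalg.solve(R @ Avar @ R.T, diff)         # approx chi^2(1) under H0
print(se, t, W)
```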
Implications of Conditional Homoskedasticity
What is the relationship between the usual t and the robust t? Between F and the Wald statistic?
• (conditional homoskedasticity) E(εt² | xt) = σ².
• Implication for S = E(gt gt′) = E(xt xt′ εt²):  S = σ² Σxx.
Accordingly,
Ŝ = s² Sxx,
where s² ≡ (1/(n − K)) Σ_{t=1}^n et² is the OLS estimate of σ².

                        without conditional homoskedasticity         with conditional homoskedasticity
 b                      Sxx⁻¹ sxy                                    same
 Avar(b)                Σxx⁻¹ S Σxx⁻¹                                σ² Σxx⁻¹
 S                      E(εt² xt xt′)                                σ² Σxx
 Avar̂(b)                Sxx⁻¹ Ŝ Sxx⁻¹                                s² Sxx⁻¹ = n·s²(X′X)⁻¹
 Ŝ                      (1/n) Σt et² xt xt′                          s² Sxx
 SE*(bk)                √((1/n)·[Sxx⁻¹ Ŝ Sxx⁻¹]kk)                   √((1/n)·[s² Sxx⁻¹]kk) = √(s²[(X′X)⁻¹]kk)
 W for H0: Rβ = r       n·(Rb − r)′{R[Avar̂(b)]R′}⁻¹(Rb − r)          (Rb − r)′{R(X′X)⁻¹R′}⁻¹(Rb − r)/s²

Note: Σxx ≡ E(xt xt′), Sxx ≡ (1/n) Σ_{t=1}^n xt xt′, sxy ≡ (1/n) Σ_{t=1}^n xt·yt, et ≡ yt − xt′b, s² →p σ² ≡ E(εt²).
2.10 Testing for Serial Correlation

Box-Pierce and Ljung-Box
Suppose we have a sample of size n, {z1, . . . , zn}, drawn from a scalar covariance-stationary process.
• The sample j-th order autocovariance is
γ̂j ≡ (1/n) Σ_{t=j+1}^n (zt − z̄n)(zt−j − z̄n)  (j = 0, 1, . . .),  where z̄n ≡ (1/n) Σ_{t=1}^n zt.
• The sample j-th order autocorrelation coefficient, ρ̂j, is defined as
ρ̂j ≡ γ̂j/γ̂0  (j = 1, 2, . . .).
• If {zt} is ergodic stationary, then it is easy to show that γ̂j is consistent for γj (j = 0, 1, 2, . . .) and ρ̂j is consistent for ρj (j = 1, 2, . . .). For testing purposes, we need to know the asymptotic distribution of √n ρ̂j.

Proposition 2.9: Suppose zt can be written as µ + εt, where {εt} is a stationary martingale difference sequence with “own” conditional homoskedasticity:
(own conditional homoskedasticity) E(εt² | εt−1, εt−2, . . .) = σ², σ² > 0.
Then
√n γ̂ →d N(0, σ⁴ Ip)  and  √n ρ̂ →d N(0, Ip),
where γ̂ = (γ̂1, γ̂2, . . . , γ̂p)′ and ρ̂ = (ρ̂1, ρ̂2, . . . , ρ̂p)′.
From this, it is easy to show:
• √n ρ̂1 = ρ̂1/(1/√n) →d N(0, 1). 1/√n is the “standard error”.
• Box-Pierce Q statistic ≡ n Σ_{j=1}^p ρ̂j² = Σ_{j=1}^p (√n ρ̂j)² →d χ²(p).
• Ljung-Box Q statistic ≡ n·(n + 2) Σ_{j=1}^p ρ̂j²/(n − j) = Σ_{j=1}^p ((n + 2)/(n − j))·(√n ρ̂j)² →d χ²(p).
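A short sketch, not in the notes, computing the sample autocorrelations ρ̂j and the Box-Pierce and Ljung-Box Q statistics exactly as defined above; scipy is used only for the chi-square p-value, and the white-noise data are an assumption for illustration.

```python
import numpy as np
from scipy import stats

def ljung_box(z, p):
    """Sample autocorrelations and Q statistics as in Proposition 2.9."""
    z = np.asarray(z, dtype=float)
    n, zbar = len(z), z.mean()
    gamma = np.array([np.sum((z[j:] - zbar) * (z[: n - j] - zbar)) / n
                      for j in range(p + 1)])            # gamma_hat_0, ..., gamma_hat_p
    rho = gamma[1:] / gamma[0]
    q_bp = n * np.sum(rho ** 2)                           # Box-Pierce
    q_lb = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, p + 1)))  # Ljung-Box
    pval = 1 - stats.chi2.cdf(q_lb, df=p)
    return rho, q_bp, q_lb, pval

rng = np.random.default_rng(5)
z = rng.standard_normal(500)                              # white noise, so Q should be small
rho, q_bp, q_lb, pval = ljung_box(z, p=12)
print(round(q_lb, 2), round(pval, 3))
```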
Sample Autocorrelations Calculated from Residuals
Proposition 2.10 (testing for serial correlation with predetermined regressors): Suppose that ergodic stationarity, the rank condition, and
(stronger form of predeterminedness) E(εt | εt−1, εt−2, . . . , xt, xt−1, . . .) = 0,
(stronger form of conditional homoskedasticity) E(εt² | εt−1, εt−2, . . . , xt, xt−1, . . .) = σ² > 0
are satisfied. Then nR² from the following auxiliary regression:
regress et on xt, et−1, et−2, . . . , et−p
is asymptotically χ²(p).
• To run this auxiliary regression for t = 1, 2, . . . , n, we need data on (e0, e−1, . . . , e−p+1). It does not matter asymptotically which particular numbers we assign to them, but it seems sensible to set them equal to 0, their expected value.
• The test based on nR² is called the Breusch-Godfrey test for serial correlation. When p = 1, the test statistic is asymptotically equivalent to the square of what is known as Durbin’s h statistic.
2.11 Application: Rational Expectations Econometrics
“An efficient capital market is a market that is efficient in processing information. In an efficient market, prices ‘fully reflect’ available information” (Fama, 1976, p. 133).

The Efficient Market Hypothesis for the U.S. T-Bill Market
• Notation.
vt ≡ price of a one-month Treasury bill at the beginning of month t,
Rt ≡ one-month nominal interest rate over month t, i.e., the nominal return on the bill over the month = (1 − vt)/vt, so vt = 1/(1 + Rt),
Pt ≡ CPI at the beginning of month t,
πt+1 ≡ inflation rate over month t (i.e., from the beginning of month t to the beginning of month t + 1) = (Pt+1 − Pt)/Pt,
tπt+1 ≡ expected inflation rate over month t, expectation formed at the beginning of month t,
ηt+1 ≡ inflation forecast error = πt+1 − tπt+1,
rt+1 ≡ ex-post real interest rate over month t = (1/Pt+1)/(vt/Pt) − 1 = (1 + Rt)/(1 + πt+1) − 1 ≈ Rt − πt+1,
trt+1 ≡ ex-ante real interest rate = (1 + Rt)/(1 + tπt+1) − 1 ≈ Rt − tπt+1.
Figure 2.2: What’s Observed When
• The efficient market hypothesis is a joint hypothesis combining:
Rational Expectations. Inflationary expectations are rational: tπt+1 = E(πt+1 | It), where It is information available at the beginning of month t and includes {Rt, Rt−1, . . . , πt, πt−1, . . .}. Also, It ⊇ It−1 ⊇ It−2 ⊇ . . .. That is, agents participating in the market do not forget.
Constant Real Rates. The ex-ante real interest rate is constant: trt+1 = r.

The Data
Monthly data on the one-month T-bill rate and the monthly CPI inflation rate, stated in percent at annual rates. To duplicate Fama’s results, we take the sample period to be the same as in Fama (1975), which is January 1953 through July 1971. The sample size is thus 223.
Figure 2.3: Inflation and Interest Rates
Figure 2.4: Real Interest Rates
Testable Implications
Implication 1: The ex-post real interest rate has a constant mean and is serially uncorrelated. More precisely, rt+1 = r − ηt+1 and {ηt} is an m.d.s.
Implication 2: E(πt+1 | It) = −r + Rt.

Testing for Serial Correlation
• Use Proposition 2.9. Calculate ρ̂j and Q as
γ̂j ≡ (1/n) Σ_{t=j+1}^n (zt − z̄n)(zt−j − z̄n),  z̄n ≡ (1/n) Σ_{t=1}^n zt,  ρ̂j ≡ γ̂j/γ̂0  (j = 1, 2, . . .),
Ljung-Box Q statistic ≡ n·(n + 2) Σ_{j=1}^p ρ̂j²/(n − j) = Σ_{j=1}^p ((n + 2)/(n − j))·(√n ρ̂j)² →d χ²(p).
• Results. Table 2.1.

Table 2.1: Real Interest Rates, January 1953–July 1971
mean = 0.82%, standard deviation = 2.847%, sample size = 223

 j            1       2       3       4       5       6       7       8       9      10      11      12
 ρ̂j         −0.101   0.172  −0.019  −0.004  −0.064  −0.021  −0.092   0.095   0.094   0.019   0.004   0.207
 std. error   0.067   0.067   0.067   0.067   0.067   0.067   0.067   0.067   0.067   0.067   0.067   0.067
 Ljung-Box Q  2.3     9.1     9.1     9.1    10.1    10.2    12.1    14.2    16.3    16.4    16.4    26.5
 p-value (%) 12.8     1.1     2.8     5.8     7.3    11.7     9.6     7.6     6.1     8.9    12.8     0.9
Is the Nominal Interest Rate the Optimal Predictor?
• We want to test the second implication:
E(πt+1 | It) = constant + Rt.   (∗)
Here, −r is replaced by “constant” to emphasize that there is no restriction on the intercept.
• yt ≡ πt+1, xt ≡ (1, Rt)′ in
yt = xt′β + εt.
H0: β2 = 1 (the Rt coefficient is unity). Then E(εt | It) = 0 (actually, εt = ηt+1) because:
εt ≡ yt − xt′β = πt+1 − β1 − β2 Rt   (by our choice of (yt, xt))
= πt+1 − β1 − Rt   (by our null hypothesis that the Rt coefficient is unity)
= πt+1 − β1 − [E(πt+1 | It) − β1]   (by (∗) and if we call the unknown constant in (∗) β1)
= πt+1 − E(πt+1 | It).
– The orthogonality conditions ⇐ the mds assumption ⇐ E(εt | εt−1, εt−2, . . . , xt, xt−1, . . .) = 0.
– The rank condition. E(xt xt′) here is
E(xt xt′) = [1, E(Rt); E(Rt), E(Rt²)].
Its determinant is E(Rt²) − [E(Rt)]² = Var(Rt), which must be positive; if the variance were zero, we would not observe the interest rate Rt fluctuating.
– Ergodic stationarity? See Figure 2.3.
• The estimated regression is, for t = 1/53, . . . , 7/71,
πt+1 = −0.868 + 1.015 Rt,
        (0.431)  (0.112)
R² = 0.24, mean of dependent variable = 2.35%, SER = 2.84%, n = 223. The heteroskedasticity-robust standard errors are in parentheses.
• Another choice of (yt, xt) and H0 is the following: yt = πt+1, xt = (1, Rt, zt)′ with zt ∈ It.
H0: the Rt coefficient is unity and the zt coefficient is zero.
6.1 Modeling Serial Correlation: Linear Processes
• Chapter 2 introduced the following concepts.
Definition: A stochastic process {zt} (t = 1, 2, . . .) is (strictly) stationary if, for any given finite integer r and for any set of subscripts, t1, t2, . . . , tr, the joint distribution of (zt, zt1, zt2, . . . , ztr) depends only on t1 − t, t2 − t, t3 − t, . . . , tr − t but not on t.
Definition: A stochastic process {zt} is weakly (or covariance) stationary if:
(i) E(zt) does not depend on t, and
(ii) Cov(zt, zt−j) (j = 0, 1, 2, . . .) exists, is finite, and depends only on j but not on t (for example, Cov(z1, z5) equals Cov(z12, z16)).
Definition: For a covariance-stationary process, the j-th order autocovariance, denoted γj, is defined as
γj ≡ Cov(zt, zt−j) = E[(zt − µ)(zt−j − µ)]  (j = 0, 1, 2, . . .).
By definition, γj satisfies γj = γ−j. The 0-th order autocovariance is the variance γ0 = Var(zt).
Definition: For a covariance-stationary process, the j-th order autocorrelation coefficient, denoted ρj, is defined as ρj ≡ γj/γ0.
Definition: A covariance-stationary process {εt} is white noise if E(εt) = 0 and Cov(εt, εt−j) = 0 for j ≠ 0.
Definition: An independent white noise process is an i.i.d. sequence with mean zero and finite variance.
• This chapter introduces:
1. A very important class of covariance-stationary processes, called linear processes, which can be created by taking a moving average of a white noise process. Useful for describing serially correlated processes.
2. An apparatus called the filter – useful for describing linear processes.
3. A LLN and a CLT for linear processes, extending Billingsley’s CLT to incorporate serial correlation.
4. An extension of GMM to the case where {gt} is serially correlated.
MA(q)
• A process {yt} is called the q-th order moving average process (MA(q)) if it can be written as a weighted average of the current and most recent q values of a white noise process:
yt = µ + θ0 εt + θ1 εt−1 + · · · + θq εt−q  with θ0 = 1.   (6.1.1)
• Covariance-stationary?
– OK with mean µ.
– Autocovariances? The j-th order autocovariance, γj (≡ E[(yt − µ)(yt−j − µ)]), is
γj = (θj θ0 + θj+1 θ1 + · · · + θq θq−j) σ² = σ² Σ_{k=0}^{q−j} θj+k θk  for j = 0, 1, . . . , q,   (6.1.2a)
γj = 0 for j > q,   (6.1.2b)
where σ² ≡ E(εt²). (This formula also covers γ−j because γ−j = γj.)
So MA(q) is covariance-stationary.
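A small sketch, not in the notes, that evaluates (6.1.2) for an assumed MA(2) (θ = (1, 0.4, −0.3), σ² = 1 are illustrative choices) and checks it against sample autocovariances from a long simulated path.

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.array([1.0, 0.4, -0.3])        # theta_0 = 1, theta_1, theta_2: an assumed MA(2)
sigma2, q = 1.0, len(theta) - 1

def ma_autocov(j):
    # (6.1.2a)-(6.1.2b): gamma_j = sigma^2 * sum_k theta_{j+k} theta_k for j <= q, 0 for j > q
    return sigma2 * np.sum(theta[j:] * theta[: q + 1 - j]) if j <= q else 0.0

# Simulate y_t = theta(L) eps_t and compare
n = 200_000
eps = np.sqrt(sigma2) * rng.standard_normal(n + q)
y = sum(theta[k] * eps[q - k : n + q - k] for k in range(q + 1))

for j in range(4):
    sample = np.mean((y[j:] - y.mean()) * (y[: n - j] - y.mean()))
    print(j, round(ma_autocov(j), 3), round(sample, 3))
```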
MA(∞) as a Mean Square Limit
• A natural generalization of MA(q) is MA(∞). A sequence of real numbers {ψj} is absolutely summable if Σ_{j=0}^∞ |ψj| < ∞.
• Let ztn ≡ Σ_{j=0}^n ψj εt−j. Analytical Exercises 1 and 2 show that, under absolute summability of {ψj},
ztn →m.s. zt as n → ∞ for each t.
We use the notation Σ_{j=0}^∞ ψj εt−j for the mean-square limit zt.

Proposition 6.1 (MA(∞) with absolutely summable coefficients): Let {εt} be white noise and {ψj} be a sequence of real numbers that is absolutely summable. Then
(a) For each t,
yt = µ + Σ_{j=0}^∞ ψj εt−j   (6.1.6)
converges in mean square. {yt} is covariance-stationary. (The process {yt} is called the infinite-order moving-average process (MA(∞)).)
(b) The mean of yt is µ. The autocovariances {γj} are given by
γj = (ψj ψ0 + ψj+1 ψ1 + ψj+2 ψ2 + · · ·) σ² = σ² Σ_{k=0}^∞ ψj+k ψk  (j = 0, 1, 2, . . .).   (6.1.7)
(c) The autocovariances are absolutely summable: Σ_{j=0}^∞ |γj| < ∞.
(d) If, in addition, {εt} is i.i.d., then the process {yt} is (strictly) stationary and ergodic.
• This result includes MA(q) processes as a special case with ψj = θj for j = 0, 1, . . . , q and ψj = 0 for j > q.
• Compare (6.1.7) with (6.1.2). (6.1.7) is the limit of (6.1.2).

Proposition 6.2 (Filtering covariance-stationary processes): Let {xt} be a covariance-stationary process and {hj} be a sequence of real numbers that is absolutely summable. Then
(a) For each t, the infinite sum
yt = Σ_{j=0}^∞ hj xt−j   (6.1.9)
converges in mean square. The process {yt} is covariance-stationary.
(b) If, furthermore, the autocovariances of {xt} are absolutely summable, then so are the autocovariances of {yt}.
Filters
• Define the lag operator L by the relation L^j xt = xt−j. Then the weighted average Σ_{j=0}^∞ αj xt−j can be written as α(L)xt, where
α(L) ≡ α0 + α1 L + α2 L² + · · · .   (6.1.10)
This power series in L, α(L), is called a filter.
• If αp ≠ 0 and αj = 0 for j > p, the filter reduces to a p-th degree lag polynomial or a p-th order lag polynomial or a lag polynomial of order p:
α(L) = α0 + α1 L + · · · + αp L^p.
• Let α(L) and β(L) be two filters. Define the product δ(L) = α(L)β(L) by the convolution:
δ0 = α0 β0,
δ1 = α0 β1 + α1 β0,
δ2 = α0 β2 + α1 β1 + α2 β0,
· · · ,
δj = α0 βj + α1 βj−1 + α2 βj−2 + · · · + αj−1 β1 + αj β0,
· · · .   (6.1.12)
• Why such a complicated formula? Because δj is the coefficient of z^j in α(z)β(z). Illustrate.
• By definition, the product is commutative: α(L)β(L) = β(L)α(L). This won’t generalize to the multivariate case.
• If α(L) and β(L) are absolutely summable, δ(L) = α(L)β(L), and {xt} is covariance-stationary, then
α(L)[β(L)xt] = δ(L)xt.   (6.1.13)
(It can be shown that δ(L) is absolutely summable.)
• The inverse of α(L), denoted α(L)⁻¹ or 1/α(L), is defined by: α(L)⁻¹ is a filter satisfying
α(L) α(L)⁻¹ = 1.   (6.1.14)
As long as α0 ≠ 0 in α(L) = α0 + α1 L + · · ·, the inverse of α(L) can be defined for any arbitrary sequence {αj}. To see this, just set δ0 = 1, δj = 0 for j ≥ 1 in the convolution. We can calculate the coefficients in the inverse filter recursively.
• Example: invert the filter 1 − L.
• It is easy to prove
α(L) α(L)⁻¹ = α(L)⁻¹ α(L) (so inverses are commutative),   (6.1.15a)
“α(L)β(L) = δ(L)” ⇔ “β(L) = α(L)⁻¹δ(L)” ⇔ “α(L) = δ(L)β(L)⁻¹,”   (6.1.15b)
provided α0 ≠ 0 and β0 ≠ 0. (See Review Question 3 for proof of (6.1.15b).)

Inverting Lag Polynomials
Important special case: invert a p-th degree lag polynomial φ(L):
φ(L) = 1 − φ1 L − φ2 L² − · · · − φp L^p.
Since φ0 = 1 ≠ 0, the inverse can be defined. Let ψ(L) ≡ φ(L)⁻¹, so φ(L)ψ(L) = 1.
• By definition, ψ(L) satisfies
constant:  ψ0 = 1,
L:         ψ1 − φ1 ψ0 = 0,
L²:        ψ2 − φ1 ψ1 − φ2 ψ0 = 0,
· · ·
L^(p−1):   ψ(p−1) − φ1 ψ(p−2) − φ2 ψ(p−3) − · · · − φ(p−1) ψ0 = 0,
L^p:       ψp − φ1 ψ(p−1) − φ2 ψ(p−2) − · · · − φ(p−1) ψ1 − φp ψ0 = 0,
L^(p+1):   ψ(p+1) − φ1 ψp − φ2 ψ(p−1) − · · · − φ(p−1) ψ2 − φp ψ1 = 0,
L^(p+2):   ψ(p+2) − φ1 ψ(p+1) − φ2 ψp − · · · − φ(p−1) ψ3 − φp ψ2 = 0,
· · · .   (6.1.16)
• For sufficiently large j (actually for j ≥ p), {ψj} follows the p-th order homogeneous difference equation
ψj − φ1 ψ(j−1) − φ2 ψ(j−2) − · · · − φ(p−1) ψ(j−p+1) − φp ψ(j−p) = 0  (j = p, p + 1, p + 2, . . .).   (6.1.17)
• Fact from the Theory of Difference Equations: The solution sequence {ψj} to (6.1.17) eventually starts declining at a geometric rate if what is known as the stability condition holds. The condition states: All the roots of the p-th degree polynomial equation in z
φ(z) = 0 where φ(z) ≡ 1 − φ1 z − φ2 z² − · · · − φp z^p   (6.1.18)
are greater than 1 in absolute value. The polynomial equation φ(z) = 0 is called the characteristic equation.

Proposition 6.3 (Absolutely summable inverses of lag polynomials): Consider a p-th degree lag polynomial φ(L) = 1 − φ1 L − φ2 L² − · · · − φp L^p, and let ψ(L) ≡ φ(L)⁻¹. If the associated p-th degree polynomial equation φ(z) = 0 satisfies the stability condition (6.1.18), then the coefficient sequence {ψj} of ψ(L) is bounded in absolute value by a geometrically declining sequence and hence is absolutely summable.
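A sketch, not in the notes, that computes the coefficients ψj of ψ(L) = φ(L)⁻¹ by the recursion (6.1.16)–(6.1.17) and checks the stability condition (6.1.18) via the roots of φ(z). The polynomial φ(L) = 1 − 1.2L + 0.35L² is an assumed example.

```python
import numpy as np

def invert_lag_polynomial(phi, n_terms=20):
    """psi_j coefficients of 1/phi(L), where phi(L) = 1 - phi_1 L - ... - phi_p L^p."""
    p = len(phi)
    psi = np.zeros(n_terms)
    psi[0] = 1.0
    for j in range(1, n_terms):                      # recursion from (6.1.16)
        psi[j] = sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, p) + 1))
    return psi

phi = [1.2, -0.35]                                   # phi(L) = 1 - 1.2 L + 0.35 L^2
coeffs = np.r_[[-c for c in phi[::-1]], 1.0]         # [-phi_p, ..., -phi_1, 1]: phi(z), highest power first
print(np.abs(np.roots(coeffs)))                      # both roots > 1: stability condition holds
print(invert_lag_polynomial(phi)[:6])                # 1, 1.2, 1.09, ... geometrically declining
```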
6.2 ARMA Processes

AR(1) and Its MA(∞) Representation
• Several ways to write AR(1):
yt = c + φ yt−1 + εt  or  yt − φ yt−1 = c + εt  or  (1 − φL)yt = c + εt,   (6.2.1)
where {εt} is white noise. If φ ≠ 1, let µ ≡ c/(1 − φ) and rewrite this equation as
(yt − µ) − φ·(yt−1 − µ) = εt  or  (1 − φL)(yt − µ) = εt.   (6.2.1′)
• (6.2.1) + covariance stationarity pins down yt.
1. |φ| < 1, i.e., 1 − φz satisfies the stability condition. Then
yt − µ = (1 − φL)⁻¹εt = (1 + φL + φ²L² + · · ·)εt = Σ_{j=0}^∞ φ^j εt−j  or  yt = µ + Σ_{j=0}^∞ φ^j εt−j.   (6.2.2)
2. |φ| > 1. Then
yt = µ − Σ_{j=1}^∞ φ^(−j) εt+j.   (6.2.4)
3. |φ| = 1. There is no covariance-stationary process satisfying the AR(1) equation.
• Autocovariances of AR(1). Use the MA(∞) representation to obtain
γj = (φ^j + φ^(j+2) + φ^(j+4) + · · ·)σ² = φ^j σ²/(1 − φ²)  and  ρj (≡ γj/γ0) = φ^j  (j = 0, 1, . . .).   (6.2.5)
Alternative method: the “Yule-Walker equations” (Analytical Exercise 5).

AR(p) and Its MA(∞) Representation
• Various ways to write AR(p):
yt = c + φ1 yt−1 + · · · + φp yt−p + εt  or  yt − φ1 yt−1 − · · · − φp yt−p = c + εt
or  φ(L)yt = c + εt  with φ(L) = 1 − φ1 L − φ2 L² − · · · − φp L^p,   (6.2.6)
where φp ≠ 0. If φ(1) ≠ 0, let µ ≡ c/(1 − φ1 − · · · − φp) = c/φ(1). Then
(yt − µ) − φ1·(yt−1 − µ) − · · · − φp·(yt−p − µ) = εt  or  φ(L)(yt − µ) = εt.   (6.2.6′)
The generalization to AR(p) of what we have derived for AR(1) is

Proposition 6.4 (AR(p) as MA(∞) with absolutely summable coefficients): Suppose the p-th degree polynomial φ(z) satisfies the stationarity (stability) condition (6.1.18) or (6.1.18′). Then
(a) The unique covariance-stationary solution to the p-th order stochastic difference equation (6.2.6) or (6.2.6′) has the MA(∞) representation
yt = µ + ψ(L)εt,  ψ(L) = ψ0 + ψ1 L + ψ2 L² + ψ3 L³ + · · · ,   (6.2.7)
where ψ(L) = φ(L)⁻¹. The coefficient sequence {ψj} is bounded in absolute value by a geometrically declining sequence and hence is absolutely summable.
(b) The mean µ of the process is given by
µ = φ(1)⁻¹c, where c is the constant in (6.2.6).   (6.2.8)
(c) {γj} is bounded in absolute value by a sequence that declines geometrically with j. Hence, the autocovariances are absolutely summable.
• Prove (a), (b), and (c).
• Autocovariances. Use either the MA(∞) representation or the Yule-Walker equations.

ARMA(p, q)
An ARMA(p, q) process combines AR(p) and MA(q):
φ(L)yt = c + θ(L)εt,  φ(L) = 1 − φ1 L − · · · − φp L^p,  θ(L) = θ0 + θ1 L + · · · + θq L^q,   (6.2.9)
where {εt} is white noise. If φ(1) ≠ 0, set µ = c/φ(1). The deviation-from-the-mean form is
φ(L)(yt − µ) = θ(L)εt.   (6.2.9′)

Proposition 6.5 (ARMA(p, q) as MA(∞) with absolutely summable coefficients): Suppose the p-th degree polynomial φ(z) satisfies the stationarity (stability) condition (6.1.18) or (6.1.18′). Then
(a) The unique covariance-stationary solution to the p-th order stochastic difference equation (6.2.9) or (6.2.9′) has the MA(∞) representation
yt = µ + ψ(L)εt,  ψ(L) = ψ0 + ψ1 L + ψ2 L² + ψ3 L³ + · · · ,   (6.2.10)
where ψ(L) ≡ φ(L)⁻¹θ(L). The coefficient sequence {ψj} is bounded in absolute value by a geometrically declining sequence and hence is absolutely summable.
(b) The mean µ of the process is given by
µ = φ(1)⁻¹c, where c is the constant in (6.2.9).   (6.2.11)
(c) {γj} is bounded in absolute value by a sequence that declines geometrically with j. Hence, the autocovariances are absolutely summable.
• Prove (a), (b), and (c).
• Autocovariances. Use either the MA(∞) representation or the Yule-Walker equations.
• ARMA(p, q) with Common Roots. For reasons of parsimony, ARMA equations with common roots are rarely used to parameterize covariance-stationary processes. Provide example.
• Invertibility. ARMA(p, q) is said to satisfy the invertibility condition if θ(z) satisfies the stability condition (that the roots of θ(z) = 0 lie outside the unit circle). Under this condition, ARMA(p, q) can be represented as an AR(∞):
θ(L)⁻¹φ(L)yt = c/θ(1) + εt.   (6.2.13)
6.3 Vector Processes
• A vector white noise process {εt} is a jointly covariance-stationary process satisfying
E(εt) = 0,  E(εt εt′) = Ω (positive definite),  E(εt εt−j′) = 0 for j ≠ 0.   (6.3.1)
• A vector MA(∞) process is the obvious vector version of (6.1.6):
yt = µ + Σ_{j=0}^∞ Ψj εt−j  with Ψ0 = I.   (6.3.2)
• The sequence of coefficient matrices is said to be absolutely summable if each element is absolutely summable. That is,
“{Ψj} is absolutely summable” ⇔ “Σ_{j=0}^∞ |ψkℓj| < ∞ for all (k, ℓ).”   (6.3.3)
• Proposition 6.1 generalizes to the multivariate case in an obvious way. In particular, if Γj (≡ E[(yt − µ)(yt−j − µ)′]) is the j-th order autocovariance matrix, then the expression for the autocovariances in part (b) of Proposition 6.1 becomes
Γj = Σ_{k=0}^∞ Ψj+k Ω Ψk′  (j = 0, 1, 2, . . .).   (6.3.4)
• A multivariate filter can be written as
H(L) = H0 + H1 L + H2 L² + · · · ,
where {Hj} is a sequence of (not necessarily square) matrices.
• The multivariate version of Proposition 6.2 is obvious, with yt = H(L)xt.
• Let A(L) and B(L) be two filters where {Aj} is m × r and {Bj} is r × s, so that the matrix product Aj Bk can be defined. The product of the two filters, D(L) = A(L)B(L), is an m × s filter whose coefficient matrix sequence {Dj} is given by the multivariate version of the convolution formula (6.1.12):
D0 = A0 B0,
D1 = A0 B1 + A1 B0,
D2 = A0 B2 + A1 B1 + A2 B0,
· · · ,
Dj = A0 Bj + A1 B(j−1) + A2 B(j−2) + · · · + A(j−1) B1 + Aj B0,
· · · .   (6.3.5)
• Let A(L) and B(L) be two filters whose coefficient matrices are square. B(L) is said to be the inverse of A(L) and is denoted A(L)⁻¹ if
A(L) B(L) = I.   (6.3.6)
• Inverses are commutative.
• A p-th degree lag matrix polynomial is
Φ(L) = Ir − Φ1 L − Φ2 L² − · · · − Φp L^p,   (6.3.7)
where {Φj} is a sequence of r × r matrices with Φp ≠ 0.
• The stability condition is: All the roots of the determinantal equation
|I − Φ1 z − · · · − Φp z^p| = 0   (6.3.8)
are greater than 1 in absolute value (i.e., lie outside the unit circle).
• 2-by-2 example.
• With the stability condition thus generalized, Proposition 6.3 generalizes in an obvious way: let Ψ(L) = Φ(L)⁻¹. Each component of the coefficient matrix sequence {Ψj} will be bounded in absolute value by a geometrically declining sequence.
• The multivariate analogue of an AR(p) is a vector autoregressive process of p-th order (VAR(p)). It is the unique covariance-stationary solution under the stationarity condition (6.3.8) to the following vector stochastic difference equation:
yt − Φ1 yt−1 − · · · − Φp yt−p = c + εt  or  Φ(L)(yt − µ) = εt,   (6.3.9)
where Φ(L) = Ir − Φ1 L − Φ2 L² − · · · − Φp L^p and µ = Φ(1)⁻¹c, with Φp ≠ 0.
• Bivariate example.
• Proposition 6.4 generalizes straightforwardly to the multivariate case. In particular, each element of Γj is bounded in absolute value by a geometrically declining sequence.
6.4 Estimating Autoregressions

Estimation of AR(1)
We want to estimate the AR(1) equation
yt = c + φ yt−1 + εt.   (6.2.1)
Assume that {εt} is independent white noise (an i.i.d. sequence with zero mean and finite variance). Letting xt = (1, yt−1)′ and β = (c, φ)′, the AR(1) equation (6.2.1) can be written as a regression equation
yt = xt′β + εt.   (6.4.1)
We now show that all the conditions of Proposition 2.5 about the asymptotic properties of the OLS estimator of β with conditional homoskedasticity are satisfied here.
• Ergodic stationarity? Proposition 6.1(d).
• Conditional homoskedasticity?
E(εt² | xt) = σ².   (6.4.2)
• Is gt ≡ xt·εt an mds? Yes.
• Rank condition? It is satisfied because the determinant of
E(xt xt′) = [1, µ; µ, γ0 + µ²]   (6.4.4)
is γ0 > 0.

Estimation of AR(p)
yt = c + φ1 yt−1 + · · · + φp yt−p + εt.   (6.2.6)
All the roots of the p-th degree polynomial equation in z
φ(z) = 0 where φ(z) ≡ 1 − φ1 z − φ2 z² − · · · − φp z^p   (6.1.18)
are greater than 1 in absolute value.

Proposition 6.7 (Estimation of AR coefficients): Let {yt} be the AR(p) process following (6.2.6) with the stationarity condition (6.1.18). Suppose further that {εt} is independent white noise. Then the OLS estimator of (c, φ1, φ2, . . . , φp) obtained by regressing yt on the constant and p lagged values of y for the sample period t = 1, 2, . . . , n is consistent and asymptotically normal. Letting β = (c, φ1, φ2, . . . , φp)′, xt = (1, yt−1, . . . , yt−p)′, and β̂ = the OLS estimate of β, we have
Avar(β̂) = σ² [E(xt xt′)]⁻¹,
which is consistently estimated by
Avar̂(β̂) = s² ((1/n) Σ_{t=1}^n xt xt′)⁻¹,
where s² is the OLS estimate of the error variance given by
s² = (1/(n − p − 1)) Σ_{t=1}^n (yt − ĉ − φ̂1 yt−1 − · · · − φ̂p yt−p)².   (6.4.5)
• The only difficult part is to verify the rank condition that E(xt xt′) is nonsingular (Assumption 2.4). This condition is equivalent to requiring the p × p autocovariance matrix Var(yt, . . . , yt−p+1) to be nonsingular (see Review Question 1(b)). (More generally, it can be shown that the autocovariance matrix of a covariance-stationary process is nonsingular for any p if γ0 > 0 and γj → 0 as j → ∞.)

Estimation of VARs
yt − Φ1 yt−1 − · · · − Φp yt−p = c + εt  or  Φ(L)(yt − µ) = εt,
where Φ(L) = Ir − Φ1 L − Φ2 L² − · · · − Φp L^p and µ = Φ(1)⁻¹c.   (6.3.9)
This can be written as
ytm = xt′δm + εtm  (m = 1, 2, . . . , M),   (6.4.9)
where ytm is the m-th element of yt, εtm is the m-th element of εt, and
xt ((Mp+1)×1) ≡ (1, yt−1′, yt−2′, . . . , yt−p′)′,  δm ≡ (cm, φ1m′, φ2m′, . . . , φpm′)′,
cm = m-th element of c,  φjm′ = m-th row of Φj.
So the M equations share the same set of regressors.
• (It can be shown that) equation-by-equation OLS is the efficient estimation procedure.
– The Avar:
Avar(δ̂) = Ω ⊗ [E(xt xt′)]⁻¹,  Ω (M×M) ≡ E(εt εt′).
– A consistent estimate of the Avar:
Avar̂(δ̂) = Ω̂ ⊗ ((1/n) Σ_{t=1}^n xt xt′)⁻¹,   (6.4.10)
where
Ω̂ ≡ (1/(n − Mp − 1)) Σ_{t=1}^n ε̂t ε̂t′,  ε̂t = yt − ĉ − Φ̂1 yt−1 − · · · − Φ̂p yt−p.   (6.4.11)
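A sketch, not in the notes, of the AR(p) case of Proposition 6.7: simulate an assumed AR(2) (c = 0.5, φ = (1.2, −0.35), σ = 1 are illustrative choices), regress yt on a constant and p lags by OLS, and compute the conventional standard errors built from s² as in (6.4.5).

```python
import numpy as np

rng = np.random.default_rng(7)
c, phi, sigma = 0.5, np.array([1.2, -0.35]), 1.0
p, n_burn, n = len(phi), 500, 2_000

# Simulate the AR(p), discarding a burn-in so the path is close to stationary
y = np.zeros(n_burn + n)
for t in range(p, n_burn + n):
    y[t] = c + phi @ y[t - p : t][::-1] + sigma * rng.standard_normal()
y = y[n_burn:]

# Regress y_t on (1, y_{t-1}, ..., y_{t-p})
X = np.column_stack([np.ones(n - p)] + [y[p - j : n - j] for j in range(1, p + 1)])
Y = y[p:]
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
s2 = resid @ resid / (len(Y) - p - 1)                 # degrees-of-freedom adjustment as in (6.4.5)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))    # conventional standard errors
print(beta_hat, se)                                   # estimates close to (0.5, 1.2, -0.35)
```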
6.5 Asymptotics for Sample Means of Serially Correlated Processes
This section studies the asymptotic properties of the sample mean
ȳ ≡ (1/n) Σ_{t=1}^n yt
for serially correlated processes.
Chapter 2 did:
• Chebychev’s LLN: ȳ →m.s. µ if lim E(ȳ) = µ and lim Var(ȳ) = 0.
• Kolmogorov’s LLN: ȳ →a.s. µ if {yt} is i.i.d. and E(yt) = µ.
• Ergodic Theorem (generalization of Kolmogorov): ȳ →a.s. µ if {yt} is ergodic stationary and E(yt) = µ.
• Lindeberg-Levy CLT: √n(ȳ − µ) = (1/√n) Σ_{t=1}^n (yt − µ) →d N(0, σ²) if {yt} is i.i.d. with E(yt) = µ and Var(yt) = σ².
• Billingsley’s CLT (for ergodic stationary mds): √n ȳ = (1/√n) Σ_{t=1}^n yt →d N(0, σ²) if {yt} is an ergodic stationary mds with E(yt²) = σ².
This section does:
• A LLN for covariance-stationary processes.
• A CLT for linear processes.
• Gordin’s CLT (generalization of Billingsley).

LLN for Covariance-Stationary Processes
Proposition 6.8 (LLN for covariance-stationary processes with vanishing autocovariances): Let {yt} be covariance-stationary with mean µ and let {γj} be the autocovariances of {yt}. Then
(a) ȳ →m.s. µ as n → ∞ if lim_{j→∞} γj = 0.
(b) lim_{n→∞} Var(√n ȳ) = Σ_{j=−∞}^∞ γj < ∞ if {γj} is summable.   (6.5.1)
• To prove (a), it is enough to show that Var(ȳ) → 0 as n → ∞, because of Chebychev. Now
Var(ȳ) = (1/n²)[Cov(y1, y1 + · · · + yn) + · · · + Cov(yn, y1 + · · · + yn)]
= (1/n²)[(γ0 + γ1 + · · · + γ(n−2) + γ(n−1)) + (γ1 + γ0 + γ1 + · · · + γ(n−2)) + · · · + (γ(n−1) + γ(n−2) + · · · + γ1 + γ0)]
= (1/n²)[n γ0 + 2(n − 1)γ1 + · · · + 2(n − j)γj + · · · + 2γ(n−1)]
= (1/n)[γ0 + 2 Σ_{j=1}^{n−1} (1 − j/n) γj].
The rest of the proof is Analytical Exercise 9.
• For a covariance-stationary process {yt}, we define the long-run variance to be the limit as n → ∞ of Var(√n ȳ) (if it exists). So (b) says that the long-run variance is Σ_{j=−∞}^∞ γj < ∞.
• Calculate Var(√n ȳ):
Var(√n ȳ) = n·Var(ȳ) = γ0 + 2 Σ_{j=1}^{n−1} (1 − j/n) γj.   (6.5.2)

Two Central Limit Theorems
We present two CLTs to cover serial correlation. The first is a generalization of Lindeberg-Levy.
Proposition 6.9 (CLT for MA(∞)): Let yt = µ + Σ_{j=0}^∞ ψj εt−j, where {εt} is independent white noise and Σ_{j=0}^∞ |ψj| < ∞. Then
√n(ȳ − µ) →d N(0, Σ_{j=−∞}^∞ γj).   (6.5.3)
• We will not prove this result.

Proposition 6.10 (Gordin’s CLT for zero-mean ergodic stationary processes): Suppose {yt} is stationary and ergodic and suppose Gordin’s condition is satisfied. Then E(yt) = 0, the autocovariances {γj} are absolutely summable, and
√n ȳ →d N(0, Σ_{j=−∞}^∞ γj).
• Does not assume MA(∞).
• Generalizes Billingsley.

Gordin’s condition on ergodic stationary processes:
(a) E(yt²) < ∞. (This is a restriction on [strictly] stationary processes because, strictly speaking, a stationary process might not have finite second moments.)
(b) E(yt | yt−j, yt−j−1, . . .) →m.s. 0 as j → ∞.
(c) Σ_{j=0}^∞ [E(rtj²)]^(1/2) < ∞, where rtj ≡ E(yt | It−j) − E(yt | It−j−1) and It ≡ (yt, yt−1, yt−2, . . .).
• (b) implies E(yt) = 0.
• (b) also implies yt = Σ_{j=0}^∞ rtj. This sum is called the telescoping sum.
Proof: yt = yt − [E(yt | It−1) − E(yt | It−1)] − [E(yt | It−2) − E(yt | It−2)] − · · · − [E(yt | It−j) − E(yt | It−j)]
= [yt − E(yt | It−1)] + [E(yt | It−1) − E(yt | It−2)] + · · · + [E(yt | It−j+1) − E(yt | It−j)] + E(yt | It−j)
= (rt0 + rt1 + · · · + rt,j−1) + E(yt | yt−j, yt−j−1, . . .).
• The telescoping sum indicates how the “shocks” represented by (rt0, rt1, . . .) influence the current value of yt. Condition (c) says, roughly speaking, that shocks that occurred a long time ago do not have a disproportionately large influence. As such, the condition restricts the extent of serial correlation in {yt}.
• Calculate the telescoping sum for AR(1).
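A small check, not in the notes, of the long-run variance formulas (6.5.1)–(6.5.2) for an assumed AR(1) (φ = 0.6, σ = 1): across simulations, n·Var(ȳ) should be close to Σ_{j=−∞}^∞ γj = σ²/(1 − φ)².

```python
import numpy as np

rng = np.random.default_rng(8)
phi, sigma, n, reps = 0.6, 1.0, 2_000, 1_000

def ar1_mean(n):
    y = 0.0
    for _ in range(200):                       # burn-in
        y = phi * y + sigma * rng.standard_normal()
    s = 0.0
    for _ in range(n):
        y = phi * y + sigma * rng.standard_normal()
        s += y
    return s / n

means = np.array([ar1_mean(n) for _ in range(reps)])
lrv_theory = sigma**2 / (1 - phi)**2           # sum of all autocovariances for AR(1)
print(round(n * means.var(), 2), round(lrv_theory, 2))   # both near 6.25
```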
6.6 Incorporating Serial Correlation to OLS
yt = xt′β + εt   (t = 1, 2, . . . , n).
Assumptions:
• (ergodic stationarity) {yt, xt} is jointly stationary and ergodic.
• (orthogonality conditions/predetermined regressors) E[xt·(yt − xt′β)] = 0, or equivalently E(gt) = 0, gt ≡ xt·εt.
• (rank condition) The K × K matrix E(xt xt′) is nonsingular (and hence finite).
• (mds) {gt} is a martingale difference sequence.
Recall: we used S for Avar(ḡ) (i.e., the variance of the limiting distribution of √n ḡ, where ḡ ≡ (1/n) Σ_{t=1}^n gt). By ergodic stationarity, the mds assumption, and the Billingsley CLT, S = E(gt gt′).
In this section, we replace mds by
(Gordin’s condition restricting the degree of serial correlation) {gt} satisfies Gordin’s condition. Its long-run covariance matrix is nonsingular.
• By ergodic stationarity, Gordin’s condition, and (the multivariate extension of) Proposition 6.10 (Gordin’s CLT), √n ḡ converges to a normal distribution. S (≡ Avar(ḡ)) equals the long-run covariance matrix,
S = Σ_{j=−∞}^∞ Γj = Γ0 + Σ_{j=1}^∞ (Γj + Γj′),   (6.6.1)
where Γj is the j-th order autocovariance matrix
Γj = E(gt gt−j′)  (j = 0, ±1, ±2, . . .).   (6.6.2)
• The OLS estimator remains consistent and asymptotically normal, with exactly the same expressions for the Avar and the estimated Avar:
Avar(b) = Σxx⁻¹ S Σxx⁻¹,  Avar̂(b) = Sxx⁻¹ Ŝ Sxx⁻¹.
This Avar̂(b) is called the HAC (heteroskedasticity and autocorrelation consistent) covariance estimator.

Estimating S When Autocovariances Vanish after Finite Lags
• Estimate Γj by
Γ̂j ≡ (1/n) Σ_{t=j+1}^n ĝt ĝt−j′  (j = 0, 1, . . . , n − 1),   (6.6.3)
where ĝt ≡ xt·et, et ≡ yt − xt′b.
This would be consistent under an appropriate 4th-moment assumption (which we will not bother to state).
• If we know a priori that Γj = 0 for j > q, where q is known and finite, then clearly S can be consistently estimated by
Ŝ = Γ̂0 + Σ_{j=1}^q (Γ̂j + Γ̂j′) = Σ_{j=−q}^q Γ̂j  (recall: Γ̂−j = Γ̂j′).   (6.6.4)
• What to do if we don’t know q (which may or may not be infinite)?

Using Kernels to Estimate S
A class of estimators, called the kernel-based (or “nonparametric”) estimators, can be expressed as a weighted average of estimated autocovariances:
Ŝ = Σ_{j=−n+1}^{n−1} k(j/q(n))·Γ̂j = k(0)·Γ̂0 + Σ_{j=1}^{n−1} k(j/q(n))·(Γ̂j + Γ̂j′).   (6.6.5)
Here, the function k(·) is called a kernel and q(n) is called the bandwidth. The bandwidth can depend on the sample size n.
• Truncated kernel. The estimator (6.6.4) above is a special kernel-based estimator with q(n) = q and
k(x) = 1 for |x| ≤ 1, 0 for |x| > 1.   (6.6.6)
We could use the truncated kernel with q(n) increasing with the sample size. Problem: it is not guaranteed to be positive semidefinite in finite samples.
• “Newey-West”. The Bartlett kernel:
k(x) = 1 − |x| for |x| ≤ 1, 0 for |x| > 1.   (6.6.7)
6.8 Application: Forward Exchange Rates as Optimal Predictors

Bekaert-Hodrick Data
• We test the efficiency of FOREX using the same weekly data set used by Bekaert and Hodrick (1993). The data cover the period of 1975–1989 and include the following variables:
St = spot exchange rate, stated in units of the foreign currency per dollar (e.g., 125 yen per dollar) on the Friday of week t,
Ft = 30-day forward exchange rate on Friday of week t, namely, the price in foreign currency units of the dollar deliverable 30 days from Friday of week t,
S30t = the spot rate on the delivery date of a 30-day forward contract made on Friday of week t.
• The maturity of the contract covers several sampling intervals; the delivery date is after the Friday of week t + 4 but before the Friday of week t + 5.

The Market Efficiency Hypothesis
Covered Interest Parity + RE + log approximation (that log(1 + x) ≈ x) ⇒
E(s30t | It) = ft  or  E(εt | It) = 0, where εt ≡ s30t − ft.   (6.8.3)

Testing Whether the Unconditional Mean Is Zero
• Taking the unconditional expectation of both sides of (6.8.3), we obtain the unconditional relationship:
E(s30t) = E(ft)  or  E(εt) = 0, where εt ≡ s30t − ft.   (6.8.4)
• Is {εt} stationary? Figure 6.1.
• Testing is not so trivial because {εt} is serially correlated. However,
E(εt | εt−5, εt−6, . . .) = 0.   (6.8.5)
It follows that Cov(εt, εt−j) = 0 for j ≥ 5 but not necessarily for j ≤ 4. See Figure 6.2.
• Check the conditions of Gordin’s CLT (Proposition 6.10). By this CLT,
√n ε̄ / √(γ0 + 2 Σ_{j=1}^4 γj) →d N(0, 1).   (6.8.6)
Furthermore, (assuming an appropriate 4th-moment condition) the denominator is consistently estimated by replacing γj by its sample counterpart γ̂j. So by Slutzky’s theorem (Lemma 2.4),
√n ε̄ / √(γ̂0 + 2 Σ_{j=1}^4 γ̂j) = ε̄ / √((γ̂0 + 2 Σ_{j=1}^4 γ̂j)/n)   (6.8.7)
is also asymptotically standard normal. The denominator of this expression will be called the standard error of the sample mean.
• Table 6.1. In no case is there significant evidence against the null hypothesis of zero mean.

Table 6.1: Means of Rates of Change, Forward Premiums, and Differences between the Two: 1975–1989

                        Means and Standard Deviations
 Exchange   s30 − s              f − s                Difference
 rate       (actual rate         (expected rate       (unexpected rate
            of change)           of change)           of change)
 ¥/$        −4.98 (41.6)         −3.74 (3.65)         −1.25 (42.4)   standard error = 3.56
 DM/$       −1.78 (40.6)         −3.91 (2.17)          2.13 (41.1)   standard error = 3.26
 £/$         3.59 (39.2)          2.16 (3.49)          1.43 (39.9)   standard error = 3.29

Note: s30 − s is the rate of change over 30 days and f − s is the forward premium, expressed as annualized percentage changes. Standard deviations are in parentheses. The standard errors are the values of the denominator of (6.8.7). The data are weekly, from 1975 to 1989. The sample size is 778.

Regression Tests
• Another way to test (6.8.3) (that E(s30t | It) = ft). Analogy from Fama: look for H0 and (yt, xt) such that under H0 Assumptions 3.1–3.4 and 3.5 are satisfied. Run the following regression:
s30t = β0 + β1 ft + εt.   (6.8.8)
ft is orthogonal to εt. Test whether β0 = 0 and β1 = 1.
• Problem: s30t and ft don’t seem stationary. Figure 6.3. If s30t − ft is stationary but s30t and ft have “unit roots”, then the OLS estimate of β1 approaches 1 even if ft is not orthogonal to εt. See Figure 6.4.
• So rewrite (6.8.3) as E(s30t − st | It) = ft − st and run the associated regression:
s30t − st = β0 + β1(ft − st) + εt.   (6.8.9)
• Set H0: β0 = 0, β1 = 1. Check the assumptions under H0.
– (orthogonality conditions): The error term εt equals the forecast error s30t − ft, and so the orthogonality conditions E(εt) = 0 and E[(ft − st)·εt] = 0 are satisfied.
– (rank condition): The second moment matrix of (1, ft − st) must be nonsingular, which is true if and only if Var(ft − st) > 0.
– (Gordin’s condition): there is not too much serial correlation in
gt ≡ xt·εt = [εt; (ft − st)·εt].
Since It includes {εt−5, εt−6, . . .}, the forecast error satisfies (6.8.5). {gt} inherits the same property, namely,
E(gt | gt−5, gt−6, . . .) = 0.   (6.8.10)
• Since E(gt gt−j′) = 0 for j ≥ 5, the long-run covariance matrix S can be estimated as in (6.6.4) with q = 4.
• The consistent estimator of its asymptotic variance is given by Avar(b) = S
where S is given in (6.6.4) with q = 4. Table 6.2.
7
1
−
xx
S S 1, −
xx
(6.8.11)
Table 6.2: Regression Tests of Market Efficiency: 1975–1989 s30t Currency = Y/$
£/$
t
0
1
Regression Coefficients Forward Constant premium
t
t
t
R2
Wald Statistic for H 0: β 0 = 0, β 1 = 1
−12.8
−2.10
0.034
18.6 ( p = 0.009%)
−13.6
0.026
(5.72)
−3.01
8.7 ( p = 1.312%)
7.96 (3.54)
−2.02
0.033
12.9 ( p = 0.156%)
(4.01)
DM/$
− s = β + β (f − s ) + ε
(0.738) (1.37) (0.85)
Note:
Standard errors are in parentheses. They are heteroskedasticity-robust and allow for serial correlation up to four lags calculated by the formula (6.8.11). The sample size is 778. Our results differ slightly from those in ? because they used the Bartlett kernel-based estimator with q (n) = 4 to estimate S.
8
Figure 6.1: Forecast Error, Yen/Dollar
Figure 6.2: Correlogram of s30
9
− f , Yen/Dollar
Figure 6.3: Yen/Dollar Spot Rate, Jan. 1975–Dec. 1989
Figure 6.4: Plot of s30 against f , Yen/Dollar
10
Figure 6.5: Plot of s30
− s against f − s, Yen/Dollar
11
GRIPS, Spring 2016
Unit-Root Econometrics
1. Modeling Trends
• Most time series have “trends” – log U.S. GDP: linear deterministic trend. Figure 9.1. – exchange rates: stochastic trend (i.e., a permanent shock). Figure 6.3.
• Definition: A stochastic process {ξ } is:
I(0) if ξ t = δ + u t , ut zero-mean stationary with positive LR variance; I(1) if its first difference is I(0): ∆ξ t = δ + ut . t
– Why LR variance > 0? Consider vt
≡ ε − ε t
t−1
• An I(1) process ξ
t
– can be written as. ξ t = ξ 0 + δ t + ( u1 + u2 +
·
·· · + u ), t
where ξ 0 is the initial value (can be random)
– can have both a linear trend and a stochastic trend. – δ = 0: a driftless I(1) process. δ = 0: an I(1) process with drift .
̸
1
(9.1.4)
2.
Tools for Unit-Root Econometrics
Linear I(0) Processes
• The version of FCLT dictates the class of I(0) processes. • Definition: A linear I(0) process can be written as a constant plus a zero-mean linear process {u } such that (9.2.1) u = ψ (L)ε , ψ (L) ≡ ψ + ψ L + ψ L + ·· · for t = 0, ±1, ±2, . . .. {ε } is independent white noise (i.i.d. with mean 0 and E(ε ) ≡ σ > 0), (9.2.2) j |ψ | < ∞ (9.2.3a) (9.2.3b) ̸0 ψ (1) = – one-summability ⇒ absolute summability. So γ abs. summable (Proposition 6.1(c)). t
2
t
t
0
1
2
2
t
2
t
� ∞
j
j =0
j
– ut ergodic as well as stationary (Proposition 6.1(d)).
– LR variance equals σ 2 [ψ (1)]2 by Proposition 6.6 and 6.8. So it is positive.
• Example: ARIMA. Let u be (stationary) ARMA( p,q ): t
φ(L)ut = θ (L)εt
(9.2.10)
– Verify (9.2.1)–(9.2.3) assuming εt is independent w.n. and θ(1) = 0.
̸
– Let ∆ξ t = u t . Then φ (L)ξ t = θ (L)εt , ∗
with φ (L) = φ (L)(1 ∗
− L). One of the roots of φ (z ) = 0 is unity. ∗
2
(9.2.12)
Approximating I(1) by a Random Walk
• Propositon 9.1 (Beveridge-Nelson decomposition): Let {u } be a zero-mean I(0) process t
satisfying (9.2.1)–(9.2.3). Then u1 + u2 +
where ηt
≡ α(L)ε , α = −(ψ t
j
· ·· + u
j +1
t
= ψ (1)(ε1 + ε2 +
+ ψ j+2 +
·· · + ε ) + η − η , t
t
0
·· · ). {η } is zero-mean ergodic stationary. t
– The key is to show ψ (L) = ψ (1) + ∆α(L).
The Wiener Process
• Definition: Standard Wiener processes. A standard Wiener (Brownian motion) process W (·), t ∈ [0, 1] is such that (1) W (0) = 0; (2) for any dates 0
·· · < t ≤ 1, the changes W (t ) − W (t ), W (t ) − W (t ), . . . , W ( t ) − W (t ) are independent multivariate normal with W (s) − W (t) ∼ N (0, (s − t)) (so in particular W (1) ∼ N (0, 1)); ≤t
1
< t2 < 2
k
1
3
2
k
k−1
(3) for any realization, W (t) is continuous in t with probability 1.
– If the variance of the instantaneous change is σ 2, not unity, the Wiener process is σW (r).
• Explain the FCLT. Figure 9.2. (t = 0, 1, . . . , T − 1) ↔ W (r) ≡ • Demeaned Wiener Processes. ξ ≡ ξ − W (r ) − W (s) ds. 1) ↔ W (r) ≡ • Detrended Wiener Processes. ξ ≡ ξ − α − δ · t (t = 0, 1, . . . , T − W (r ) − a − d · r, a ≡ (4 − 6s)W (s) ds, d ≡ (−6 + 12s)W (s) ds. µ t
∫
ξ0 +ξ1 +···+ξT −1 T
t
1
µ
0
∫ 1
τ t
t
0
∫ 1
0
3
τ
The Lemma Let ξ t be driftless I(1). Let λ2
• Proposition 9.2:
{ }
Var(∆ξ t ). Then (a)
(b)
(c)
(d)
(e)
(f)
1
T
� � � � � �
2
T
1
2
(ξ t 1 ) −
2
→λ d
t=1
T
1 T
1
∆ξ t ξ t
T
1 T
1
T 2
1 T
→
1
−
t=1
λ2
2
d
W (1)2
� · � � · �
− γ 2 . 0
1
(
µ ξ t−1 2
)
t=1
2
→λ d
T
µ ξ t ξ t−1
∆
→
t=1
λ2
2
d
[W µ (r)]2 dr,
0
T
[W µ (1)]2
�
− [W (0)] − γ 2 . µ
2
0
1
(
ξ tτ −1 2
)
t=1
2
→λ d
T
ξ t ξ tτ −1
∆
t=1
W (r )2 dr ,
0
T
2
� ·
≡ long-run variance of {∆ξ }
→ d
[W τ (r)]2 dr,
0
λ2
2
[W τ (1)]2
�
− [W (0)] − γ 2 . τ
2
– The convergence is joint . – In (a), why deflation by T 2 ? – Why λ2 rather than γ 0 ? BN!
4
0
t
and γ 0
≡
Figure 9.1: Log U.S. Real GDP, Value for 1869 set to 0
Figure 6.3: Yen/$ Spot Rate, Jan 75 - Dec. 89
5
Figure 9.2: FCLT in Action
√ Generate a random walk of length T , deflate by T , and then horizontally compress [0, 1, 2,...,T ] to [0, 1].
6
3.
Dickey-Fuller Tests
The Null is Random Walk without Drift. DF ρ , DF t Tests
• Want to test whether {y } is a r. w. without drift: t
∆yt = ε t, so yt = y 0 + ε1 + ε2 +
··· + ε . – The model parameters are: y (could be random) and γ ≡ Var(∆y ). t
0
t
0
– The test statistic, based on the sample (y0 , y1 ,...,yT ), should have a known distribution but should not depend on y0 and γ 0 .
• The most popular test is “DF” tests. It derives the test statistic from an AR(1) equation. – DF ρ , DF t : from yt = ρy t 1 + εt . −
µ t
– DF µρ , DF : from yt = α + ρyt 1 + εt . ∗
−
• DF , DF Tests. ρ
t
– ρ : OLS estimate from yt = ρy t 1 + εt . So
−
∑ − ≡ ∑ √ ∑ −
T t=1 yt yt−1 T 2 t=1 yt−1
1
ρ
ρ
t =
s
(
T t=1 T t=1 T t=1
−1 =
1
1
T t=1
÷
)
∑ ∑ ∑ √
T
=
(yt 1 )
2
s
−
·
1
T 2
∆yt yt
1
−
(9.3.2)
(yt 1 )2 −
∆yt yt
1
−
∑
T t=1
(9.3.7)
(yt 1 )
2
−
– Use Proposition 9.2(a),(b) to derive the large-sample distribution of T (ρ 1
T (ρ
· − 1) =
T
∑ ∑
1
2
T
T t=1 T t=1
∆yt yt
1
−
t
→
2
d
1 2
1
γ 0
0
d
1
−
γ 0 2
W (r)2 dr
(W (1)2 0
This is Proposition 9.3.
W (1)2
· ∫ → √ ∫ −
(yt 1 )2 −
γ 0
1)
1 2
=
(W (1)2
∫ 1
0
≡ DF
W (r ) dr
1), t.
− 1) ≡ DF
ρ
W (r)2 dr
t
2
−
(9.3.6) (9.3.9)
Proposition 9.3: (Dickey-Fuller tests of a driftless random walk, the case without intercept): Suppose that yt is a driftless random walk (so ∆yt is independent white noise) with E(y02) < . Consider the regression of yt on yt 1 (without intercept) for t = 1, 2, . . . , T . Then
∞
{ }
{ }
−
· − 1) → DF , DF t statistic: t → DF ,
DF ρ statistic: T (ρ
d
d
ρ
t
where ρ is the OLS estimate of the yt 1 coefficient, t is the t-value for the hypothesis that the yt 1 coefficient is 1, and DF ρ and DF t are the random variables defined in (9.3.6) and (9.3.9). −
−
7
• Comments on DF , DF . ρ
t
– Finite-sample distribution of ρ with y0 = 0 and εt
– Comment on Table 9.1(a), Table 9.2(a).
∼ N (0, γ ). Fig. 9.3.
(a) Finite-sample distribution assumes y0 = 0 and εt invariant to γ 0. (b) What about the large sample distribution? (c) Test is one-sided b/c the alternative is I(0).
8
0
∼ N (0, γ ). 0
The distribution is
The Null is Random Walk without Drift. DF µρ , DF µt Tests µ t
µ ρ
• DF , DF Tests. – ρµ : OLS estimate from yt = α + ρyt 1 + εt . The AR equation can be derived from
∗
−
yt = α + z t , z t = ρz t 1 + εt ,
{ε } independent white noise α = (1 − ρ)α. with ε ∼ N (0, γ ). Fig. 9.3. t
−
∗
– Finite-sample distribution of ρµ
t
(9.3.10) (9.3.12)
0
– Comment on Table 9.1(b), Table 9.2(b).
(a) Finite-sample distribution assumes εt N (0, γ 0 ). The distribution is invariant not only to γ 0 but also to y0 (or its distribution). (b) Test is one-sided b/c the alternative is I(0).
∼
– ρ = 1 and α = 0 under the null. But but this joint null is rarely tested. ∗
Proposition 9.4 (Dickey-Fuller tests of a driftless random walk, the case with intercept): Suppose that yt is a driftless random walk (so ∆yt is independent white noise) with E(y02 ) < . Consider the regression of yt on (1, yt 1 ) for t = 1, 2, . . . , T . Then
{ }
∞
{ }
−
DF ρµ statistic: T (ρµ
·
µ
− µ
DF t statistic: t
µ ρ
→ DF ,
1)
d
µ t
→ DF , d
where ρµ is the OLS estimate of the y t 1 coefficient, t µ is the t -value for the hypothesis that the µ yt 1 coefficient is 1, and DF ρµ and DF t are the two random variables defined in (9.3.13) and (9.3.14). −
−
DF µρ
≡
µ t
DF
≡
1 2
� − ∫ � − √ ∫ [W µ (1)]2 1
[W µ (r)]2 dr 0
1 2
[W µ (1)]2 1
0
.
[W µ (0)]2
[
[W µ (0)]2
W µ
(r)] dr 2
� − � −1 1
,
(9.3.13)
.
(9.3.14)
Example 9.1: the log yen/$ exchange rate yt = 0.162 + 0.9983376 yt (0.435) (0.0019285)
1
−
T (ρµ
·
, R2 = 0.997, SER = 2.824.
1) = 777 (0.9983376 1) = 0.9983376 1 tµ = = 0.86. 0.0019285
−
·
9
−
− −
−1.29,
The Null is Random Walk with or without Drift. DF τ ρ, DF τ t Tests τ ρ
τ t
• DF , DF Tests. The null is a random walk possibly with drift: y
t
= y 0 + δ t + ( ε1 +
δ may or may not be zero.
· ·· + ε ). t
– ρτ : OLS estimate from y t = α + δ t + ρyt 1 + εt . Under the null, α = δ and δ = 0. We include δ to allow trend stationary processes against the alternatives. The AR equation can be derived from
∗
∗
∗
∗
−
∗
yt = α + δ t + z t , z t = ρz t 1 + εt ,
{ε } independent white noise α = (1 − ρ)α + ρδ, δ = (1 − ρ)δ – Finite-sample distribution of ρ with δ = 0 and ε ∼ N (0, γ ). Fig. 9.3. ·
t
−
∗
∗
τ
t
(9.3.15) (9.3.18)
0
– Comment on Table 9.1(c), Table 9.2(c).
(a) Finite-sample distribution assumes εt N (0, γ 0 ). The distribution is invariant not only to γ 0 and δ but also to y0 (or its distribution). Prob(ρτ < 1) > 0 .99! (b) Test is one-sided b/c the alternative is I(0).
∼
– ρ = 1 and α = 0 under the null. But but this joint null is rarely tested. ∗
Proposition 9.5 (Dickey-Fuller tests of a random walk with or without drift): Suppose that yt is a random walk with or without drift with E(y02 ) < . Consider the regression of yt on (1, t , yt 1) for t = 1, 2, . . . , T . Then
{ }
∞
−
DF ρτ statistic: T (ρτ
·
τ
− τ
DF t statistic: t
1)
τ ρ
→ DF , d
τ t
→ DF , d
where ρτ is the OLS estimate of the y t 1 coefficient, t τ is the t -value for the hypothesis that the yt 1 coefficient is 1, and DF ρτ and DF tτ are the two random variables defined in (9.3.19) and (9.3.20). −
−
DF ρτ τ t
DF
≡ ≡
1 2
� − ∫ � − √ ∫ [W τ (1)]2
[W τ (0)]2
1
[W τ (r)]2 dr 0
1 2
[W τ (1)]2 1
0
[
[W τ (0)]2
W τ
10
(r)] dr 2
� − � −1 1
,
(9.3.19)
.
(9.3.20)
Figure 9.3: Finite Sample Distribution of OLS Estimate of AR(1) Coefficient, T = 100
11
Table 9.1: Critical Values for the Dickey-Fuller ρ Test Sample size (T ) 25 50 100 250 500
∞ 25 50 100 250 500
∞ 25 50 100 250 500
∞
Probability that the statistic is less than entry 0.01 0.025 0.05 0.10 0.90 0.95 0.975 0.99 Panel (a): 9.3 7.3 9.9 7.7 10.2 7.9 10.4 8.0 10.4 8.0 10.5 8.1
T (ρ 5.3 5.5 5.6 5.7 5.7 5.7
1.01 0.97 0.95 0.94 0.93 0.93
1.41 1.34 1.31 1.29 1.28 1.28
1.78 1.69 1.65 1.62 1.61 1.60
2.28 2.16 2.09 2.05 2.04 2.03
Panel (b): 14.6 12.5 15.7 13.3 16.3 13.7 16.7 13.9 16.8 14.0 16.9 14.1
T (ρµ 10.2 10.7 11.0 11.1 11.2 11.3
1) 0.76 0.81 0.83 0.84 0.85 0.85
0.00 0.07 0.11 0.13 0.14 0.14
0.65 0.53 0.47 0.44 0.42 0.41
1.39 1.22 1.14 1.08 1.07 1.05
−2.51 −2.60 −2.63 −2.65 −2.66 −2.67
−1.53 −1.67 −1.74 −1.79 −1.80 −1.81
−0.46 −0.67 −0.76 −0.83 −0.86 −0.88
−11.8 −12.8 −13.3 −13.6 −13.7 −13.8
− − − − − −
− − − − − −
− − − − − −
−17.2 −18.9 −19.8 −20.3 −20.5 −20.7
− − − − − −
− − − − − −
− − − − − −
−22.5 −25.8 −27.4 −28.5 −28.9 −29.4
· − 1)
·
−−
− − − − − Panel (c): T · (ρ − 1) −20.0 −17.9 −15.6 −3.65 −22.4 −19.7 −16.8 −3.71 −23.7 −20.6 −17.5 −3.74 −24.4 −21.3 −17.9 −3.76 −24.7 −21.5 −18.1 −3.76 −25.0 −21.7 −18.3 −3.77
12
τ
− − − − −
Table 9.2: Critical Values for the Dickey-Fuller t-Test Sample size (T ) 25 50 100 250 500
∞ 25 50 100 250 500
∞ 25 50 100 250 500
∞
Probability that the statistic is less than entry 0.01 0.025 0.05 0.10 0.90 0.95 0.975 0.99 Panel (a): t 1.95 1.60 0.92 1.95 1.61 0.91 1.95 1.61 0.90 1.95 1.62 0.89 1.95 1.62 0.89 1.95 1.62 0.89
1.33 1.31 1.29 1.28 1.28 1.28
1.70 1.66 1.64 1.63 1.62 1.62
2.15 2.08 2.04 2.02 2.01 2.01
Panel (b): tµ 2.99 2.64 2.93 2.60 2.90 2.59 2.88 2.58 2.87 2.57 2.86 2.57
−2.65 −2.62 −2.60 −2.58 −2.58 −2.58
−2.26 −2.25 −2.24 −2.24 −2.23 −2.23
− − − − − −
− − − − − −
−3.75 −3.59 −3.50 −3.45 −3.44 −3.42
−3.33 −3.23 −3.17 −3.14 −3.13 −3.12
− − − − − −
− − − − − −
−0.37 −0.41 −0.42 −0.42 −0.44 −0.44
− − − − −
0.00 0.04 0.06 0.07 0.07 0.08
0.34 0.28 0.26 0.24 0.24 0.23
0.71 0.66 0.63 0.62 0.61 0.60
−4.38 −4.16 −4.05 −3.98 −3.97 −3.96
−3.95 −3.80 −3.73 −3.69 −3.67 −3.67
− − − − − −
− − − − − −
−1.14 −1.19 −1.22 −1.23 −1.24 −1.25
−0.81 −0.87 −0.90 −0.92 −0.93 −0.94
−0.50 −0.58 −0.62 −0.64 −0.65 −0.66
−0.15 −0.24 −0.28 −0.31 −0.32 −0.32
Panel (c): tτ 3.60 3.24 3.50 3.18 3.45 3.15 3.42 3.13 3.42 3.13 3.41 3.13
13
GRIPS, Spring 2015
Unit-Root Econometrics Preview • The I(1) processes considered in Section 9.3 and 9.4: ∆yt = δ + ut , {ut } ∼ zero-mean I(0). Proposition 9.3, 9.4: δ = 0, {ut } independent white noise (y0 = 0 for finite-sample distributions), Proposition 9.5: {ut } independent white noise,
• Section 4: Proposition 9.6, 9.7: δ = 0, { ut } is AR( p), p known Proposition 9.8: {ut } is AR( p), p known, Said-Dickey extension of Proposition 9.6, 9.7: δ = 0, {ut } is invertible ARMA( p, q ), p, q unknown, Said-Dickey extension of Proposition 9.8: {ut } is invertible ARMA( p, q ), p, q unknown.
• Section 6: Application to PPP
1
4.
Augmented Dickey-Fuller Tests
Proposition 9.6, 9.7 • The null is ∆yt = ζ 1 ∆yt 1 + ζ 2 ∆yt 2 + · · · + ζ p ∆yt p + εt , E(ε2t ) ≡ σ2 . −
−
−
(9.4.1)
Proposition 9.6 (Augmented Dickey-Fuller test of a unit root without intercept): Suppose that {yt } is ARIMA( p, 1, 0) so that {∆yt } is a zero-mean stationary AR( p) process with iid errors (9.4.1). Let (ρ, ζ 1 , ζ 2 , . . . , ζ p ) be the OLS coefficient estimates from the augmented autoregression (9.4.3) yt = ρy t 1 + ζ 1 ∆yt 1 + ζ 2 ∆yt 2 + · · · + ζ p ∆yt p + εt . Then
−
−
ADF ρ statistic:
−
−
T · (ρ − 1)
1 − ζ1 − ζ2 − · · · − ζ p
→ DF ρ d
ADF t statistic: t → DF t .
(9.4.25)
d
(9.4.26)
• Allow {yt } to differ by an unspecified constant by replacing the yt in (9.4.3) by yt − α, which yields (9.4.33) yt = α + ρyt 1 + ζ 1 ∆yt 1 + ζ 2 ∆yt 2 + · · · + ζ p ∆yt p + εt , ∗
−
−
−
−
where α = (1 − ρ)α. ∗
– As in the AR(1) model with intercept, the I(1) null is the joint hypothesis that ρ = 1 and α = 0. Nevertheless, unit-root tests usually focus on the single restriction ρ = 1. ∗
Proposition 9.7 (Augmented Dickey-Fuller test of a unit root with intercept): Suppose that {yt } is an ARIMA( p, 1, 0) process so that {∆yt } is a zero-mean stationary AR( p) process following (9.4.1). Consider estimating the augmented autoregression with intercept, (9.4.33), and let (α, ρµ , ζ1 , ζ 2 , . . . , ζ p ) be the OLS coefficient estimates. Also let t µ be the t-value for the hypothesis that ρ = 1. Then
T · (ρµ − 1)
µ
ADF ρ statistic:
1 − ζ1 − ζ 2 − · · · − ζ p ADF t µ statistic: tµ → DF tµ . d
→ DF ρµ , d
(9.4.34) (9.4.35)
• Furthermore: can do usual t and F tests on the coefficients of zero-mean I(0) coefficients (ζ 1 ,...,ζ p ).
2
Incorporating Time Trend • The equation to estimate: yt = α + δ · t + ρyt 1 + ζ 1 ∆yt 1 + ζ 2 ∆yt 2 + · · · + ζ p ∆yt p + εt , ∗
∗
−
−
−
−
(9.4.36)
where α = (1 − ρ)α + (ρ − ζ 1 − ζ 2 − · · · − ζ p )δ, δ = (1 − ρ)δ. ∗
∗
(9.4.37)
Proposition 9.8 (Augmented Dickey-Fuller Test of a unit root with linear time trend): Suppose that {yt } is the sum of a linear time trend and an ARIMA( p, 1, 0) process so that {∆yt } is a stationary AR( p) process whose mean may or may not be zero. Consider estimating the augmented autoregression with trend, (9.4.36), and let ( α, δ, ρτ , ζ 1 , ζ 2 , . . . , ζ p ) be the OLS coefficient estimates. Also let t τ be the t-value for the hypothesis that ρ = 1. Then
T · (ρτ − 1)
τ
ADF ρ statistic:
1 − ζ 1 − ζ2 − · · · − ζ p ADF tτ statistic: tτ → DF tτ . d
3
→ DF ρτ , d
(1) (2)
What if ∆yt is not AR( p)? • Said-Dickey-Ng-Perron extension: Suppose ∆yt is zero- mean invertible ARMA( p, q ) with p, q unknown. Let p max (T ) satisfy pmax (T ) → 0 as T → ∞ T 1/3 Pick p by AIC or BIC. Proposition 9.6 goes through. pmax (T ) → ∞ but
(9.4.28)
)
(9.4.31)
A suggested procedure. 1. Set p max (T ) as
T pmax (T ) = 12 · 100
1/4
(integer part of 12 ·
T 100
1/4
2. Pick p that minimizes (with C (x) = 2 if Akaike, C (x) = log(x) if BIC)
log
SSR j
T − pmax (T ) − 1
+ ( j + 1) ×
C (T − pmax (T ) − 1) T − pmax (T ) − 1
(9.4.32)
on the fixed sample t = p max (T ) + 2,...,T (when the original sample is t = 1, 2,...,T ).
• Proposition 9.7 and 9.8 go thru.
4
Example 9.3 log US GDP, 1947:Q1–1998:Q1 (205 obs). Include intercept and trend. The suggested procedure yields p max (T ) = 14, p = 1 by both AIC and BIC.
yt = 0.22 + 0.00022 · t + 0.9707 yt (0.10) (0.00011) (0.0137) R2 = 0.999,
1
−
+ 0.348∆yt 1 , (0.066) −
sample size = 203.
From this estimated regression, we can calculate 0.9707388 − 1 = −9.11, 1 − 0.348 0.9707388 − 1 = −2.13. tτ = 0.0137104
ADF ρ τ = 203 ×
log US GDP, 1869-1997 (129 obs). Include intercept and trend. The suggested procedure yields pmax (T ) = 12, p = 1 by both AIC and BIC.
yt = 0.67 + 0.0048 · t + 0.85886 yt (0.18) (0.0014) (0.03989)
1
−
R2 = 0.999,
+ 0.32 ∆yt 1 , (0.086)
sample size = 127.
The ADF ρτ and t τ statistics are −26.2 and − 3.53, respectively.
5
−
6.
PPP • Various versions of PPP: – Law of one price — for each good, the price, converted to a common currency, is the same across countries. – “Absolute” PPP — P t = S t × P t or p t = s t + pt , where P t and P t are price indices . ∗
∗
∗
– “Relative” PPP — it holds in first differences. – Weak version of PPP — p t − st − pt (log real exchange rate) is I(0). ∗
• Very difficult to reject the random walk null. Lothian-Taylor (1996). Uses two centuries of evidence for dollar/pound (1791-1990). Do the ADF test as described in Section 4. pmax (T ) = 14, BIC picks p = 0. So reduces to DF test.
z t = 0.179 + 0.8869 z t 1 , SER = 0.071, R2 = 0.790, t = 1792, . . . , 1990. (0.052) (0.0326) −
The DF t-statistic (tµ ) is −3.5 (= (0.8869 − 1)/0.0326), which is significant at 1 percent. Thus, we can indeed reject the hypothesis that the sterling/dollar real exchange rate is a random walk.
Figure 9.4: Dollar/Pound Log Real Exchange Rate, 1791-1990; 1914 value set to zero
6
GRIPS, Spring 2016
Cointegration
1.
Cointegrated Systems
Linear Vector I(0) and I(1) Processes • Recall: for the univariate (scalar) case, – A zero-mean univariate I(0) process:
ut = ψ (L)εt , εt iid, E(εt ) = 0, E(ε2t ) = σ 2 > 0 , {ψ j } one-summable, ψ (1) = 0.
So the LR (long-run) variance is σ 2 ψ (1)2 . – Define I(1) as: ∆yt = δ + ut .
• A zero-mean vector I(0) process: ut = Ψ (L) εt , εt iid, E(εt ) = 0 , E(εt εt ) = Ω , Ω pd, { Ψ j } one-summable.
(n×1)
(n×n)
(n×n) (n×1)
(*)
So the LR variance is Ψ(1)ΩΨ(1) .
Ψ(1) =
0 .
(n×n)
• A vector I(1) process (system): ∆yt = δ + ut .
1
(**)
BN Decomposition Decomposition • The BN decomposition generalizes in the obvious way: yt = δ · t + Ψ(1)(ε1 + ε2 + · · · + εt ) + η t + ( y0 − η 0 )
(10.1.16)
where η t ≡ α (L)ut and α j ≡ −(Ψ j +1 + Ψ j +2 + · · · ).
(10.1.15)
Cointegration Defined • Multiply both sides by a to obtain
a yt = a δ · t + a Ψ(1)(ε1 + ε2 + · · · + εt ) + a η t + a (y0 − η 0 ).
(10.1.19)
• Definition: Let yt be an n-dimensional I(1) system whose associated zero-mean I(0) system ut satisfies satisfies (*) and (**) abov ab ove. e. We say that yt is cointegrated with an n-dimensional cointegrating vector a if a = 0 and a yt can be made trend stationary by a suitable choice of its initial value y0 .
• So an I(1) system yt is cointegrated if and only if a Ψ(1) = 0
(1×n)
(10.1.20)
• The cointegration rank h ≡ n − rank(Ψ(1)). • Implications of the definition. 1. h < n. 2. Ψ(1)ΩΨ(1) p.d. ⇒ yt is not cointegrated (i.e., h = 0). 0). (“⇐” as well in the present linear case.)
3. A stationary stationary VAR VAR in first differences differences is not cointegrate cointegrated. d. Φ(L)∆yt = ε t .
2
The Example • Consider the following bivariate first-order VMA process:
where
u1t = ε 1t − ε1,t−1 + γε γ ε2,t−1 ,
or ut = ε t + Ψ1 εt 1 ,
−
u2t = ε 2t ,
ε −1 γ . εt = 1t , Ψ1 = 0 0 ε2t
(10.1.6)
(10.1.7)
Since this is a finite-orde finite-orderr VMA, the one-summab one-summabilit ility y condition is trivially trivially satisfied. The requirement of nonzero Ψ(1) is satisfied because Ψ(1) = I 2 + Ψ1 =
0 γ = 0 . 0 1 (2 2)
(10.1.8)
×
• The associated I(1) system is given by ∆ yt = δ + ut . In levels,
y1t = y 1,0 + δ 1 · t + ( ε1t − ε1,0 ) + γ · · (ε2,0 + ε2,1 + · · · + ε2,t−1 ),
(10.1.13)
y2t = y 2,0 + δ 2 · t + ( ε2,1 + ε2,2 + · · · + ε2t ).
decomposition?? The matrix matrix version version of α(L) equals −Ψ1 so that the stationary com• The BN decomposition ponent η t in the BN decomposition is η t =
ε1t − γε 2t
0
.
(10.1.17)
The two-dimensional stochastic trend is written as
0 γ Ψ(1)(ε1 + ε2 + · · · + εt ) = 0 1
t s=1 ε1s t s=1 ε2s
=
γ · ·
t s=1 ε2s t s=1 ε2s
.
Thus, the two-dimensional stochastic trend is driven by one common stochastic trend,
(10.1.18) t s=1 ε2s
.
cointegrating ing vectors vectors can • The rank of Ψ(1) is 1, so the cointegration rank is 1 (= 2 − 1). All cointegrat be written as ( c, −cγ ) , c = 0.
• Consider
δ 1 0 γ . δ .. Ψ(1) = δ 0 1 . 2
If the cointegration vector also eliminates the deterministic trend, then cδ 1 −cγδ 2 = 0 or δ 1 = γ δ 2 , . which implies that the rank of [ δ .. Ψ (1)] shown above is one.
3
2.
Altern Alternativ ativee Rep Repres resent entatio ations ns of Cointeg Cointegrat rated ed Systems Systems ut = Ψ (L)εt , ∆yt = δ + ut .
• Three Representations of cointegrated systems 1. Triangular 2. VAR 3. VECM
Triangular Representation (skip) VAR • Consider yt = α + δ t + ξ t ,
where ξt is (finite-order) VAR in levels (so Φ(L)ξ t = ε t ).
• Eliminate ξ to obtain yt = α ∗ + δ ∗ · t + Φ1 yt−1 + Φ2 yt−2 + · · · + Φ p yt− p + εt ,
δ ∗ ≡ Φ (1) δ
(n×1)
(10.2.14)
• What is a necessary and sufficient condition for ξ t to be I(1) with cointegration rank h? • Answer: Φ(L) = U (L) M(L) V(L) with
(1 − L)In
M(L) =
−h
0
((n−h)×h)
0
Ih
(h×(n−h))
, U(L), V(L) stable.
theree exis exists ts n × h full column rank • It follows from this condition that rank( Φ(1)) = h. So ther matrices A and B such that Φ(1) = B A (10.2.21)
(n×h) (h×n)
(n×n)
• Derive: Φ(L)Ψ(L) = (1 − L)In so Φ(1) Ψ(1) = (n×n) (n×n)
0
(n×n)
• Verify that A collects linearly independent cointegrating vectors.
4
(10.2.17, 10.2.18)
VECM • Just a matter of algebra to derive from (10.2.14): ∗
∗
∆yt = α + δ · t − Φ(1) yt 1 + ζ 1 ∆yt 1 + · · · + ζ p 1 ∆yt −
∗
∗
= α + δ · t − B zt
−1 (n×h)(h×1)
−
−
+ ζ 1 ∆yt 1 + · · · + ζ p 1 ∆yt −
−
+1 + εt
− p
− p
+1 + εt
(10.2.26)
(10.2.27)
where ζ s ≡ −(Φs+1 + Φs+2 + · · · + Φ p ) for s = 1, 2, . . . , p − 1
(10.2.24)
(n×n)
zt ≡ A
yt
(h×n)(n×1)
(h×1)
• Have covered the Granger Representation Theorem.
5
(10.2.28)
Sprilng 2015
GRIPS
Advanced Ecoonmetrics III, Final
Part A: True/False Questions (5 points each) Are the following statements true or false? Justify your answer in one to three lines. (a) “xn
→ x, y → y” ⇒ “x + y → x + y”. (b) In order for the filter 1 − φL to have an inverse, it is necessary that |φ| < 1. d
n
n
d
n
d
(c) The first-order autocorrelation coefficient of any MA(1) process is less than or equatl to 0.5. (d) For finite-samples, the numerical value of the DF ρµ statistic derived from a sample of size T + 1, (y0 , y1 ,...,yT ), does not depend on the initial value y 0 if yt is a random walk without drift.
{ }
Part B: Short Questions (10 points each) 1. Lemma 2.4 (a)-(c) for scalar random variables can be stated as (a) “xn
→ x, y → α” ⇒ “x + y → x + α”. (b) “x → x, y → 0” ⇒ “x y → 0”. (c) “x → x, a → a” ⇒ “a x → ax”. d
n
p
n
n
d
n
p
n n
p
n
d
n
p
n n
d
d
n
Prove Lemma 2.4(c) from Lemma 2.4(a) and (b). 2. Let xt be a covariance-stationary process. Consider two processes, yt = x t 2.5xt 1 and z t = x t 0.4xt 1 . What is the relationship between the autocorrelation coefficients of yt and the autocorrelation coefficients of zt ?
{ }
− { }
−
{ }
−
−
3. Calculate the first-order autocorrelation coefficient (not autocovariance) for the ARMA(1,1) process yt 0.5yt 1 = ε t + εt 1 .
−
−
−
4. In the unconditional test of the efficiency of the foreign exchange market discussed in class, Gordin’s CLT was applied to εt , the unexpected change in the spot rate. Consider applying the multivariate version of Gordin’s CLT to gt where gt = (εt , (f t st )εt ) . Write down the test statistic you would use and indicate its limiting distribution. Here, you do not need to derive your result. Just write down the formula for the statistic.
−
′
5. Consider a likelihood function (θ) with only one parameter θ. The sample size is 100. Suppose that the ML estimator, which maximizes (θ), is θ = 1.0. At this ML estimate, the value of the Hessian is 2.0. Calculate, to the second decimal point, the t value used for testing the hypothesis that θ 0 = 0.
L
−
L
Part C: A Longer Question (30 points) Consider the stationary ARMA(1,1) model: yt = φy t 1 + ut , ut = ε t + θεt 1 with φ < 1 and ε t i.i.d. with mean zero and variance σ 2 . Let φ be the OLS estimator obtained from regressing y t on yt 1 (so a constant is not included as a regressor). Derive its probability limit. [Hint: −
∑ ∑ φ =
1 T 1 T
T t=1 yt−1 yt T 2 t=1 yt−1
−
| |
−
= φ +
2
∑ ∑
1 T 1 T
T t=1 yt−1 ut . T 2 y t=1 t−1
]
Spring 2015
GRIPS
Advanced Econometrics III, Final, Answer Sheet
Part A: True/False Questions (a) False. The convergence of (xn , yn ) to (x, y) must be joint. (b) False. The inverse can be defined for any filter as long as the constant term (1 in this example) is not zero. (c) True. Let y t = ε t + θε t
1
−
. Then ρ 1 =
θ 1+θ2
. This achieves a maximum of 0.5 when θ = 1.
| |
(d) True. This is because the intercept in the estimated equation offsets the initial condition.
Part B: Short Questions 1. an xn = (an
− a)x + ax
n.
n
By (b), (an
− a)x → 0. So the desired result follows from (a) with α = 0. p
n
2. Let γ j be the j-th order auotcovariance of xt , and let γ jw be the j-th order auotcovariance of wt for w = y, z. Then γ jy = [1 + (2.5)2 ]γ j 2.5γ j 1 2.5γ j +1 , and γ jz = [1 + (0.4)2 ]γ j 0.4γ j 1 0.4γ j +1 . So for all j, γ jy = (2.5)2 γ jz . Thus the autocorrelations are the same.
{ } −
−
−
−
−
{ }
−
3. The ARMA process is (1 φL)yt = εt + ε t 1 with φ = 0.5. This process can be represented as yt = (1 + φL + φ2 L2 + )εt + ( 1 + φL + φ2 L2 + )εt 1 = ε t + ( 1 + φ)(εt 1 + φεt 2 + φ2 εt 3 + ). So γ 0 = σ 2 + ( 1 + φ)2 1 σ φ = 12σ φ and γ 1 = 11+φφ σ 2 (use Proposition 6.1.(b) to derive these). Thus ρ 1 = 1+2 φ . For φ = 0.5, ρ 1 = 3/4.
···
2
2
−
−
−
···
2
−
−
5. t =
−
−
−
···
−
4. The statistic is ng S 1 g, where g The limiting distribution is χ 2 (2). ′
−
≡
∑
1 n
n t=1
√ − θ
θ0
1 T
Avar(θ )
≡ ∑ ≡ ∑
gt and S
Γ0 +
4 j =1
(θ ) −T × H
1
−
, Avar(θ ) =
′
(Γj + Γj ), Γj
1 n
n t=j +1
gt gt ′
j.
−
.
−2.0. So t = √ 2 = 1.41.
Here, θ = 1.0, θ0 = 0, H (θ) =
Part C: A Longer Questions For a sample period of t = 0, 1, 2, . . . , T , we have
∑ ∑ φ =
1 T 1 T
T t=1 yt−1 yt T 2 t=1 yt−1
= φ +
∑ ∑
1 T 1 T
T t=1 yt−1 ut . T 2 y 1 t − t=1
(1)
Since yt is ergodic stationary by Propositions 6.1(d) and 6.4(a) and since the mean of the process is zero, the numerator in the last ratio converges in probability to Cov(yt 1 , ut ) and the denominator to γ 0 .
{ }
−
3
Since yt is stationary ARMA and φ < 1, yt can be written as a weighted average of ε t , εt ys is uncorrelated with ε t for s < t. Use this fact to derive:
{ }
| |
Cov(yt
1
−
, ut ) = Cov(yt
1
−
= θ Cov(yt
, εt + θε t 1
−
= θ Cov(φyt
, εt 2
−
1
−
1
−
1
−
)
+ εt
1
−
+ θε t
2
−
, εt
1
−
) (2)
+ εt + θε t
1
−
, we obtain
γ 0 = φ 2 γ 0 + (1 + θ 2 )σ 2 + 2φθ Cov(yt
1
−
, εt
1
−
)
= φ 2 γ 0 + (1 + θ 2 )σ 2 + 2φθσ 2 . Solving this for γ 0 , we obtain
So
, . . . . So
)
= θσ 2 . Taking the variance of both sides of yt = φy t
1
−
(3)
1 + θ2 + 2φθ 2 γ 0 = σ . 1 φ2
(4)
θ(1 φ2 ) plim φ = φ + . 1 + θ 2 + 2φθ
(5)
−
−
The denominator is not zero because 1 + θ2 + 2φθ = 1
4
2
−φ
+ (φ + θ)2 and φ < 1.
| |
Spring 2015
GRIPS
Advanced Econometrics III, Final: My Grading Policy
Part A: True/False Questions (a)
• 0 if True, • 4 if False, • 5 if False and mentions the jointness.
(b) The question was tricky, so you get some partial credit regardless of your answer.
• 2 if True, • 3 if True and writes down (1 − φL) = 1 + φL + φ L + ··· , • 5 if False. • 0 if False, • 3 if False, but writes down ρ = , • 2 if True, • 5 if True and writes down ρ = . • 0 if False, • 3 if False but gives the correct reason for “True”, • 3 if True, • 5 if True with the correct reason. 1
2
−
(c)
θ 1+θ2
1
θ 1+θ2
1
(d)
2
Part B: Short Questions 1. You get full credit of 10 points if you note (an get no or partial credit.
− a)x
n vanishes
and then refers to (a). Otherwise you
2. y j
• 3 points for an equation relating γ to the autocovariances of x. 3 points for doing the same for z. Here, you get only partial credit if the equation is incorrect in varying degrees.
• 4 points for stating that the autocorrelations are the same. 2 points for stating that the autocorrelations are the same only for a subset of lags.
3.
• 8 points for correct γ
0
and γ 1 .
– If you used the MA representation,
∗ 5 points for the correct MA rep, 2 points for incorrect MA reps. 5
∗ 3 points for the correct mapping from MA coeffs to γ ’s, 1 point for incorrect mappings. – If you used the Yule-Walker equation, 1-8 points depending on the degree of completeness.
• 2 points for the correct value of ρ . 1
4.
• 7 points for the correct test stat. – 1 for writing down the correct expression for g,
– 3 for S.
• 3 for the correct asymptotic distribution. The above is provided that g is 2 × 1. You get only 2 points if your g is univariate.
No credit for any description of the conditional test. You receive 1 point if you describe some version of the Wald statistic. 5. 5 points for writing down (some version of) the t value that involves the Hessian. 5 points for the correct t value.
Part C: A Longer Questions
• 2 points for writing down the equation in the hint. • 7 points for the correct expression for the plim of φ.
– 1 for indicating the plim of the denominator of the ratio
∑T yt−1 ut t=1 ∑ T 2
1 T 1 T
t=1
y t−1
(eve if the plim is
wrong). 1 for the plim of the numerator. You get only 1 point for mentioning the plim of the ratio (even if the plim is wrong). – 2 for some attempt to calculate γ 0 . – 1 for not presuming that the plim of the numerator is 0.
• 1 for mentioning that the denominator is not zero if |φ| < 1. That would be if full credit were 10 points. Your point is three times the point determined thus.
6
Sprilng 2016
GRIPS
Advanced Ecoonmetrics III, Final
Part A: Multiple Choice Questions (30 points) 6 points each. For each question, you lose one point if your answer is wrong.
1. Is the following statement true or false? “The first-order autocorrelation coefficient of any MA(1) process is less than or equal to 0.5.” (a) True. (b) False. 2. Let zt be a vector covariance-stationary process. covariance-stationary process.
{ }
Then each element of zt is a univariate
(a) True. (b) False. 3. Let ψj ( j = 0, 1, 2, . . . ) be a sequence of real numbers. Which one is the weakest statement about the sequence?
{ }
(a)
∞
|ψ | < ∞. 2 (b) =0 ψ < ∞. (c) |ψ | ≤ Ab for some A > 0 and 0 < b < 1 for all j = 0, 1, 2,.... 4. The filter φ(L) = 1 + 0.2L − 0.48L2 satisfies the stability (stationarity) condition. j =0
j
∞
j
j
j
j
(a) True. (b) False. 5. (This is a tricky question. Be careful in answering.) Suppose you have the following regression printout: yt = 0.12 + 0.044 t + 0.975 yt−1 + 0.5 yt−2 , R2 = 0.98,SEE = 2.44. (0.33)
(0.0022)
(0.010)
(0.034)
If you use the appropriate version of the asymptotic ADF t test, the null that y t is a random walk without drift is (a) rejected at 2.5% significance level. (b) rejected at 5% level but not at 2.5%. (c) accepted at 5%. (d) not testable with the information provided.
Part B: Short Questions (40 points) In your answer, you can take all the results from the Slides and the book as granted. However, indicate in your answer which result is being used. For example, “ xn Large Numbers”.
→
a.s.
α by Kolmogorov’s Law of
1. (8 points, 5 lines or less) Let zt be a sequence of i.i.d. (independently and identically distributed) random variables with E(zt ) = µ = 0 and Var(zt ) = σ 2 > 0, and let z n be the sample mean. What is the variance of the limiting distribution of n( z n µ)?
{ }
√ √ − √
2. (6 points, 8 lines or less) Let (x1 , x2 , . . . , xn ) be a random sample (so xt is i.i.d.) with E(xt ) = µ n 1 and Var(xt ) = σ 2 > 0. Assume µ and σ 2 are finite. Let σn2 xn )2 , where xn t=1 (xt n n 1 2 2 t=1 xt is the sample mean. Show that σn converges in probability to σ . n
{ } ≡
−
≡
3. (10 points, 6 lines or less) Let zt be a univariate random walk. Compute Corr(zt , zt+s ) (correlation of z t and z t+s ) for s > 0 and find its limit as s with t held fixed.
{ }
→∞
4. (10 points, 8 lines or less) Let εt (t = 1, 2, 3, . . . ) be a martingale difference sequence. Generate a sequence xt (t = 1, 2, 3, . . . ) by the formula: x1 = ε1 , xt = ε t εt−1 (t = 2, 3, . . . ). Show that xt is a martingale difference sequence.
{ }
{ }
{ }
·
5. (6 points, 6 lines or less) Consider a covariance-stationary VAR( p): Φ(L)yt = c + εt where Φ(L) = I Φ1 L Φ2 L2 Φ p L p . Suppose that Φ(L) satisfies the stability condition. Show that the mean of y t is Φ(1)−1 c.
−
−
−···−
2
Part C: A Long Question (30 points) Consider the bivariate VAR:
y1t y2t
= φy1,t−1 + αy2,t−1 + ε1t , = y 2,t−1 + ε2t .
Assume that φ < 1 and φ = 0.
| |
(a) (5 points, 2 lines or less) The bivariate system can be written as yt = Φ1 yt−1 + εt , where yt (2×1)
(y1t , y2t ) and εt
(2×1)
≡
≡ (ε1 , ε2 ) . Specify the 2 × 2 matrix Φ1. Just write down your answer. No need t
t
to explain. (b) (5 points, 1 line) Find the cointegrating rank. Just provide your answer. No need to explain. (c) (7 points, 2 lines or less) Find a cointegrating vector. Just provide your answer. No need to explain. (d) (5 points, 14 lines or less) Derive the VECM (Vector Error Correction Model). (e) (8 points, 14 lines or less) Derive the Vector Moving Average Representation ∆yt = Ψ( L)εt .
3
Spring 2016
GRIPS
Advanced Econometrics III, Final, Answer Sheet
Part A: Multiple Choice Questions 1. True. Let y t = ε t + θεt−1 . Then ρ 1 =
θ
1+θ 2 .
This achieves a maximum of 0.5 when θ = 1.
| |
2. (a) True. 3. (b). 4. (a) True. 5. The t statistic is -2.5, which is asymptotically DF tτ with or without drift by Proposition 9.8. The 5% critical value is -3.41. So we accept at 5%. The answe is (c).
Part B: Short Questions 1. By Lindeberg-Levy, the asymptotic variance of z is σ 2 . By the delta method, the asymptotic variance of z is σ 2 /(4µ).
√
2. By simple algebra,
1 n
σn =
n
−
n t=1
x2t
2
(xn ) ,
t=1
1 where xn xt . Since µ is finite, can apply Kolmogorov’s Strong LLN to claim that n 2 xn a.s. µ. Since σ is finite, the second moment E(x2t ) is finite as well. So can apply Kolmogorov’s n Strong LLN to claim that n1 t=1 x2t a.s. E(x2t ). Thus
→
≡
→
σn
→
3. zt = g 1 + + gt and z t+s = zt + gt+1 + t σ 2 , where σ 2 Var(gt ). So
·
·· ·
≡
p
− µ2 = σ 2.
·· · + g + . So Var(z ) = t · σ2 and Cov(z , z + ) = Var(z ) =
Corr(zt , zt+s ) = This goes to zero as s
E(x2t ) t s
t
t
t σ2 = t σ 2 (t + s)σ2
√
·
·
→ ∞.
t s
t
t . t+s
4. (xt−1 , . . . , x1 ) can be calculated from (εt−1 , . . . , ε1 ). That is, the latter has more information than the former. Thus E(xt xt−1 , . . . , x1 ) = E(εt εt−1 xt−1 , . . . , x1 )
|
|
= E [E(εt εt−1 εt−1 , . . . , ε1 ) xt−1 , . . . , x1 ]
·
|
|
= E [εt−1 E(εt εt−1 , . . . , ε1 ) xt−1 , . . . , x1 ] = 0.
|
|
(1)
5. By covariance-stationarity, E(yt ) = µ for all t, where µ E(yt ). So µ Φ1 µ Φ p µ = c, or Φ(1)µ = c. Since Φ(L) satisfies the stability condition, Φ(1) = 0. So Φ(1) is invertible.
≡ | |
4
−
−···−
Part C: The Long Question (a) In vector nation, the process is: yt = Φ1 yt−1 + εt , where Φ1 =
φ α 0 1
. So Φ(1) =
1
− φ −α 0
0
.
(b) Clearly, the rank of Φ(1) is one. So h = 1. (c) Φ(1) = BA , where B = (1, 0) , A = (1
−α
− φ, −α) . So a cointegrating vector is (1, 1
−φ
) .
(d) This should be very easy because the answer is a special case of the discussion from (10.2.22) to (10.2.28) of the text. Here, p = 1 (so Φ2 = Φ3 = = 0, ζ 1 = ζ 2 = = 0) and α = δ = 0 (so ζ t = y t ).
· ··
∆yt
≡ y − y t
· ··
t−1 = π 1 yt−1
=
t−1 +
−y
εt
−Φ(1)y 1 + ε = −BA y 1 + ε = −Bz 1 + ε , t−
t
t−
t
t−
where z t−1
t
t−1 .
≡ A y
(e) By (10.2.17), Ψ(L) = Φ(L)−1 (1
− L). Use the convolution formula to invert Φ(L) ≡ I − Φ1L. So
Ψ(L) = =
2
2
I + Φ1 L + (Φ1 ) L + (1
− L)(1 − φL)
1
−
·· · −
αL(1
0
(1
L)
− φL) 1
To derive the second equality here, use
− Φ1
j +1
Φ1
j
φj (φ 1) αφj = . 0 0
The VMA representation is: ∆yt = Ψ(L)εt .
5
−
1
−
.
Spring 2016
GRIPS
Advanced Econometrics III: Final, My Grading Policy Part A: Multiple Choice Regarding 5., I’ve decided to give full credit (of 6 points) for anybody who chose (c) or (d) or who decided not to answer the question. I am sorry that the question was not poorly worded. First of all, the numbers in parentheses are standard errors and “SEE” stands for the standard error of the equation (the sample standard deviation of the residuals). Secondly, the question would have been a lot more straightforward to answer if yt−2 in the regression were ∆yt−1 where ∆yt−1 yt−1 yt−2 . The clear answer then would have been (c).
≡
−
Part B: Short Questions 1.
2.
• 3 points for stating that √ n(z − µ) →d N (0, σ2). • 2 points for writing down the Delta method formula. • 3 points for the right answer. • 2 points for stating that σ2 →p E[(x − µ)2] without proof. • 1 if x →p µ. • I took 2 points off if I see something I don’t like (e.g., if you state that the plim of some random n
t
n
n
variable is a random variable whose distribution depends on n).
3.
• 3 points for Var(z ) = tσ 2. • 3 points for Cov(z , z + ). • 2 points for Corr (1 if you merely write down the definition of Corr). • 2 points for the limit of Corr. • 2 points for noting that (ε1,...,ε 1) has at least as much info as (x1,...,x 1). • 2 points for the Law of Iterated Expectations. • 6 points for the rest. I.e. 2 points for using E( ε |ε 1,...) = 0, 2 for noting ε 1 can be placed outside the conditional expectation, and 2 for E( ε 1 |ε 1 ,...)0. • I gave 2 points for proving the weaker result that E(x ) = 0. t
t
4.
t s
t−
t−
t
t−
t−
t t−
t−
t
5. From the full credit of 6 points, I subtracted
• 2 points if your argument does not extend to vector processes. • 1 for failing to note that |Φ(1)| = 1 is implied by stationarity. • 2 if confusion between Φ(1) and 1 − Φ(1). • 2 if confusion between y , which is random, and E(y ). t
t
6
Part C: The Long Question (d)
• 2 points for ∆y = −Φ(1)y 1 + ε . • 3 points for using Φ(1) = BA . t
t−
t
I took 1 point off if you didn’t impose p = 1 by including lagged changes. (e)
• 2 points for Ψ(L) = Φ(L) • 3 for inverting Φ(L). • 3 for the rest.
1
−
(1
− L).
7
HMM (Hidden Markov Models) Fumio Hayashi GRIPS July 22, 2016
Also called regime-switching models. Based on
Chapter 22 of Hamilton’s book, Time Series Analyais , 1994,
Fumio Hayashi
HMM Hidden Markov Models
July 22, 2016,
1
1
Markov Chains s t = {1, 2, ..., N }, Prob(s t = j |s t −1 = i , s t −2 = k , ...) = Prob(s t = j |s t −1 = i ) = p ij .
The transition matrix:
P
(N ×N )
That is,
=
p 11
p 21
p 12
p 22
.. .
.. .
p 1N
p 2N
··· ··· ··· ···
p N 1 p N 2
.. .
N
, p ij ≥ 0 , P 1 = 1 (i.e.,
p NN
p ij = 1).
j =1
Prob(s t +1 = j |s t = i ) = ( j , i ) element of P. Fairly easy to show: Prob(s t +m = j |s t = i ) = ( j , i ) element of Pm . [Explain on the board.]
Fumio Hayashi
HMM Hidden Markov Models
July 22, 2016,
2
1
Markov Chains (condinued) Suppose {s t } is stationary. What is the unconditional distribution
π
(N ×1)
= (Prob(s t = j ))?
Pπ = π . π solves
A
π = e N +1 , where A ≡
((N +1)×N )
IN − P 1
So π is the (N + 1)th column of (A A)−1 A . [Explain on the board.] Under what conditions is {s t } stationary? For N = 2, p 11 < 1, p 22 < 1, p 11 + p 22 > 0.
Fumio Hayashi
HMM Hidden Markov Models
July 22, 2016,
3
1
The HMM Model The AR1 example: y t = c (s t ) + φ(s t )y t −1 + σ (s t )εt , εt ∼ N (0, 1),
{s t } is a two-state Markov chain, s t unobservable. For
yt , its conditional distribution is given by (M ×1)
f (yt |s t = j , Y t −1 ; α),
{s t } is an N -state Markov chain, s t unobservable.
Y t ≡ ( yt , yt −1 , ..., y1 ) is the date t info set, only s t is relevant for the conditional density; ( s t −1 , s t −2 ,...) doesn’t matter. The model’s parameter vector θ consists of α and P.
In the above AR1 example,
xt = 1,
The conditional distribution is normal with mean c (s t ) + φ(s t )y t −1 and variance σ (s t )2 . α = (c (1), c (2), φ(1), φ(2), σ (1), σ (2)). θ consists of α and (p 11 , p 22 ).
Fumio Hayashi
HMM Hidden Markov Models
July 22, 2016,
4
1
η t ξ t |τ (The Hamilton Filter) ,
Definition:
f (yt |s t = 1, Y t −1 ; α)
Prob(s t = 1|Y ; θ ) .. .. η t ≡ ξ ≡ , . t | . . (N ×1) (N ×1) Prob(s t = N |Y ; θ ) f (yt |s t = N , Y t −1 ; α) τ
τ
τ
The recursion:
ξt |t =
ξt |t −1 η t
1 (ξt |t −1 η t )
, ξt +1|t = P ξt |t ,
≡element-by-element multiplication.
Given θ , a sequence (η 1 , η 2 , ..., η T ), and an initial condition ξ1|0 , you can use the above to generate (ξ1|1 , ξ2|1 , ξ2|2 , ξ3|2 , ..., ξT −1|T −1 , ξT |T −1 ).
For the derivation of the recursion, see pp. 692-693 of Hamilton.
Fumio Hayashi
HMM Hidden Markov Models
July 22, 2016,
5
1
The Likelihood Function
The usual sequential factorization: f (yT , ..., y1 ) = L(θ ) = T (yt |Y t −1 ; θ ). Note: s t is absent. t =1 log f
T t =1 f (yt |yt −1 , ..., y1 ).
So
Use the board to derive: N
f (yt |Y t −1 ; θ ) =
f (y t |s t = j , Y t −1 ; θ ) · Prob(s t = j |Y t −1 ; θ ).
j =1
Prob(s t = j |Y t −1 ; θ ) = j -th element of ξ t |t . So
f (yt |Y t −1 ; θ ) = η t ξt |t −1 .
To compute L(θ ), you need ξ1|0 . Take this as additional parameters. The number of parameters can be large. Often difficult to find the global max of L(θ ). See Hamilton, pp. 695-696 for an algorithm.
Fumio Hayashi
HMM Hidden Markov Models
July 22, 2016,
6
1
Forecasts and Smoothed Inferences for the Regime (forecasting) Fairly easy to show
ξt +m|t = P m ξt |t (m = 1, 2,...)
(smoothed probabilities) Kim’s (1993) recursion: ξt |T = ξt |t
P [ξt +1|T (÷)ξt +1|t ] ,
(÷) =element-by-element division.
Iterate backwards for t = T − 1, T − 2, ..., 1), starting with ξT |T . For example, for t = T − 1,
ξT −1|T = ξT −1|T −1
Fumio Hayashi
P [ξT |T (÷)ξT |T −1 ] .
HMM Hidden Markov Models
July 22, 2016,
7
1
This version: October 2010 Fumio Hayashi
Matrix Algebra Some Definitions • (What is a Matrix?) An m × n matrix is an array of numbers ordered into m rows and n columns: a11 a12 · · · a1n a21 a22 · · · a2n A = . (1) .. .. .. (m×n) . . ··· .
For example, a 3 × 4 matrix might be
2. 3 3 −5
· · · amn
am1
am2
4 5 1
6 −7 0.2
−1 21 3
.
(2)
The size The size of of a matrix refers to the number of rows and the number of columns. A matrix is sometimes specified by describing the element in row i , column j (the (i, j ) element): A = [aij ] .
(3)
It doesn’t matter which letters (here, i and j ) to use to subscr subscript ipt the elemen elements ts of a matrix. matrix. The matrix A could be written as [akℓ ], for example. • (Submatrices) By (Submatrices) By selecting rows and columns from a matrix, you can create a new matrix. Such a matrix is called a submatrix. submatrix. For example, example, a submatrix submatrix created created by selecting rows rows 1 and 3 and colums 2 and 3 from the above example matrix is
�
4 1
6 0.2
�
.
(4)
(Vectors) If there is only one column (n = 1), then an m × n matrix reduces to an m-dimensional • (Vectors) If column vector or vector or a column vector of order m, whereas with only one row (m = 1), the matrix reduces to an n-dimensional row vector or vector or a row vector of order n. An m -dimensional vector is sometimes called a vector a vector of order m. A vector is sometimes sometimes specified by describing describing the k-th element: an m -dimensiona -dimensionall column vector is written written as:
a1
a
(m×1)
=
.. .
= [ak ].
(5)
am
It doesn’t matter which letter (here, k ) to use to subscript subscript the elements of a vector. vector. The vector vector a could be written as [aj ], for example. A zero A zero vector or vector or null vector, vector, denoted 0 denoted 0,, is a vector whose elements elements are all zero. We say that a vector vector is not equal to a zero vector vector if at least one element element of ′ the vector is not zero. For example, a = (0 ( 0, 0, 0, 3) is not a zero vector. • (Scalars) A (Scalars) A single number ( m = 1 and n = 1) is called a scalar a scalar.. • (Note (Note on Our Notati Notation) on) Unless otherwise indicated, matrices are written in uppercase bold letters, vectors are understood to be column vectors and are written in bold lowercase letters, and scalars are in lowercase italics. 1
• (Transposition) The (Transposition) The transpose of transpose of A A = [aij ], denoted denoted A A ′ , is given by A′ = [aji ] .
(6)
For example, the transpose of
� � 2 4 3 7 2 3
is
2 4
(7)
3 2 . 7 3
(8)
The transpose of a row vector is a column vector and the transpose of a column vector is a row vector. It is easy to show that (A′ )′ = A. (9) • (Matrice (Matricess as Collection Collection of Vectors) ectors) Any matrix can be written as a collection of suitably defined defined vectors. vectors. The matrix (1) shown above can be written written as a collection of row vectors vectors placed one over another: A
(m×n)
Here,
ai′
=
a1′ .. .
ai1
where
ai = (n×1)
′ am
.. .
(i = 1, 2,...,m).
(10)
ain
is the i-th row of A. The same matrix matrix can also be written written as a collec collectio tion n of column column
(1×n)
vectors:
a1j
A
(m×n)
= [a1 · · · an ]
where
=
aj (m×1)
.. .
( j = 1, 2,...,n).
(11)
amj
Here, a Here, a j is the j -th column of A. A .
• (Square (Square Matrices and their Special Cases) If Cases) If the number of rows equals the number of columns (i.e.,if m m = n ), the matrix is said to be square of order m. It looks like:
A
(m×m)
=
a11 a21
a12 a22
am1
am2
.. .
.. .
··· ···
a1m a2m
. .. ··· . · · · amm
(12)
The diagonal elements of elements of this square matrix are ( a11 , a22 , . . . , amm ). If all element elementss above above the diagonal elements are zero, the matrix is is lower triangular. triangular. Therefore Therefore,, a lower lower triangular triangular matrix looks like: a11 0 0 ··· 0 0 ··· 0 a21 a22 A = (13) . .. .. .. .. (m×m) . . . ··· .
am 1
am2
am3
· · · amm
If all elements below the diagonal elements are zero, the matrix is upper triangular. triangular. Therefore Therefore,, a matrix is upper triangular if its transpose is lower triangular. If all elements besides the diagonal elements are zero, the square matrix is said to be diagonal. diagonal. A diagonal diagonal matrix of order m whose diagonal elements are all unity is called the identity matrix and matrix and is denoted I denoted I m or simply I simply I::
1 0 ··· 0 1 ··· Im ≡ . . .. .. · · · (m×m) 0 0 ···
0 0 .. . .
1
A square matrix A matrix A is symmetric if symmetric if A A = A = A ′ , that is, if a = j . a ij = a ji for all i, j = 1, 2, . . . , n ; i ̸ 2
(14)
Matrix Summation and Multiplication • (Sum of Two Matrices) Consider Matrices) Consider two m × n matrices:
A
(m×n)
=
a11 a21
··· ···
a12 a22
.. .
.. .
am1
a1n a2n
.. ··· . · · · amn
am2
and
=
B
(m×n)
b11 b21
b12 b22
bm1
bm2
.. .
.. .
··· ···
b1n b2n
A+B ≡
a11 + b11 a21 + b21
a12 + b12 a22 + b22
.. . am1 + b m1
(m×n)
··· ···
a1n + b 1n a2n + b 2n
. .. ··· . · · · amn + bmn
.. . am2 + b m2
. .. ··· . · · · bmn
The s The sum um of of A A and B is a matrix obtained by element-by-element addition:
(15)
(16)
For A or A + B to be well-defined, the size (the number of rows and the number of columns) of A of A must be the same as that of B. B . • (Inner Product of Two Vectors of the Same Order) The inner product of two n-dimensional n (n × 1) vectors a vectors a = = [ak ] and b and b = = [bk ] is defined to be a scalar k=1 ak bk = a 1 b1 + a2 b2 + · · · + an bn .
∑
• (Matrix Multiplication) Consider Multiplication) Consider an m × n matrix A matrix A and an n × q matrix B matrix B::
A
(m×n)
=
a11 a21
.. .
.. .
am 1
The product The product of of A
(m×n)
··· ···
a12 a22
a1n a2n
.. ··· . · · · amn
am2
and
B =
(n×q)
b11 b21
b12 b22
bn1
bn2
.. .
.. .
··· ···
b1q b2q
.. . ··· . · · · bnq
(17)
and B , written as AB as AB or or sometimes A sometimes A × B, is defined to be the following (n×q )
matrix whose row ( i, j ) element is m × q matrix
AB ≡
(m×q )
∑ ∑ ∑
n k=1 a 1k bk1
∑
n k=1 a 2k bk1
.. .
n k=1 a mk bk 1
Put differently, if ai′
n k=1 a ik bkj : n k =1 a 1k bk 2
···
n k =1 a 2k bk 2
···
∑ ∑ ∑
···
n k=1 a mk bk2
is the i-th row of A
(1×n)
.. .
(m×n)
···
and bj
n k=1 a 1k bkq
∑ ∑ ∑
n k=1 a 2k bkq
.. .
n k =1 a mk bkq
.
(18)
is the j -th column of B , the (i, j ) (n×q )
(n×1)
element of AB is AB is the inner product of two n -dimensional vectors a vectors a i and b j . We say that A that A is is post postmultiplied (or multiplied (or multiplied from right or right or right-multiplied) right-multiplied) by B by B or that B that B is pre-multiplied (or multiplied (or multiplied from left or left-multiplied) left-multiplied) by A. Notice that multiplication requires that the number of columns of A be the same as the number of rows of B. Two matrice matricess satisfying satisfying this requirement are said to be conformable; conformable; for the matrix product AB product AB to be well-defined, A well-defined, A and B must be conformabl conformable. e. Unless Unless otherwise indicated, indicated, when we write a matrix product such such as AB, AB , we take for granted that the matrices involved are conformable. • (Basic Algebra of Matrix Addition and Multiplication) It Multiplication) It follows immediately from definition that matrix matrix addition addition and multiplica multiplication tion satisfy the following following properties. (a) Addition Addition is commutative : A + B = B = B + A;
(19)
whereas multiplication is not: AB ̸ = BA . 3
(20)
Indeed, the product BA is not well-defined unless the number of columns of B equals the number of rows of A (i.e., A (i.e., unless the two matrices B matrices B and A are conformable), and even where it exists, AB exists, AB would be equal to BA to BA only in rather special cases. One special case is where an identity matrix is involved. For any m × n matrix A matrix A,, A × In = A
(21)
Im × A = A = A.
(22)
and also (b) Both addition addition and multiplic multiplication ation are associative : (A + B) + C = A = A + (B ( B + C) (AB) AB)C = A = A((BC) BC).
(23) (24)
(c) The distributive law holds: A(B + C) = AB + AC,
(A + B)C = AC = AC + BC.
(25)
(d) Here is how transposition transposition interacts interacts with addition addition and multiplica multiplications: tions: (A + B)′ = A ′ + B′ (AB) AB)′ = B ′ A′ .
(26) (27)
Multiplication by a Scalar • (Scalar Multiplication Multiplication of a Matrix) Matrix) To multiply A multiply A by by a scalar α , each element of A is A is multiplied by α : αa11 αa12 · · · αa1n αa21 αa22 · · · αa2n α × A = A × α ≡ . (28) .. .. .. (1×1) (m×n) (m×n) (1×1) . . ··· .
αam1
· · · αamn
αam2
• (Scalar Multiplication of a Vector) A Vector) A special case is where
A
(m×n)
is a vector
a . Settin Settingg
(m×1)
n=1 in (28), we obtain
α
(1×1)
×
a
(m×1)
=
a
(m×1)
× α ≡ (1×1)
αa1 αa2
.. .
αam
.
(29)
Special Cases of Matrix Multiplication Important special cases of matrix multiplication defined in (18) are as follows. • (q =1) In =1) In this case, B reduces to an n-dimensional vector and ak
be the k -th column of A
(m×n)
(m×1)
A
b
(m×n)(n×1) (m×1)
=
∑ ∑ ∑
b . Let bk be the k-th element of b
(n×1)
(k = 1, 2, . . . , n). Then
n k=1 a 1k bk n k=1 a 2k bk
.. .
n k=1 a mk bk
= a 1 b1 + a2 b2 + · · · + an bn .
(30)
That is, Ab is, Ab is a linear combination of the columns of A of A with weights given by the elements of b. 4
• (m=1, q =1) =1) This This is a special special case of the above above equatio equation n (30). (30). Since Since m = 1, A reduces to an ′ n-dimensional row vector, denoted a = (a1 , a2 , . . . , an ). Therefore, (30) reduces to (1×n)
′
a b = [a1 , a2 , . . . , an ]
(1×1)
b1 b2
.. .
bn
� n
=
ak bk = a 1 b1 + a 2 b2 + · · · + an bn .
(31)
k=1
This means that the inner product can be expressed as a matrix multiplication: the inner product of two n-dimensional vectors a vectors a and b equals a equals a ′ b. • (n=1) In =1) In this case, A case, A reduces to
a
b′ =
(m×1) (1×q)
a
(m×1)
a1 a2
.. .
am
and B and B reduces to b′ . Setting n = 1 in (18), we obtain (1×q )
[b1 , b2 , . . . , bq ] =
a1 b 1 a2 b 1
a1 b2 a2 b2
am b1
am b2
.. .
··· ···
a1 bq a2 bq
. .. ··· . · · · am bq
.. .
(32)
Linear Dependence
• (Definition) Let a_1, a_2, . . . , a_n be a set of m-dimensional column vectors and let A (m × n) be the matrix holding those n vectors. The vectors are said to be linearly dependent if there exists an n-dimensional non-zero vector x = [x_k] such that the linear combination of those vectors with x as weights is a zero vector:

A x = [a_1, a_2, \ldots, a_n] \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \sum_{k=1}^{n} a_{1k} x_k \\ \sum_{k=1}^{n} a_{2k} x_k \\ \vdots \\ \sum_{k=1}^{n} a_{mk} x_k \end{bmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = 0   (m × 1).   (33)

Written out in full, this condition can be expressed as a system of m equations in n unknowns with x being the vector of unknowns:

a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = 0,
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = 0,
  \vdots
a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = 0.   (34)

Therefore, the vectors (a_1, a_2, . . . , a_n) are linearly dependent if and only if this system of equations has a non-zero solution x. If no such vector exists, then we say that the m-dimensional vectors, n in number, are linearly independent. We can define linear independence and dependence for a set of row vectors: a set of row vectors (a_1′, a_2′, . . . , a_n′) are linearly dependent (linearly independent) if the corresponding column vectors (a_1, a_2, . . . , a_n) are linearly dependent (linearly independent).
• (A Necessary Condition for Linear Independence) If m < n in the system of equations (34), that is, if there are more unknowns than equations, then we can always find a non-zero vector x satisfying those m equations (it is easy to show this), so the vectors are linearly dependent. Put differently, if the m-dimensional column vectors, n in number, comprising A (m × n) are linearly independent, then m ≥ n.
Rank of a Matrix
• (Definition) The rank of a matrix A, denoted rank(A), is the maximum number of column vectors comprising A that are linearly independent. It can be shown that

the maximum number of linearly independent rows in a matrix is the same as the maximum number of linearly independent columns.

Therefore, the rank of a matrix equals the maximum number of linearly independent rows as well:

rank(A) = rank(A′).   (35)

• (Rank and Size of a Matrix) Let A be m × n. It has n column vectors, so the rank of the matrix is at most n. It has m row vectors, so the rank is at most m. Thus we have shown

rank(A) ≤ min(m, n)   for A (m × n).   (36)

For example, the rank of a 3 × 5 matrix is at most 3 and the rank of a 5 × 2 matrix is at most 2. If the rank of a matrix equals the number of columns (i.e., if rank(A) = n for A m × n), then the matrix is said to be of full column rank. If the rank of a matrix equals the number of rows, then the matrix is said to be of full row rank. Therefore, if a matrix is of full column rank, then the number of rows must be greater than or equal to the number of columns. If a matrix is of full row rank, then the number of columns must be greater than or equal to the number of rows. For a square matrix of order n (n × n), we say that the square matrix is of full rank if its rank equals n.
• (Some Useful Results about the Rank) The following results can be proved.
(a) rank(AB) ≤ min[rank(A), rank(B)].
(b) Let B be a square matrix of full rank. Then rank(AB) = rank(A) and rank(BC) = rank(C), provided, of course, that A and B are conformable and B and C are conformable. That is, multiplication by a full-rank matrix does not change rank.
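As a quick numerical illustration (not part of the original notes), NumPy's matrix_rank can be used to confirm (35) and (36) on an arbitrary example:

```python
import numpy as np

A = np.array([[1., 2., 3., 4., 5.],
              [2., 4., 6., 8., 10.],   # a multiple of the first row
              [0., 1., 0., 1., 0.]])   # 3 x 5

r = np.linalg.matrix_rank(A)
assert r <= min(A.shape)                  # (36): rank <= min(m, n)
assert r == np.linalg.matrix_rank(A.T)    # (35): rank(A) = rank(A')
print(r)                                  # 2: the second row is redundant
```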
Inverse of a Matrix
Inverses can be defined, if at all, only for square matrices.
• (Definition) Let A = [a_{ij}] be a square matrix of order n (so its size is n × n). The inverse of A, denoted A^{−1} = [a^{ij}], is a square matrix of the same order satisfying the condition:

A A^{−1} = A^{−1} A = I.   (37)

Therefore, the inverse matrix [a^{ij}] satisfies the following conditions:

\sum_{k=1}^{n} a_{ik} a^{kj} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j, \end{cases}
\quad\text{and}\quad
\sum_{k=1}^{n} a^{ik} a_{kj} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}   (38)

A square matrix may or may not have an inverse (below we will give a necessary and sufficient condition for a matrix to have an inverse). However, it is easy to see that if an inverse of a square matrix exists, it is unique. A matrix whose inverse exists (and is therefore unique) is said to be invertible.
• (Some Basic Properties of Inverses) Suppose A (n × n) has the inverse A^{−1}. It immediately follows from the definition that:
(a) (AB)^{−1} = B^{−1} A^{−1}, provided A and B are of the same order and invertible.
(b) (A′)^{−1} = (A^{−1})′, provided A is invertible.
(c) If A is symmetric, so is its inverse (if the inverse exists).
(d) (αA)^{−1} = α^{−1} A^{−1}, provided A is invertible and α ≠ 0.
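A quick NumPy check of properties (a) and (b), on arbitrary invertible matrices (illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))   # generic random matrices are invertible with probability one

assert np.allclose(np.linalg.inv(A @ B),
                   np.linalg.inv(B) @ np.linalg.inv(A))    # (AB)^{-1} = B^{-1} A^{-1}
assert np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T) # (A')^{-1} = (A^{-1})'
```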
Determinants
As in the definition of inverses, consider a square matrix A of order n.
• (Definition) Unlike the inverse, the determinant of A, denoted |A| or det(A), can be defined for any square matrix. For n = 1, the determinant of a 1 × 1 matrix is the element itself (the determinant of a scalar should not be confused with the absolute value of the element). For n = 2 on, the definition proceeds recursively.
(a) For n = 2, the determinant is given by the following scalar:

|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} ≡ a_{11} a_{22} − a_{12} a_{21}.   (39)

(b) For n = 3, 4, . . . , the determinant of an n × n matrix is defined recursively. Suppose the determinant has been defined for (n − 1) × (n − 1) matrices. Define the cofactor of the (i, j) element of an n × n matrix A by

c_{ij} ≡ (−1)^{i+j} |A_{ij}|,   (40)

where A_{ij} denotes the (n − 1) × (n − 1) submatrix formed by deleting row i and column j from A. The determinant of the n × n matrix A is given by

|A| ≡ \sum_{j=1}^{n} a_{1j} c_{1j} = \sum_{j=1}^{n} (−1)^{1+j} a_{1j} |A_{1j}|.   (41)

For example, the determinant of a 3 × 3 matrix is

\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix}
= a_{11} \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix}
− a_{12} \begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix}
+ a_{13} \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}.   (42)

A square matrix whose determinant is not zero is called non-singular. A square matrix whose determinant is zero is called singular.
• (Calculating Inverses) Consider a square matrix A of order n. The adjoint of A is the n × n matrix whose (i, j) element is c_{ji} (not c_{ij}), the cofactor of the (j, i) element of A. Suppose that the matrix A is non-singular (so |A| ≠ 0). Then its inverse exists and can be calculated by the formula:

A^{−1} = \frac{1}{|A|} × \text{(adjoint of A)}   (43)
       = \frac{1}{|A|} \left[ (−1)^{j+i} |A_{ji}| \right].   (44)

(Here, since |A| ≠ 0, division by |A| is well-defined.) For example, for n = 2,

\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{−1} = \frac{1}{a_{11} a_{22} − a_{12} a_{21}} \begin{bmatrix} a_{22} & −a_{12} \\ −a_{21} & a_{11} \end{bmatrix}.

It follows, then, that a matrix is invertible if it is non-singular (below we will see that the converse is true as well).
• (Properties of Determinants) It can be shown that the following useful results hold for square matrices.
(a) In the definition above, the determinant is defined in reference to the first row of A (see (41)). It can be shown that |A| can be defined in reference to any row i of A:

|A| = \sum_{j=1}^{n} (−1)^{i+j} a_{ij} |A_{ij}|.   (45)
(b) For a lower or upper triangular matrix, the determinant is the product of the diagonal elements. In particular, the determinant of an identity matrix of any order is 1.
(c) |αA| = α^n |A| for an n × n matrix A and a scalar α.
(d) |A′| = |A|.
(e) |AB| = |A| |B|, provided square matrices A and B are of the same order.
(f) A is invertible (i.e., A has an inverse) if and only if A is non-singular (i.e., |A| ≠ 0). Also,

|A^{−1}| = \frac{1}{|A|}.   (46)

[Proof: Given (43), what needs to be shown is that A is non-singular if it is invertible. Set B = A^{−1} in (e) above.]
(g) A square matrix A is of full rank (i.e., the column vectors of the matrix are linearly independent) if and only if |A| ≠ 0 (i.e., if and only if A is non-singular). These last two results can be stated succinctly as:

A (n × n) is of full rank ⇐⇒ A is invertible ⇐⇒ A is non-singular.   (47)
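These properties are easy to verify numerically; here is a small NumPy sketch (random matrices used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))
alpha = 2.5

assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))                        # (d)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))   # (e)
assert np.isclose(np.linalg.det(alpha * A), alpha**4 * np.linalg.det(A))       # (c), n = 4
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A))       # (46)
```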
Definite and Semi-definite Matrices
• (Definition) Consider a square matrix A of order n and let x be an n-dimensional vector. The quadratic form associated with A is x′Ax. Quadratic forms are usually defined for symmetric matrices. For example, for n = 2,

A = \begin{bmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad
x′Ax ≡ a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2.   (48)

We say that a symmetric square matrix A is
(a) positive definite if x′Ax > 0 for all x ≠ 0,
(b) positive semi-definite (or nonnegative definite) if x′Ax ≥ 0 for all x,
(c) negative definite if −A is positive definite, i.e., if x′Ax < 0 for all x ≠ 0,
(d) negative semi-definite (or nonpositive definite) if −A is positive semi-definite, i.e., if x′Ax ≤ 0 for all x.
• (Singularity and Definiteness) If a square matrix A is singular (or, equivalently, if A is not of full rank), then there exists a vector x ≠ 0 such that Ax = 0, so x′Ax = 0, meaning that A is neither positive definite nor negative definite. Thus we have shown:

If a symmetric square matrix A is positive or negative definite, then it is non-singular.   (49)
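In practice, a symmetric matrix can be checked for positive definiteness by looking at its eigenvalues (all strictly positive) or by attempting a Cholesky factorization. A minimal sketch (the helper name is mine):

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Check positive definiteness of a symmetric matrix via its eigenvalues."""
    A = np.asarray(A)
    if not np.allclose(A, A.T):
        raise ValueError("A must be symmetric")
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

# A Gram matrix X'X with X of full column rank is positive definite.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
assert is_positive_definite(X.T @ X)

# np.linalg.cholesky succeeds exactly when the matrix is positive definite.
np.linalg.cholesky(X.T @ X)   # would raise LinAlgError otherwise
```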
Trace of a Matrix
• (Definition) The trace of a square matrix A of order n (n × n) is defined as the sum of its diagonal elements:

trace(A) ≡ a_{11} + a_{22} + \cdots + a_{nn}.   (50)

• (Properties of Trace) It immediately follows from the definition that:
(a) trace(A + B) = trace(A) + trace(B), provided A and B are both n × n matrices.
(b) trace(AB) = trace(BA), where A is m × n and B is n × m.
(c) trace(αA) = α · trace(A), where α is a scalar and A is a square matrix.
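Property (b) is the one used repeatedly later (for instance when rewriting the VAR log likelihood); a one-line check:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(2, 5))
B = rng.normal(size=(5, 2))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))   # trace(AB) = trace(BA)
```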
GRIPS, Spring 2016
ML Estimation of VARs
• Covers Hamilton's Chapters 5 and 11, and Hayashi's Section 8.7.
• You don't have to read those chapters and sections.
• The true parameter value is indicated by "sub 0". The sample size is T.

1. Review: ML in General

The Likelihood Function
• A sample or data of size T is a realization of a stochastic process up to T.
• Suppose we know that the joint frequency or density function of the sample (z_1, z_2, . . . , z_T) is f(z_1, z_2, . . . , z_T; θ_0), where f(·, ·) is a known function.
• Viewed as a function of θ, f(z_1, z_2, . . . , z_T; θ) is called the likelihood function.
• The log likelihood is L(θ) ≡ log f(z_1, z_2, . . . , z_T; θ).
• The ML estimator, θ̂_T, of θ_0 is the θ that maximizes the log likelihood.
Example 1: i.i.d. Normal
• Example: {z_t} i.i.d., N(μ_0, σ_0^2). θ = (μ, σ^2)′.
• The joint density of (z_1, z_2, . . . , z_T) is

f(z_1, z_2, \ldots, z_T; θ_0) = \prod_{t=1}^{T} \left( \sqrt{2\pi\sigma_0^2} \right)^{-1} \exp\left( -\frac{1}{2\sigma_0^2}(z_t - \mu_0)^2 \right).

• So the log likelihood is

L(θ) = \sum_{t=1}^{T} \left[ \log\left( \left( \sqrt{2\pi\sigma^2} \right)^{-1} \right) - \frac{1}{2\sigma^2}(z_t - \mu)^2 \right]
     = \text{const.} - \sum_{t=1}^{T} \left[ \frac{1}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}(z_t - \mu)^2 \right]
     = \text{const.} - \sum_{t=1}^{T} \left[ \frac{1}{2}\log(v) + \frac{1}{2v}(z_t - \mu)^2 \right]   (with v ≡ σ^2).

• 1st and 2nd derivatives of L(θ):

\frac{\partial L(θ)}{\partial θ} =
\begin{bmatrix} \frac{1}{v} \sum_{t=1}^{T} (z_t - \mu) \\[4pt] -\frac{T}{2v} + \frac{1}{2v^2} \sum_{t=1}^{T} (z_t - \mu)^2 \end{bmatrix},
\qquad
\frac{\partial^2 L(θ)}{\partial θ\, \partial θ'} =
\begin{bmatrix} -\frac{T}{v} & -\frac{1}{v^2} \sum_{t=1}^{T} (z_t - \mu) \\[4pt] -\frac{1}{v^2} \sum_{t=1}^{T} (z_t - \mu) & \frac{T}{2v^2} - \frac{1}{v^3} \sum_{t=1}^{T} (z_t - \mu)^2 \end{bmatrix}.
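A small numerical sketch of Example 1 (illustration only, not part of the original notes): it codes the analytic score and Hessian above and confirms that the ML estimates are the sample mean and the uncorrected sample variance (the score is zero there).

```python
import numpy as np

rng = np.random.default_rng(0)
T, mu0, sig2_0 = 500, 1.0, 2.0
z = rng.normal(mu0, np.sqrt(sig2_0), size=T)   # simulated i.i.d. normal sample

def score(mu, v):
    """First derivatives of L(theta) with respect to (mu, v)."""
    return np.array([np.sum(z - mu) / v,
                     -T / (2 * v) + np.sum((z - mu) ** 2) / (2 * v**2)])

def hessian(mu, v):
    """Second derivatives of L(theta)."""
    d, s = np.sum(z - mu), np.sum((z - mu) ** 2)
    return np.array([[-T / v,      -d / v**2],
                     [-d / v**2,    T / (2 * v**2) - s / v**3]])

# The ML estimates solve score = 0: the sample mean and SSR/T.
mu_hat, v_hat = z.mean(), ((z - z.mean()) ** 2).mean()
assert np.allclose(score(mu_hat, v_hat), 0.0, atol=1e-6)
```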
s(θ), H(θ), I(θ_0)
• L(θ): log likelihood (i.e., L(θ) ≡ log f(z_1, . . . , z_T; θ)). θ is K-dimensional.
• Definitions:

(score vector)         s(θ) ≡ \frac{\partial L(θ)}{\partial θ},
(Hessian)              H(θ) ≡ \frac{\partial^2 L(θ)}{\partial θ\, \partial θ'},
(information matrix)   I(θ_0) ≡ E[\, s(θ_0)\, s(θ_0)' \,].

• It can be shown that, under a weak set of conditions,

E[ s(θ_0) ] = 0   and   I(θ_0) = −E[ H(θ_0) ]   (information matrix equality).

• For Example 1, the info matrix is

I(θ_0) = \begin{bmatrix} \frac{T}{\sigma_0^2} & 0 \\ 0 & \frac{T}{2\sigma_0^4} \end{bmatrix}.
Avar(θ̂_T)
• Under a certain set of conditions, the ML estimator θ̂_T is CAN, i.e.,

\sqrt{T}\,(θ̂_T − θ_0) \xrightarrow{d} N(0, \text{Avar}(θ̂_T))   with   \text{Avar}(θ̂_T) = \left[ \lim_{T\to\infty} \frac{1}{T} I(θ_0) \right]^{-1}.

That "certain set of conditions" usually includes ergodic stationarity for {z_t}.
• Also, under those conditions, the ML estimator is the most "efficient". The above Avar is called the "asymptotic Cramer-Rao bound".
• For most cases, a consistent estimator of Avar(θ̂_T) is

\widehat{\text{Avar}}(θ̂_T) = \left[ -\frac{1}{T} H(θ̂_T) \right]^{-1} = −T × H(θ̂_T)^{-1}.

• So the standard error for the k-th element of the estimator is the square root of the (k, k) element of −H(θ̂_T)^{-1}. That is,

\frac{θ̂_{kT} − θ_{k0}}{\sqrt{\frac{1}{T} × (k,k)\text{ element of } \widehat{\text{Avar}}(θ̂_T)}}
= \frac{θ̂_{kT} − θ_{k0}}{\sqrt{(k,k)\text{ element of } −H(θ̂_T)^{-1}}} \xrightarrow{d} N(0, 1).
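Continuing the i.i.d. normal illustration, standard errors can be read off the inverse of the negative Hessian at the ML estimate. A self-contained sketch (simulated data; for this model the result reduces to the familiar closed forms):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
z = rng.normal(1.0, np.sqrt(2.0), size=T)
mu_hat, v_hat = z.mean(), ((z - z.mean()) ** 2).mean()   # ML estimates

# Hessian of the log likelihood evaluated at the ML estimate
d = np.sum(z - mu_hat)                  # = 0 at the ML estimate
s = np.sum((z - mu_hat) ** 2)           # = T * v_hat
H = np.array([[-T / v_hat,                  -d / v_hat**2],
              [-d / v_hat**2,  T / (2 * v_hat**2) - s / v_hat**3]])

# Standard errors: square roots of the diagonal of -H^{-1}
se_mu, se_v = np.sqrt(np.diag(-np.linalg.inv(H)))
assert np.isclose(se_mu, np.sqrt(v_hat / T))          # familiar sqrt(sigma^2 / T)
assert np.isclose(se_v, np.sqrt(2 * v_hat**2 / T))    # sqrt(2 sigma^4 / T)
```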
2. Special Case: IID Observations with Covariates

The Likelihood Function, s(θ), H(θ)
• Assume: z_t = (y_t, x_t), {z_t} i.i.d. The joint density of (y_t, x_t) is parameterized as

f(y_t, x_t; θ, ψ) = f(y_t | x_t; θ) × f(x_t; ψ),   θ and ψ not related.

• The likelihood function f(y_1, x_1, . . . , y_T, x_T; θ, ψ) can be written as

f(y_1, x_1, \ldots, y_T, x_T; θ, ψ) = \prod_{t=1}^{T} f(y_t, x_t; θ, ψ)   (because of independence)
= \prod_{t=1}^{T} \left\{ f(y_t | x_t; θ) × f(x_t; ψ) \right\}
= \prod_{t=1}^{T} f(y_t | x_t; θ) × \prod_{t=1}^{T} f(x_t; ψ).

• So the log likelihood L(θ, ψ) can be written as

L(θ, ψ) = \sum_{t=1}^{T} \log f(y_t | x_t; θ) + \sum_{t=1}^{T} \log f(x_t; ψ),   (*)

where the first term on the RHS is denoted L(θ).
• If (θ̂, ψ̂) maximizes L(θ, ψ), then θ̂ maximizes L(θ). L(θ) is called the conditional log likelihood.
• s(θ), H(θ) for the conditional log likelihood:

s(θ) = \sum_{t=1}^{T} s_t(θ),   H(θ) = \sum_{t=1}^{T} H_t(θ),
where   s_t(θ) ≡ \frac{\partial \log f(y_t | x_t; θ)}{\partial θ},   H_t(θ) ≡ \frac{\partial^2 \log f(y_t | x_t; θ)}{\partial θ\, \partial θ'}.
Avar(θ̂_T)
• Recall from the general case: Avar(θ̂_T) = [ lim_{T→∞} (1/T) I(θ_0) ]^{-1}.
• I(θ_0):

I(θ_0) = −E[ H(θ_0) ]   (by the info matrix equality)
       = −E\left[ \sum_{t=1}^{T} H_t(θ_0) \right] = −\sum_{t=1}^{T} E[ H_t(θ_0) ] = −T\, E[ H_t(θ_0) ]   (since z_t ≡ (y_t, x_t) is identically distributed).

• So

\lim_{T\to\infty} \frac{1}{T} I(θ_0) = −E[ H_t(θ_0) ].

• Replace the population mean by the sample mean, and replace θ_0 by θ̂_T:
  – Since {H_t(θ_0)} is i.i.d., we have (1/T) H(θ_0) = (1/T) \sum_{t=1}^{T} H_t(θ_0) →_p E[ H_t(θ_0) ] by the basic LLN.
  – Under a set of conditions, (1/T) H(θ̂_T) = (1/T) \sum_{t=1}^{T} H_t(θ̂_T) →_p E[ H_t(θ_0) ] by what's called the "uniform LLN".
• So it's OK to set \widehat{Avar}(θ̂_T) to

\left[ -\frac{1}{T} H(θ̂_T) \right]^{-1} = \left[ -\frac{1}{T} \sum_{t=1}^{T} H_t(θ̂_T) \right]^{-1}   after all.
Another Expression for Avar(θ̂_T)
• Recall: I(θ_0) ≡ E[ s(θ_0) s(θ_0)' ]. We can show

E[ s(θ_0) s(θ_0)' ] = T\, E[ s_t(θ_0) s_t(θ_0)' ].

• Proof:

E[ s(θ_0) s(θ_0)' ] = Var( s(θ_0) )   (b/c E[s(θ_0)] = 0)
= Var\left( \sum_{t=1}^{T} s_t(θ_0) \right)   (b/c s(θ) = \sum_{t=1}^{T} s_t(θ))
= \sum_{t=1}^{T} Var( s_t(θ_0) )   (b/c {s_t(θ_0)} is independent, hence serially uncorrelated)
= \sum_{t=1}^{T} E[ s_t(θ_0) s_t(θ_0)' ]   (b/c (it can be shown that) E[s_t(θ_0)] = 0)
= T\, E[ s_t(θ_0) s_t(θ_0)' ]   (b/c s_t(θ_0) is identically distributed).

• So, \lim_{T\to\infty} \frac{1}{T} I(θ_0) = E[ s_t(θ_0) s_t(θ_0)' ]. Another choice for \widehat{Avar}(θ̂_T) is

\left[ \frac{1}{T} \sum_{t=1}^{T} s_t(θ̂_T) s_t(θ̂_T)' \right]^{-1}.
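The two consistent estimators of Avar — one based on the Hessian, the other on the outer product of the per-observation scores — can be compared numerically. A minimal sketch for the i.i.d. normal model of Example 1 (simulated data, illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
z = rng.normal(0.5, 1.5, size=T)
mu, v = z.mean(), ((z - z.mean()) ** 2).mean()     # ML estimates of (mu, sigma^2)

# Per-observation scores s_t(theta_hat)
u = z - mu
s_t = np.column_stack([u / v, -1 / (2 * v) + u**2 / (2 * v**2)])   # T x 2

# Outer-product-of-gradients estimate of lim (1/T) I(theta_0)
opg = s_t.T @ s_t / T

# Hessian-based estimate of the same limit, -(1/T) H(theta_hat)
H = np.array([[-T / v, -u.sum() / v**2],
              [-u.sum() / v**2, T / (2 * v**2) - (u**2).sum() / v**3]])
hess_based = -H / T

# Both estimate the same limit, so the two Avar estimates should be close.
print(np.linalg.inv(opg))
print(np.linalg.inv(hess_based))
```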
Example 2: Linear Regression Model
• (to reproduce) L(θ) = \sum_{t=1}^{T} \log f(y_t | x_t; θ).
• The standard linear regression model is

y_t = x_t' β_0 + ε_t,   ε_t | x_t ∼ N(0, σ_0^2),   {y_t, x_t} i.i.d.,   θ ≡ (β', σ^2)',

where x_t is K × 1.
• y_t | x_t ∼ N(x_t' β_0, σ_0^2), so

f(y_t | x_t; θ) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2}(y_t - x_t'β)^2 \right),
\log f(y_t | x_t; θ) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y_t - x_t'β)^2,
L(θ) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - x_t'β)^2.

• Let (β̂, σ̂^2) be the ML estimator.
  – β̂ is OLS: β̂ = \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \sum_{t=1}^{T} x_t y_t.
  – σ̂^2 = (sum of squared residuals)/T.
• You should try to show the following:
  – What is \lim_{T\to\infty} \frac{1}{T} I(θ_0)?  Answer:

\begin{bmatrix} \frac{1}{\sigma_0^2} E(x_t x_t') & 0 \\ 0' & \frac{1}{2\sigma_0^4} \end{bmatrix}   (the upper-left block is K × K).

  – What is \widehat{Avar}(θ̂_T)?  Answer:

\begin{bmatrix} \frac{1}{\hat\sigma^2} \frac{1}{T} \sum_{t=1}^{T} x_t x_t' & 0 \\ 0' & \frac{1}{2\hat\sigma^4} \end{bmatrix}^{-1}
= \begin{bmatrix} \hat\sigma^2 \left( \frac{1}{T} \sum_{t=1}^{T} x_t x_t' \right)^{-1} & 0 \\ 0' & 2\hat\sigma^4 \end{bmatrix}.

  – What is the standard error for β̂_k?  Answer: \sqrt{ \hat\sigma^2 × (k,k)\text{ element of } (X'X)^{-1} }.
    (Note: X ≡ (x_1, . . . , x_T)' is T × K, so X'X = \sum_{t=1}^{T} x_t x_t'.)
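A short simulation sketch of Example 2 (hypothetical data; the variable names are mine): the conditional ML estimate of β coincides with OLS, σ̂² is SSR/T, and the standard error of β̂_k comes from σ̂²(X'X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 400, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])   # includes a constant
beta0, sigma0 = np.array([1.0, 2.0, -0.5]), 1.2
y = X @ beta0 + rng.normal(scale=sigma0, size=T)

# ML for the normal linear regression: OLS for beta, SSR/T for sigma^2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / T          # note: divide by T, not T - K

# Standard errors: sqrt of sigma2_hat times the diagonal of (X'X)^{-1}
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
print(beta_hat, se)
```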
3. ML for Serially Correlated Observations

Probability Theory Review: Sequential Factorization of Joint Density
• T + 1 random variables (y_0, y_1, . . . , y_t, . . . , y_T). y_t is a scalar. What is the joint density function f(y_0, y_1, . . . , y_t, . . . , y_T)?
• Recall: f(y_1, y_0) = f(y_1 | y_0) f(y_0). The joint is the conditional times the marginal.
• Similarly, f(y_2, y_1, y_0) = f(y_2 | y_1, y_0) f(y_1, y_0).
• Combine the two to obtain: f(y_2, y_1, y_0) = f(y_2 | y_1, y_0) f(y_1 | y_0) f(y_0).
• A pattern is set:

f(y_0, y_1, \ldots, y_t, \ldots, y_T) = \prod_{t=1}^{T} f(y_t | y_{t-1}, y_{t-2}, \ldots, y_0) × f(y_0).

• For now, consider the case where f(y_t | y_{t-1}, y_{t-2}, \ldots, y_0) = f(y_t | y_{t-1}) (i.e., {y_t} is 1st-order Markov), so

f(y_0, y_1, \ldots, y_t, \ldots, y_T) = \prod_{t=1}^{T} f(y_t | y_{t-1}) × f(y_0).
The Likelihood Function, s(θ), H(θ)
• The log likelihood function for this special case is

L(θ, ψ) = \sum_{t=1}^{T} \log f(y_t | y_{t-1}; θ) + \log f(y_0; ψ),   (**)

where the first term on the RHS is denoted L(θ). Compare this with (*).
• If the sample size T is large, the second term on the RHS is negligible. So the ML estimator of θ is essentially the θ that maximizes the y_0-conditional log likelihood L(θ). This is true even if θ and ψ are related.
• The log likelihood function is

L(θ) = \sum_{t=1}^{T} \log f(y_t | x_t; θ)   with x_t ≡ y_{t-1}.

The same (log) likelihood as in Section 2!
• The score and Hessian are the same as in Section 2:

s(θ) = \sum_{t=1}^{T} s_t(θ),   H(θ) = \sum_{t=1}^{T} H_t(θ),
where   s_t(θ) ≡ \frac{\partial \log f(y_t | x_t; θ)}{\partial θ},   H_t(θ) ≡ \frac{\partial^2 \log f(y_t | x_t; θ)}{\partial θ\, \partial θ'}.
s(θ_0) is mds
• About s_t(θ_0) — it can be shown:
  – {s_t(θ_0)} is a martingale difference sequence (mds). Here is the difference from Section 2.
  – The key is to show first that E( s_t(θ_0) | y_{t-1}, y_{t-2}, . . . ) = 0. Use the board to prove this (you don't have to prove this, but you will be surprised that you can actually do it if you give it a try).

Avar(θ̂_T)
• Under ergodic stationarity for {y_t}, the same as in Section 2, except:
  – Replace the "basic LLN" by the "ergodic theorem".
  – Replace "i.i.d." for s_t(θ_0) by "mds" for s_t(θ_0).
Example 3: AR(1) with normal errors: y_t = c_0 + φ_0 y_{t-1} + ε_t, {ε_t} i.i.d., N(0, σ_0^2)
• If |φ_0| < 1, then {y_t} is stationary and ergodic.
• Define x_t ≡ (1, y_{t-1})', β_0 ≡ (c_0, φ_0)'. Then
  – y_t = x_t' β_0 + ε_t,
  – y_t | x_t ∼ N(x_t' β_0, σ_0^2), so

\log f(y_t | x_t; θ) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y_t - x_t'β)^2,

  – θ = (β', σ^2)'.
• (to reproduce) L(θ) = \sum_{t=1}^{T} \log f(y_t | x_t; θ) with x_t ≡ y_{t-1}.
• f(y_t | 1, y_{t-1}) = f(y_t | y_{t-1}), so

L(θ) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - x_t'β)^2.

• The same as in Example 2!
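A minimal simulation sketch for Example 3 (data and names are hypothetical): the y_0-conditional ML estimate of (c, φ) is just OLS of y_t on (1, y_{t-1}), and σ̂² is SSR/T.

```python
import numpy as np

rng = np.random.default_rng(3)
T, c0, phi0, sigma0 = 1000, 0.5, 0.8, 1.0

# Simulate an AR(1): y_t = c0 + phi0 * y_{t-1} + eps_t
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = c0 + phi0 * y[t - 1] + rng.normal(scale=sigma0)

# Conditional ML = OLS of y_t on (1, y_{t-1}), t = 1, ..., T
X = np.column_stack([np.ones(T), y[:-1]])
yy = y[1:]
c_hat, phi_hat = np.linalg.solve(X.T @ X, X.T @ yy)
sigma2_hat = np.sum((yy - X @ [c_hat, phi_hat]) ** 2) / T
print(c_hat, phi_hat, sigma2_hat)
```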
Sequential Factorization with p Lags
• So far, we have assumed first-order Markov.
• Go back to the sequential factorization argument:

f(y_0, y_1, \ldots, y_t, \ldots, y_T) = \prod_{t=1}^{T} f(y_t | y_{t-1}, y_{t-2}, \ldots, y_0) × f(y_0).

• The same trick (sequential factorization) yields

f(y_{-p+1}, y_{-p+2}, \ldots, y_0, y_1, \ldots, y_t, \ldots, y_T) = \prod_{t=1}^{T} f(y_t | y_{t-1}, y_{t-2}, \ldots, y_0, \ldots, y_{-p+1}) × f(y_0, \ldots, y_{-p+1}).

(Just replace y_0 by (y_0, . . . , y_{-p+1}).)
• Assume {y_t} is p-th order Markov, so that f(y_t | y_{t-1}, y_{t-2}, \ldots, y_0, \ldots, y_{-p+1}) = f(y_t | y_{t-1}, y_{t-2}, \ldots, y_{t-p}).
• Define L(θ) ≡ \sum_{t=1}^{T} \log f(y_t | x_t; θ), with x_t ≡ (y_{t-1}, . . . , y_{t-p})', and define (as before)

s_t(θ) ≡ \frac{\partial \log f(y_t | x_t; θ)}{\partial θ},   H_t(θ) ≡ \frac{\partial^2 \log f(y_t | x_t; θ)}{\partial θ\, \partial θ'}.

• All the results about the 1st-order case carry over to the p-th order case, with the conditioning set "y_{t-1}" replaced by "y_{t-1}, . . . , y_{t-p}".
Example 4: AR(p) with normal errors, y_t = c_0 + φ_{0,1} y_{t-1} + · · · + φ_{0,p} y_{t-p} + ε_t, {ε_t} i.i.d., N(0, σ_0^2)
• Under the stationarity/stability condition on (φ_{0,1}, . . . , φ_{0,p}), the process {y_t} is ergodic stationary.
• Define x_t ≡ (1, y_{t-1}, . . . , y_{t-p})', β_0 ≡ (c_0, φ_{0,1}, . . . , φ_{0,p})'. Then
  – y_t = x_t' β_0 + ε_t,
  – y_t | x_t ∼ N(x_t' β_0, σ_0^2).
• The log likelihood is

L(θ) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - x_t'β)^2.

• The same as in Example 3!
• All the results carry over when {y } is a vector process. t
T
L(θ) = st (θ )
≡
�
log f (yt xt ; θ), xt
|
t=1
∂ log f (yt xt ; θ ) , Ht (θ ) ∂ θ
|
14
≡ (y ≡
t−1
, ..., yt− p ),
∂ 2 log f (yt xt ; θ ) . ∂ θ ∂ θ′
|
4. ML Estimation of VARs
This section deals with the ML estimation of the VAR(p):

y_t = c_0 + Φ_{0,1} y_{t-1} + \cdots + Φ_{0,p} y_{t-p} + ε_t,   {ε_t} i.i.d., N(0, Ω_0),

where y_t, c_0, ε_t are M × 1, each Φ_{0,j} is M × M, and Ω_0 is M × M.

The Likelihood Function
• Under the stationarity/stability condition on (Φ_{0,1}, . . . , Φ_{0,p}), the process {y_t} is ergodic stationary.
• Write the M equations as

y_t = Π_0' x_t + ε_t,   where   Π_0' ≡ [c_0, Φ_{0,1}, Φ_{0,2}, \cdots, Φ_{0,p}]   (M × (1 + Mp)),
x_t ≡ (1, y_{t-1}', y_{t-2}', \ldots, y_{t-p}')'   ((1 + Mp) × 1).

• y_t | x_t ∼ N(Π_0' x_t, Ω_0). So (with θ ≡ (Π, Ω))

\log f(y_t | x_t; θ) = -\frac{M}{2}\log(2\pi) - \frac{1}{2}\log(|Ω|) - \frac{1}{2}(y_t - Π'x_t)' Ω^{-1} (y_t - Π'x_t),
L(θ) = -\frac{MT}{2}\log(2\pi) + \frac{T}{2}\log(|Ω^{-1}|) - \frac{1}{2} \sum_{t=1}^{T} (y_t - Π'x_t)' Ω^{-1} (y_t - Π'x_t).

[Note: log(|Ω^{-1}|) = −log(|Ω|).]
• The last term can be written as

\frac{1}{2} \sum_{t=1}^{T} (y_t - Π'x_t)' Ω^{-1} (y_t - Π'x_t) = \frac{T}{2} \operatorname{trace}\left[ Ω^{-1} \hatΩ(Π) \right],
where   \hatΩ(Π) ≡ \frac{1}{T} \sum_{t=1}^{T} (y_t - Π'x_t)(y_t - Π'x_t)'   (M × M).

(Show on the board.)
• So the log likelihood can be rewritten as

L(Π, Ω) = -\frac{MT}{2}\log(2\pi) + \frac{T}{2}\log(|Ω^{-1}|) - \frac{T}{2}\operatorname{trace}\left[ Ω^{-1} \hatΩ(Π) \right].   (*)

The ML estimate of (Π_0, Ω_0) is the (Π, Ω) that maximizes this objective function. The parameter space for (Π_0, Ω_0) is {(Π, Ω) | Ω is symmetric and positive definite}.
ML Estimate is OLS
• Proceed in two steps.
(a) The first step is to maximize the objective function (*) with respect to Ω, taking Π as given. For this purpose, the following fact from matrix algebra is useful.

An Inequality Involving Trace and Determinant: Let A and B be two symmetric and positive definite matrices of the same size. Then the function
f(A) ≡ log(|A|) − trace(AB)
is maximized uniquely by A = B^{-1}.

This result, with A = Ω^{-1} and B = \hatΩ(Π), immediately implies that the objective function (*) is maximized uniquely by Ω = \hatΩ(Π), given Π. Substituting this Ω into (*) gives the concentrated log likelihood function (concentrated with respect to Ω):

L^*(Π) ≡ L(Π, \hatΩ(Π)) = -\frac{MT}{2}\log(2\pi) + \frac{T}{2}\log(|\hatΩ(Π)^{-1}|) - \frac{T}{2}\operatorname{trace}\left[ \hatΩ(Π)^{-1} \hatΩ(Π) \right]
= -\frac{MT}{2}\log(2\pi) - \frac{T}{2}\log(|\hatΩ(Π)|) - \frac{MT}{2}.

(b) The ML estimator of Π_0 should therefore minimize

\left| \hatΩ(Π) \right| = \left| \frac{1}{T} \sum_{t=1}^{T} (y_t - Π'x_t)(y_t - Π'x_t)' \right|.

This is minimized by the OLS estimator Π̂ (K × M, with K ≡ 1 + Mp) given by

Π̂ = \left( \frac{1}{T} \sum_{t=1}^{T} x_t x_t' \right)^{-1} \frac{1}{T} \sum_{t=1}^{T} x_t y_t'.

(For a proof, see Analytical Exercise 1 to Chapter 8 of Hayashi.) The ML estimator of Ω_0 is \hatΩ(Π̂).
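A compact simulation sketch of the result (hypothetical VAR(1) data; the names are mine): Π̂ is obtained equation by equation as OLS of y_t on x_t, and Ω̂ is the average of the residual outer products.

```python
import numpy as np

rng = np.random.default_rng(4)
M, T = 2, 500
Phi1 = np.array([[0.5, 0.1],
                 [0.0, 0.3]])
c = np.array([1.0, -0.5])
Omega0 = np.array([[1.0, 0.3],
                   [0.3, 0.5]])

# Simulate a stationary VAR(1)
y = np.zeros((T + 1, M))
chol = np.linalg.cholesky(Omega0)
for t in range(1, T + 1):
    y[t] = c + Phi1 @ y[t - 1] + chol @ rng.normal(size=M)

# Build x_t = (1, y_{t-1}')' and stack the T observations
X = np.column_stack([np.ones(T), y[:-1]])        # T x (1 + M p)
Y = y[1:]                                        # T x M

# Pi_hat: OLS equation by equation; Omega_hat: average residual outer product
Pi_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # (1 + M p) x M
resid = Y - X @ Pi_hat
Omega_hat = resid.T @ resid / T
print(Pi_hat.T)     # rows: [c_hat, Phi1_hat] for each equation
print(Omega_hat)
```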
Revised 2002/10/10 F. Hayashi
Econometrics
Mixing Linear Algebra with Calculus and Probability Theory

Gradients and Hessians
• (Gradients) Let f : R^K → R. That is, associated with each x ∈ R^K is a real number f(x). Let ∂f(x)/∂x_j be the partial derivative of f(x) with respect to the j-th argument x_j. The gradient of f evaluated at x ∈ R^K, denoted ∂f(x)/∂x or Df(x), is a K-dimensional vector whose j-th element is ∂f(x)/∂x_j. Our convention is that, unless otherwise indicated, vectors are column vectors. Thus the gradient vector can be written as

\frac{\partial f(x)}{\partial x} ≡ Df(x) ≡ \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_K} \end{bmatrix}   (K × 1).   (1)

Its transpose is denoted ∂f(x)/∂x′:

\frac{\partial f(x)}{\partial x'} ≡ \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} & \frac{\partial f(x)}{\partial x_2} & \cdots & \frac{\partial f(x)}{\partial x_K} \end{bmatrix}   (1 × K).   (2)

The transposition in the denominator of ∂f(x)/∂x′ (indicated by "′") signals that the vector is a row vector.
• (Critical Points) Let f : R^K → R. A point x* ∈ R^K is said to be a critical point of f if ∂f(x*)/∂x = 0. Critical points may or may not exist. There may be more than one critical point if a critical point exists at all.
• (Hessians) Let f : R^K → R. The Hessian of f(x) is the K × K matrix

\frac{\partial^2 f(x)}{\partial x\, \partial x'} ≡ D^2 f(x) ≡
\begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_K} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_K} \\
\vdots & & \vdots \\
\frac{\partial^2 f(x)}{\partial x_K \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_K \partial x_K}
\end{bmatrix},   (3)

where ∂²f(x)/∂x_i ∂x_j is the second partial derivative of f(x) with respect to x_i and x_j. If the function f is twice continuously differentiable, then ∂²f(x)/∂x_i ∂x_j = ∂²f(x)/∂x_j ∂x_i, so the Hessian is symmetric. If K = 1, the Hessian of f reduces to the second derivative of f.
• (Gradients and Hessians of Linear Functions) A linear mapping from R^K to R can be represented as f(x) ≡ a′x, where a and x are K × 1. For example, for K = 2,

a′x = a_1 x_1 + a_2 x_2,  where a = (a_1, a_2)′ and x = (x_1, x_2)′.

It is easy to show from the definition that the gradient and the Hessian of a′x are

\frac{\partial (a'x)}{\partial x} = a   (K × 1)   and   \frac{\partial^2 (a'x)}{\partial x\, \partial x'} = 0   (K × K).   (4)

Since a′x = x′a (this is because the transpose of a scalar is the scalar itself), we also have:

\frac{\partial (x'a)}{\partial x} = a   and   \frac{\partial^2 (x'a)}{\partial x\, \partial x'} = 0.   (5)
x
(K ×1)
(5)
be K × 1 and
A
(K ×K )
be
a K × K square (but not necessarily symmetric) matrix. Consider a quadratic function f (x) ≡ x Ax. It is a mapping from RK to R. For example for K = 2, x
Ax = a 11 x21 + a12 x1 x2 + a21 x2 x1 + a22 x22 ,
where
a A = 11 a21
a12 a22
and x =
x1 . x2
It is easy to show from the definition that the gradient and the Hessian are ∂ (x Ax) = (A + A )x ∂ x
and
∂ 2 (x Ax) = A + A , ∂ x∂ x
(6)
• (Quadratic Forms) If the matrix A in x Ax is a symmetric matrix, the function is called
the quadratic form associated with A. Setting A = A in (6), we have, for a quadratic form, ∂ (x Ax) = 2Ax ∂ x
and
∂ 2 (x Ax) = 2A, ∂ x∂ x
(7)
Concave and Convex Functions • (Concavity and Strict Concavity) Consider a function f :
R
K
→
R.
The function f is
said to be concave if for all x , y ∈ RK and for all t ∈ [0, 1].
f (tx + (1 − t)y) ≥ tf (x) + (1 − t)f (y)
(8)
The function is said to be strictly concave if f (tx + (1 − t)y) > tf (x) + (1 − t)f (y)
for all x , y ∈ RK , x = y and for all t ∈ (0, 1). (9)
• (Convexity and Strict Convexity) A function f : RK → R is said to be convex (strictly convex) if − f is concave (strictly concave). Therefore, the function is convex if f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y)
for all x , y ∈ RK and for all t ∈ [0, 1],
(10)
and strictly convex if f (tx + (1 − t)y) < tf (x) + (1 − t)f (y)
for all x , y ∈ RK , x = y and for all t ∈ (0, 1). (11) 2
• Proposition 1 (Concave/Convex Functions and Hessians): Consider a function f : K R → R. A necessary and sufficient condition for the function to be concave is that its Hessian be negative semi-definite for all x ∈ RK . A necessary and sufficient condition for the function to be convex is that its Hessian be positive semi-definite for all x ∈ RK . • Proposition 2 (Sufficiency for Strict Concavity/Convexity): Consider a function f : RK → R. A sufficient (but not necessary) condition for the function to be strictly concave is that its Hessian be negative definite for all x ∈ RK .1 A sufficient (but not necessary)
condition for the function to be strictly convex is that its Hessian be positive definite for all x ∈ RK . • Proposition 3 (Global Maximum/Minimum for Concave/Convex Functions): Consider a function f : RK → R. Suppose that there exists a critical point x ∗ (so ∂f ∂ (xx ) = 0). ∗
(a) If the function is concave, then x∗ is a global maximizer of f (that is, f (x∗ ) ≥ f (x) for all x ∈ RK ). If, furthermore, the function is strictly concave, then x∗ is the unique critical point (that is, there is only one x∗ that satisfies ∂f ∂ (xx ) = 0) and is the unique global maximizer (that is, f (x∗ > f (x) for all x = x ∗ ). ∗
(b) If the function is convex, then x∗ is a global minimizer. If, furthermore, the function is strictly convex, then x∗ is the unique critical point and is the unique global minimizer.
Minimizing/Maximizing Quadratic Functions • (The Quadratic Objective Function) As an application, consider the optimization prob-
lem of maximizing (or minimizing as the case may be) a quadratic function: f (x) ≡ − b
x
(1×K )(K ×1)
+
1 x A x , 2 (1×K )(K ×K )(K ×1)
(12)
where A is a symmetric matrix. • (The Critical Point) We know from calculus that the FOCs (first-order conditions) for the (x) optimization problem are that ∂f = 0 for all i = 1, 2, . . . , K. Using (4) and (7), those K ∂x i
conditions can be written as
−b + Ax =
0 .
(K ×1)
(13)
This is a system of K equations in K unknowns (x1 , x2 , . . . , xK ). We know from matrix algebra that if A is non-singular, then A has an inverse and the system has a unique solution: x∗ = A −1 b.
(14)
Thus a critical point exists and is unique (only one critical point) if A is non-singular. • Suppose A is negative definite, not just non-singular. Then the quadratic function f (x) in
(12) is strictly concave, according to Proposition 2 above. Thus, by Proposition 3, x ∗ = A−1 b is the unique global maximizer. If A is positive definite, then the quadratic function is strictly convex and x ∗ = A −1 b is the unique global minimizer. 1
The condition is only necessary but not sufficient. Here is an example where a function is strictly concave yet its Hessian is not negative definite: Let K = 1 and consider a function f (x) = −x4 . This function is strictly concave but its Hessian (the second derivative) at x = 0 is zero.
3
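A quick numerical check of (13)–(14) (arbitrary positive definite A, so x* = A⁻¹b is the global minimizer; illustration only):

```python
import numpy as np

rng = np.random.default_rng(6)
K = 3
M = rng.normal(size=(K, K))
A = M @ M.T + K * np.eye(K)           # symmetric positive definite
b = rng.normal(size=K)

f = lambda x: -b @ x + 0.5 * x @ A @ x
x_star = np.linalg.solve(A, b)        # (14): the unique critical point

assert np.allclose(-b + A @ x_star, 0.0)           # FOC (13) holds
trial = rng.normal(size=(1000, K))                 # random trial points
assert all(f(x_star) <= f(x) for x in trial)       # x* is the global minimizer
```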
Jacobian Derivatives
• (Jacobian Derivatives and Jacobians) For each j (j = 1, 2, . . . , N), let f_j : R^K → R and define

f(x) ≡ \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_N(x) \end{bmatrix}   (N × 1).   (15)

Thus we have f : R^K → R^N. The N × K matrix denoted by Df(x) or ∂f(x)/∂x′,

Df(x) ≡ \frac{\partial f(x)}{\partial x'} ≡
\begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_K} \\
\vdots & & \vdots \\
\frac{\partial f_N(x)}{\partial x_1} & \cdots & \frac{\partial f_N(x)}{\partial x_K}
\end{bmatrix}
= \begin{bmatrix} \frac{\partial f_1(x)}{\partial x'} \\ \vdots \\ \frac{\partial f_N(x)}{\partial x'} \end{bmatrix}   (N × K),   (16)

is called the derivative or the Jacobian matrix or the Jacobian derivative of f at x. If N = K, the Jacobian derivative is a square matrix. Its determinant is called the Jacobian determinant or simply the Jacobian.
• Consider a linear mapping from R^K to R^N: f(x) ≡ A x, where A is N × K and x is K × 1. It is immediate from the definition that

D(Ax) = \frac{\partial (Ax)}{\partial x'} = A.   (17)
Taking Expectation and Variance of Random Vectors and Matrices
• Let x ≡ (x_1, x_2, . . . , x_K)′ be a K-dimensional random vector (a vector whose elements are random numbers). The expectation of x is simply the K-dimensional vector whose i-th element is the expectation of x_i:

E(x) ≡ \begin{bmatrix} E(x_1) \\ E(x_2) \\ \vdots \\ E(x_K) \end{bmatrix}   (K × 1).   (18)

Clearly, E(x′) = E(x)′.
• The variance matrix (or variance-covariance matrix or covariance matrix) of a K-dimensional random vector x is

Var(x) ≡ \begin{bmatrix}
Var(x_1) & Cov(x_1, x_2) & \cdots & Cov(x_1, x_K) \\
Cov(x_2, x_1) & Var(x_2) & \cdots & Cov(x_2, x_K) \\
\vdots & & & \vdots \\
Cov(x_K, x_1) & Cov(x_K, x_2) & \cdots & Var(x_K)
\end{bmatrix}   (K × K).   (19)

Since Cov(x_i, x_j) = Cov(x_j, x_i), the variance matrix is symmetric.
• Let X (K × M) be a K × M random matrix whose (i, j) element is x_{ij}. Its expectation is simply the K × M matrix whose (i, j) element is the expectation of the (i, j) element of X:

E(X) ≡ \begin{bmatrix}
E(x_{11}) & \cdots & E(x_{1M}) \\
E(x_{21}) & \cdots & E(x_{2M}) \\
\vdots & & \vdots \\
E(x_{K1}) & \cdots & E(x_{KM})
\end{bmatrix} ≡ [E(x_{ij})].   (20)
• Let X (K × K) be a square random matrix. Recall from matrix algebra that the trace of a square matrix is defined to be the sum of its diagonal elements. Thus

E[trace(X)] = E(x_{11} + x_{22} + \cdots + x_{KK})   (by the definition of the trace operator)
            = E(x_{11}) + E(x_{22}) + \cdots + E(x_{KK})   (by the linearity of expectations)
            = trace[E(X)].   (21)

Therefore, the trace and expectation operators can be interchanged.
• Let x (K × 1) and y (L × 1) be two random vectors. The covariance between x and y is the K × L matrix whose (i, j) element is Cov(x_i, y_j):

Cov(x, y) ≡ \begin{bmatrix}
Cov(x_1, y_1) & \cdots & Cov(x_1, y_L) \\
Cov(x_2, y_1) & \cdots & Cov(x_2, y_L) \\
\vdots & & \vdots \\
Cov(x_K, y_1) & \cdots & Cov(x_K, y_L)
\end{bmatrix}   (K × L).   (22)

Recalling that Cov(x, y) = E(x · y) − E(x) E(y) for two random variables x, y, this covariance matrix can be written as

Cov(x, y) = E(xy′) − E(x) E(y)′   (K × L).   (23)

By definition, Cov(x, y) = Cov(y, x)′.
• Let A (N × K) be an N × K matrix of constants and let x (K × 1) be a K-dimensional random vector. It is easy to show from the definition that

E(Ax) = A E(x)   (N × 1)   and   Var(Ax) = A Var(x) A′   (N × N).   (24)
• Let X be a random matrix, and let A and B be matrices of constants. It is easy to show that

E(AX) = A E(X),   E(XB) = E(X)B   and   E(AXB) = A E(X)B,   (25)

provided, of course, that the three matrices A, X, B are conformable (so that the products AX and XB can be defined).
• Let x (K × 1) be a random vector and A (K × K) be a K × K matrix of constants. What is the expectation of the quadratic form? Answer:

E(x′Ax) = E[trace(x′Ax)]   (since the trace of a scalar is the scalar itself)
        = E[trace(xx′A)]   (since trace(AB) = trace(BA))
        = trace[E(xx′A)]   (by (21))
        = trace[E(xx′)A]   (by (25)).   (26)
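A small Monte Carlo sketch of (26) (arbitrary A and distribution; illustration only): the sample average of x′Ax approaches trace(E(xx′)A).

```python
import numpy as np

rng = np.random.default_rng(7)
K, n_draws = 3, 200_000
A = rng.normal(size=(K, K))
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.diag([1.0, 2.0, 0.5])

x = rng.multivariate_normal(mu, Sigma, size=n_draws)      # draws of the random vector x
quad_mean = np.mean(np.einsum('ti,ij,tj->t', x, A, x))    # Monte Carlo estimate of E(x'Ax)

Exx = Sigma + np.outer(mu, mu)                            # E(xx') = Var(x) + E(x)E(x)'
print(quad_mean, np.trace(Exx @ A))                       # the two numbers should be close
```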
Exercises
1. Verify (4) for K = 2.
2. Verify (6) for K = 2.
3. Derive (13).
4. Verify (17) for K = 2, N = 3.
5. Prove (23).
6. Verify (24) for K = 2, N = 3.
7. Verify that E(AXB) = A E(X)B (which is the third claim in (25)) for your choice of the sizes of A, B, X.
8. Let e (N × 1) ≡ y (N × 1) − X (N × K) b (K × 1). Viewed as a function of b, e′e is a mapping from R^K to R. Use (4) and (7) to calculate ∂(e′e)/∂b. Hint:

e′e = (y − Xb)′(y − Xb)
    = (y′ − (Xb)′)(y − Xb)   (since (A + B)′ = A′ + B′ for any two matrices A, B of the same size)
    = (y′ − b′X′)(y − Xb)   (since (AB)′ = B′A′ for any two conformable matrices A, B)
    = y′y − b′X′y − y′Xb + b′X′Xb   (since (A + B)(C + D) = AC + BC + AD + BD)
    = y′y − 2y′Xb + b′X′Xb   (since b′X′y = y′Xb).

9. A result from matrix algebra: if X is of full column rank, then X′X is positive definite. Prove this. Show that the unique minimizer of e′e in the previous question is b = (X′X)^{−1}X′y.
10. (Optional) Let f : R^K → R^N, g : R^N → R^M. Let h ≡ g ◦ f be the composite of the functions f and g, defined by the relation h(x) = g(f(x)) for x ∈ R^K. So h : R^K → R^M. Prove the chain rule:

Dh(x) = Dg(f(x)) Df(x)   or   \frac{\partial h(x)}{\partial x'} = \frac{\partial g(y)}{\partial y'}\bigg|_{y = f(x)} \frac{\partial f(x)}{\partial x'}   ((M × K) = (M × N)(N × K)).   (27)

11. (Optional) Derive the result you derived in Question 8 above by first calculating ∂(e′e)/∂e and then using the chain rule. Hint: Set M = 1 in (27) to obtain

\frac{\partial (e'e)}{\partial b'} = \frac{\partial (e'e)}{\partial e'} \frac{\partial e}{\partial b'}   ((1 × K) = (1 × N)(N × K)).   (28)
GRIPS Advanced Econometrics III Spring 2016
Problem Set # 1

In the data file MISHKIN.XLS, monthly data are provided on:
Column 1: year
Column 2: month
Column 3: one-month inflation rate (in percent, annual rate; call this PAI1)
Column 4: three-month inflation rate (in percent, annual rate; call this PAI3)
Column 5: one-month T-bill rate (in percent, annual rate; call this TB1)
Column 6: three-month T-bill rate (in percent, annual rate; call this TB3)
Column 7: CPI for urban consumers, all items (the 1982–1984 average is set to 100; call this CPI).
The sample period is February 1950 to December 1990 (491 observations). The data on PAI1, PAI3, TB1, and TB3 are the same data used in Mishkin (1992) and were made available to us by him. The T-bill data were obtained from the Center for Research in Security Prices (CRSP) at the University of Chicago. The T-bill rates for the month are as of the last business day of the previous month (and so can be taken for the interest rates at the beginning of the month). In this exercise, we won't use PAI1, PAI3, or TB3, so those series won't concern us.

(a) (Library/internet work) To check the accuracy of the interest rate data in MISHKIN.ASC, visit the web site of the Board of Governors (http://www.federalreserve.gov/). Can you find one-month T-bill rates? [Answer: Probably not.]

(b) (Library/internet work) Visit the web site of the Bureau of Labor Statistics (www.bls.gov) to verify that the CPI figures in MISHKIN.ASC are correct. Verify that the timing of the variable is such that a January CPI observation is the CPI for the month in MISHKIN.ASC. Regarding the definition of the CPI, verify the following. (1) The CPI is for urban consumers, for all items including food and housing, and is not seasonally adjusted. (2) Prices of the component items of the index are sampled throughout the month. When is the CPI for the month announced?

(c) Is the CPI a fixed-weight index or a variable-weight index?

The one-month T-bill rate for month t in the data set is for the period from the beginning of month t to the end of the month (as you just verified). Ideally, if we had data on the price level at the beginning of each period, we would calculate the inflation rate for the same period as (P_{t+1} − P_t)/P_t, where P_t is the beginning-of-the-period price level. We use the CPI for month t − 1 for P_t (i.e., set P_t = CPI_{t−1}). Since the CPI component items are collected at different times during the month, there arises the inevitable misalignment of the inflation measure and the interest-rate data. Another problem is the timing of the release of the CPI figure. The efficient market theory assumes that P_t is known to the market at the beginning of month t when the T-bill rates for the month are set. However, the CPI for month t − 1, which we take to be P_t, is not announced until sometime in the following month (month t). Thus we are assuming that people know the CPI for month t − 1 at the beginning of month t.

(d) Reproduce the results in Table 2.1. The sample period should be the same as in the table, namely, January 1953–July 1971. Because the T-bill rate is in percent and at an annual rate, the inflation rate must be measured in the same unit. Calculate π_{t+1}, which is to be matched with TB1_t (the one-month T-bill rate for month t), in percent at an annual rate:

π_{t+1} = [(P_{t+1}/P_t)^{12} − 1] × 100.
(e) Can you reproduce (2.11.9), which has robust standard errors? What is the interpretation of the intercept?

(f) (Estimation under conditional homoskedasticity) Test market efficiency by regressing π_{t+1} on a constant and TB1_t under conditional homoskedasticity. Compare your results with those in (2.11.9). Which part is different?

(g) (Breusch-Godfrey test) For the specification in (f), conduct the Breusch-Godfrey test for serial correlation with p = 12. (The nR² statistic should be about 27.0.) Let e_t (t = 1, 2, . . . , n) be the OLS residual from the regression in (f). To perform the Breusch-Godfrey test as described in the text, we need to set e_t (t = 0, −1, . . . , −11) to zero in running the auxiliary regression for t = 1, 2, . . . , n.

(h) (Effect of seasonal adjustment) So far, the CPI series used is not seasonally adjusted. Obtain the official BLS (Bureau of Labor Statistics) seasonally adjusted data by visiting the BLS website. The CPI value for December 1990, for example, should be 134.2 (seasonally adjusted), not 133.8 (not seasonally adjusted). Prepare Table 2.1, this time using the seasonally adjusted data. How does the use of seasonally adjusted data change results?
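For part (d), a minimal sketch of the series construction and the OLS fit (assuming the data have already been loaded into a pandas DataFrame `df` with columns named year, month, PAI1, PAI3, TB1, TB3, CPI in the order listed above; the function name is mine):

```python
import numpy as np
import pandas as pd

def table21_regression(df: pd.DataFrame) -> np.ndarray:
    """OLS of pi_{t+1} on (1, TB1_t) over January 1953 - July 1971."""
    P = df["CPI"].shift(1)                            # P_t = CPI for month t-1
    pi_next = ((P.shift(-1) / P) ** 12 - 1) * 100     # pi_{t+1}, percent, annual rate

    yyyymm = df["year"] * 100 + df["month"]
    sample = yyyymm.between(195301, 197107)

    y = pi_next[sample].to_numpy()
    X = np.column_stack([np.ones(sample.sum()), df.loc[sample, "TB1"].to_numpy()])
    return np.linalg.solve(X.T @ X, X.T @ y)          # (intercept, slope)
```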
GRIPS Advanced Econometrics III Spring 2016
Problem Set # 2

Data files DM.ASC (for the Deutsche Mark), POUND.ASC (British Pound), and YEN.ASC (Japanese Yen) contain weekly data on the following items:
• the date of the observation (e.g., "19850104" is January 4, 1985)
• the ask price of the dollar in units of the foreign currency in the spot market on Friday of the current week (S_t)
• the ask price of the dollar in units of the foreign currency in the 30-day forward market on Friday of the current week (F_t)
• the bid price of the dollar in units of the foreign currency in the spot market on the delivery date on a current forward contract (S30_t).

The sample period is the first week of 1975 through the last week of 1989. The sample size is 778. As in the text, define s_t ≡ log(S_t), f_t ≡ log(F_t), s30_t ≡ log(S30_t). If I_t is the information available on the Friday of week t, it includes {s_t, s_{t−1}, . . . , f_t, f_{t−1}, . . . , s30_{t−5}, s30_{t−6}, . . .}. Note that s30_t is not observed until after the Friday of week t + 4. Define ε_t ≡ s30_t − f_t. Pick your favorite currency to answer the following questions.

(a) (Library/internet work) For the foreign currency of your choice, identify the week when the absolute value of the forward premium is largest. For that week, find some measure of the domestic one-month interest rate (e.g., the one-month CD rate) for the United States and the currency's home country, to verify that the interest rate differential is as large as is indicated in the forward premium.

(b) (Correlogram of {ε_t}) Draw the sample correlogram of ε_t with 40 lags. Does the autocorrelation appear to vanish after 4 lags? (It is up to you to decide whether to subtract the sample mean in the calculation of sample correlations. Theory says the population mean is zero, which you might want to impose in the calculation. In the calculation of the correlogram for the yen/$ exchange rate shown in Figure 6.2, the sample mean was subtracted.)
(c) (Is the log spot rate a random walk?) Draw the sample correlogram of s_{t+1} − s_t with 40 lags. For those 40 autocorrelations, use the Box-Ljung statistic to test for serial correlation. Can you reject the hypothesis that {s_t} is a random walk with drift?

(d) (Unconditional test) Carry out the unconditional test. Can you replicate the results of Table 6.1 for the currency?

(e) (Regression test with truncated kernel) Carry out the regression test. Can you replicate the results of Table 6.2 for the currency?

(f) (Bartlett kernel) Use the Bartlett kernel-based estimator of S to do the regression test. Newey and West (1994) provide a data-dependent automatic bandwidth selection procedure. Take for granted that the autocovariance lag length determined by this procedure is 12 for yen/$ (so autocovariances up to the twelfth are included in the calculation of S), 8 for DM/$, and 16 for Pound/$. The standard error for the f − s coefficient for yen/$ should be 0.6815.
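For part (b), a minimal sketch of the sample correlogram computation (the helper name is mine; the commented usage assumes the price series S30 and F have already been loaded from one of the .ASC files):

```python
import numpy as np

def sample_correlogram(eps: np.ndarray, n_lags: int = 40, demean: bool = False) -> np.ndarray:
    """Sample autocorrelations rho_1, ..., rho_n_lags of the series eps."""
    x = eps - eps.mean() if demean else eps      # theory says the population mean is zero
    denom = np.sum(x**2)
    return np.array([np.sum(x[j:] * x[:-j]) / denom for j in range(1, n_lags + 1)])

# Usage (hypothetical arrays):
# eps = np.log(S30) - np.log(F)       # eps_t = s30_t - f_t
# print(sample_correlogram(eps, 40))
```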
Class Name: Advanced Econometrics III
Course Number: ECO6720E
Course Homepage: https://sites.google.com/site/fumiohayashi/teaching/advanced-econometrics-iii
Course Instructor (Full Name): HAYASHI, Fumio
Academic Year (April - March of the next year): 2016
Term: Spring

1. Course Description:
This is a course on time series. The topics covered include: ARMA models, VARs, unit roots, and cointegration.

2. Course Outline:
Week 1, 2: basic concepts in time series
  review of large sample theory and basic concepts (Sections 2.1, 2.2 of Hayashi)
  application: nominal interest rates as predictors of inflation (Section 2.11 of Hayashi)
Week 3, 4: time series models
  linear processes and MA processes (Section 6.1 of Hayashi)
  ARMA processes (Section 6.2 of Hayashi)
Week 5: Vector Autoregressions
  Chapter 11 of Hamilton
Week 6 and 7: unit roots and cointegration
  Chapters 9, 10 of Hayashi
  application: PPP (purchasing power parity) (Section 9.6 of Hayashi)
Week 8: Final exam

3. Grading:
There will be two or possibly three homework problem sets. The grade will depend on those homework assignments (34%) and the final (66%).

4. Textbooks: (4-1: Required, 4-2: Others)
F. Hayashi, Econometrics, Princeton University Press, 2000.
J. Hamilton, Time Series Analysis, Princeton University Press, 1994.

5. Note:
The prerequisite is ECO6700E (Advanced Econometrics I). Presentation files will be made available before each class meeting.
GRIPS, Spring 2016
VARs and SVARs
• Covers Hamilton's Sections 11.4 (The Impulse Response Function) and 11.6 (VAR and Structural Econometric Models).
• You don't have to read those sections. However, glancing at them after the class might be useful.
• The "sub 0" notation is suspended (to be consistent with Hamilton). The sample size is T.

1. From Matrix Algebra: Matrix Decompositions
• (The "LDU" decomposition when applied to positive definite matrices) Let Ω be a positive definite symmetric matrix. Then there exists a unique lower triangular matrix A with 1's on the diagonal and a unique diagonal matrix D with positive entries on the diagonal, such that Ω = A D A′.
• (The Cholesky decomposition: the "LU" decomposition when applied to positive definite matrices) Let Ω be a positive definite symmetric matrix. Then there exists a unique lower triangular matrix P with positive entries on the diagonal such that Ω = P P′.
• Obviously, A, D, and P are related as P = A D^{1/2}, where D^{1/2} ≡ the square root of D (the diagonal matrix whose diagonal entries are the square roots of those of D).
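A minimal NumPy sketch of the two decompositions (arbitrary positive definite Ω; the LDU factors are recovered here from the Cholesky factor):

```python
import numpy as np

Omega = np.array([[4.0, 1.0],
                  [1.0, 2.0]])        # positive definite, symmetric

# Cholesky ("LU"): Omega = P P', P lower triangular with positive diagonal
P = np.linalg.cholesky(Omega)
assert np.allclose(P @ P.T, Omega)

# "LDU": Omega = A D A', A unit lower triangular, D diagonal with positive entries
d = np.diag(P)
A = P / d                             # divide each column of P by its diagonal entry
D = np.diag(d**2)
assert np.allclose(A @ D @ A.T, Omega)
assert np.allclose(P, A @ np.sqrt(D)) # P = A D^{1/2}
```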
2. Impulse Responses
• VAR, VMA:

(VAR)   y_t = c + Φ_1 y_{t-1} + Φ_2 y_{t-2} + \cdots + Φ_p y_{t-p} + ε_t,   (11.1.1)
(VMA)   y_t = μ + ε_t + Ψ_1 ε_{t-1} + Ψ_2 ε_{t-2} + \cdots,   (11.4.1)

where y_t, c, μ, ε_t are n × 1 and each Φ_j, Ψ_j is n × n. So

y_{t+s} = μ + ε_{t+s} + Ψ_1 ε_{t+s-1} + Ψ_2 ε_{t+s-2} + \cdots + Ψ_{s-1} ε_{t+1} + Ψ_s ε_t + Ψ_{s+1} ε_{t-1} + \cdots.

• IRs (Impulse Responses Defined)

\frac{\partial E_t(y_{t+s})}{\partial ε_t'} = Ψ_s   (n × n).

  – The response of the i-th element of y_{t+s} to a one-time impulse in y_{jt} is the (i, j) element of Ψ_s.
  – History-independent.
  – Viewed as a function of s, it is called the impulse response function.
• Orthogonalized IRs.
  – Define: Ω ≡ Var(ε_t); A and D as in the "LDU" decomposition of Ω; P as in the "LU" decomposition of Ω; u_t ≡ A^{-1} ε_t, so ε_t = A u_t and Var(u_t) = D.
  – The orthogonalized IR of the i-th element of y_{t+s} to a one-time impulse in y_{jt} is the (i, j) element of

\frac{\partial E_t(y_{t+s})}{\partial u_t'} = Ψ_s A   (n × n).

  – Takes into account the correlation among the elements of ε_t.
  – Depends on the ordering of the variables.
  – If the size of the impulse is one standard deviation of the shock to y_{jt}, the orthogonalized IR of the i-th element of y_{t+s} to y_{jt} is the (i, j) element of

Ψ_s A D^{1/2} = Ψ_s P   (n × n).

This is the default IR output of EViews.