EC403, Part 1 Oliver Linton October 14, 2002
Contents

1 Linear Regression 7
  1.1 The Model 7
  1.2 The OLS Procedure 10
    1.2.1 Some Alternative Estimation Paradigms 12
  1.3 Geometry of OLS and Partitioned Regression 14
  1.4 Goodness of Fit 19
  1.5 Functional Form 21

2 Statistical Properties of the OLS Estimator 25
  2.1 Optimality 28

3 Hypothesis Testing 33
  3.1 General Notations 35
  3.2 Examples 36
  3.3 Test of a Single Linear Hypothesis 37
  3.4 Test of a Multiple Linear Hypothesis 40
  3.5 Test of Multiple Linear Hypothesis Based on Fit 42
  3.6 Examples of F-Tests, t vs. F 46
  3.7 Likelihood Based Testing 48

4 Further Topics in Estimation 53
  4.1 Omission of Relevant Variables 53
  4.2 Inclusion of Irrelevant Variables 55
  4.3 Model Selection 56
  4.4 Multicollinearity 57
  4.5 Influential Observations 59
  4.6 Missing Observations 60

5 Asymptotics 65
  5.1 Types of Asymptotic Convergence 65
  5.2 Laws of Large Numbers and Central Limit Theorems 67
  5.3 Additional Results 69
  5.4 Applications to OLS 70
  5.5 Asymptotic Distribution of OLS 71
  5.6 Order Notation 73
  5.7 Standard Errors and Test Statistics in Linear Regression 73
  5.8 The Delta Method 75

6 Errors in Variables 77
  6.1 Solutions to EIV 81
  6.2 Other Types of Measurement Error 82
  6.3 Durbin-Wu-Hausman Test 83

7 Heteroskedasticity 85
  7.1 Effects of Heteroskedasticity 85
  7.2 Plan A: Eicker-White 87
  7.3 Plan B: Model Heteroskedasticity 88
  7.4 Properties of the Procedure 89
  7.5 Testing for Heteroskedasticity 90

8 Nonlinear Regression Models 93
  8.1 Computation 94
  8.2 Consistency of NLLS 96
  8.3 Asymptotic Distribution of NLLS 98
  8.4 Likelihood and Efficiency 101

9 Generalized Method of Moments 103
  9.1 Asymptotic Properties in the iid Case 105
  9.2 Test Statistics 107
  9.3 Examples 108
  9.4 Time Series Case 111
  9.5 Asymptotics 113
  9.6 Example 114

10 Time Series 117
  10.1 Some Fundamental Properties 117
  10.2 Estimation 122
  10.3 Forecasting 125
  10.4 Autocorrelation and Regression 126
  10.5 Testing for Autocorrelation 129
  10.6 Dynamic Regression Models 130
  10.7 Adaptive Expectations 132
  10.8 Partial Adjustment 133
  10.9 Error Correction 133
  10.10 Estimation of ADL Models 134
  10.11 Nonstationary Time Series Models 135
  10.12 Estimation 137
  10.13 Testing for Unit Roots 138
  10.14 Cointegration 139
  10.15 Martingales 140
  10.16 GARCH Models 141
  10.17 Estimation 144
Chapter 1 Linear Regression

1.1 The Model
• The first part of the course will be concerned with estimating and testing in the linear model. We will suggest procedures and derive their properties under certain assumptions. The linear model is the basis for most of econometrics, and a firm grounding in this theory is essential for future work.

• We observe the following data:
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_{11} & \cdots & x_{K1} \\ \vdots & & \vdots \\ x_{1n} & \cdots & x_{Kn} \end{pmatrix} = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix},$$
where $\mathrm{rank}(X) = K$. Note that this is an assumption, but it is immediately verifiable from the data, in contrast to some other assumptions we will make.

• It is desirable for statistical analysis to specify a model of how these data were generated. We suppose that there is a random mechanism behind everything: the data we have is one realisation of an infinity of such potential outcomes. We shall make the following assumptions regarding the way y, X were generated.
• Fixed Design Linear Model:
  — (A1) X is fixed in repeated samples.
  — (A2) $\exists \beta = (\beta_1, \ldots, \beta_K)'$ such that $E(y) = X\beta$.
  — (A3) $\mathrm{Var}(y) = \sigma^2 I_{n \times n}$.

• We stick with the fixed design for most of the linear regression section. Fixed design is perhaps unconvincing for most economic data sets, because of the asymmetry between y and x. That is, in economic datasets we have no reason to think that some data were randomly generated while others were fixed. This is especially so in time series, when one regressor might be a lagged value of the dependent variable.

• A slightly different specification is the Random Design Linear Model:
  — (A1r) X is random with respect to repeated samples.
  — (A2r) $\exists \beta$ such that $E(y|X) = X\beta$.
  — (A3r) $\mathrm{Var}(y|X) = \sigma^2 I_{n \times n}$,
where formally A2r and A3r hold with probability one.

• However, one can believe in a random design model but want to conduct inference in the conditional distribution [given X]. This is sensible at least in the cross-section case, where there are no lagged dependent variables. In this case, we are effectively working in a fixed design model, so the real distinction is whether one evaluates quantities in the conditional or the unconditional distribution.
• Finally, we write the regression model in the more familiar form. Define $\varepsilon = y - X\beta = (\varepsilon_1, \ldots, \varepsilon_n)'$; then
$$y = X\beta + \varepsilon,$$
where [in the fixed design] $E(\varepsilon) = 0$ and $E(\varepsilon\varepsilon') = \sigma^2 I_n$. The linear regression model is more commonly stated like this, with statistical assumptions made about the unobservable $\varepsilon$ rather than directly on the observable y. The assumptions about the vector $\varepsilon$ are quite weak in some respects (the observations need not be independent and identically distributed, since only the first two moments of the vector are specified) but strong in regard to the second moments themselves.

• It is worth discussing here some alternative assumptions made about the error terms. For this purpose we shall assume a random design, and moreover suppose that $(x_i, \varepsilon_i)$ are i.i.d. In this case, we can further assume that:
  — $E(\varepsilon_i x_i) = 0$;
  — $E(\varepsilon_i | x_i) = 0$, denoted $\varepsilon_i \perp x_i$;
  — $\varepsilon_i$ are i.i.d. and independent of $x_i$, denoted $\varepsilon_i \perp\!\!\!\perp x_i$;
  — $\varepsilon_i \sim N(0, \sigma^2)$.

• The first assumption, called an unconditional moment condition, is the weakest assumption needed to 'identify' the parameter $\beta$.

• The second assumption, called a conditional moment restriction, is a little bit stronger. It is really just a rewriting of the definition of conditional expectation.

• The third assumption is much stronger and is not strictly necessary for estimation purposes, although it does have implications for efficiency and the choice of estimator.

• The fourth assumption we will sometimes make in connection with hypothesis testing and for establishing optimality of least squares.
1.2 The OLS Procedure
• In practice we don't know the parameter $\beta$ and seek to estimate it from the data.

• For any b, define the prediction $Xb$ and $u(b) = y - Xb$. Then $u(b)$ is the vector of discrepancies between the observed y and the predicted $Xb$.

• The Ordinary Least Squares (OLS) procedure chooses $\hat\beta$ to minimize the quadratic form
$$S(b) = u(b)'u(b) = \sum_{i=1}^n u_i^2(b) = (y - Xb)'(y - Xb)$$
with respect to $b \in \mathbb{R}^K$. This is perhaps the main estimator of $\beta$, and we shall study its properties at length.

• The first question is whether a minimum exists. Since the criterion is a continuous function of b, a minimum over any compact subset always exists.

• A necessary condition for the uniqueness of a solution is that $n \geq K$. If $n = K$, the solution essentially involves interpolating the data, i.e., the fitted value of y will be equal to the actual value.
• When the assumption that $\mathrm{rank}(X) = K$ is made, $\hat\beta$ is uniquely defined for any y and X, independently of the model; there is no need for assumptions A1-A3 when it comes to computing the estimator.

• We now give two derivations of the well-known result that
$$\hat\beta = (X'X)^{-1}X'y.$$
• First, we suppose the answer is given by this formula and demonstrate that $\hat\beta$ minimizes $S(b)$ with respect to b. Write
$$u(b) = y - X\hat\beta + X\hat\beta - Xb,$$
so that
$$S(b) = (y - X\hat\beta + X\hat\beta - Xb)'(y - X\hat\beta + X\hat\beta - Xb) = (y - X\hat\beta)'(y - X\hat\beta) + (\hat\beta - b)'X'X(\hat\beta - b),$$
because the cross terms vanish:
$$X'(y - X\hat\beta) = X'y - X'X(X'X)^{-1}X'y = X'y - X'y = 0.$$
But
$$(\hat\beta - b)'X'X(\hat\beta - b) \geq 0,$$
and equality holds only when $b = \hat\beta$.

• Second, a minimizer of $S(b)$ must satisfy the vector of first order conditions
$$\frac{\partial S}{\partial b} = -2X'(y - X\hat\beta) = 0.$$
Therefore
$$X'y = X'X\hat\beta.$$
Now we use the assumption that X is of full rank. This ensures that $X'X$ is invertible, and
$$\hat\beta = (X'X)^{-1}X'y$$
as required. To verify that we have found a minimum rather than a maximum, we calculate the second derivatives
$$\frac{\partial^2 S}{\partial b \,\partial b'} = 2X'X > 0.$$

• The vector derivatives follow by straightforward calculus:
$$\frac{\partial}{\partial b_j}\sum_{i=1}^n u_i^2(b) = 2\sum_{i=1}^n u_i(b)\frac{\partial u_i}{\partial b_j} = -2\sum_{i=1}^n u_i(b)x_{ij},$$
since $\partial u_i/\partial b_j = -x_{ij}$.
• Characterization of the solution. Define the fitted value $\hat y = X\hat\beta$ and the OLS residuals
$$\hat u = y - \hat y = y - X\hat\beta.$$

• The OLS estimator $\hat\beta$ solves the normal equations $X'\hat u = 0$, i.e.,
$$\sum_{i=1}^n x_{1i}\hat u_i = 0, \quad \sum_{i=1}^n x_{2i}\hat u_i = 0, \quad \ldots, \quad \sum_{i=1}^n x_{Ki}\hat u_i = 0.$$

• We say that X is orthogonal to $\hat u$, denoted $X \perp \hat u$. Note that if, as usual, $x_{1i} = 1$, then $\sum_{i=1}^n \hat u_i = 0$.
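As a quick numerical check of the normal equations, the following NumPy sketch (the data and variable names are invented for illustration) solves $X'X\hat\beta = X'y$ directly and verifies the orthogonality properties just stated:

```python
import numpy as np

# simulated data; the design and coefficients are arbitrary illustrations
rng = np.random.default_rng(0)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # first column is an intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# solve the normal equations X'X beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat  # OLS residuals

print(np.allclose(X.T @ u_hat, 0.0))  # X is orthogonal to the residuals
print(np.isclose(u_hat.sum(), 0.0))   # with an intercept column, residuals sum to zero
```

Solving the normal equations with `np.linalg.solve` avoids forming $(X'X)^{-1}$ explicitly, which is both cheaper and numerically more stable.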
1.2.1 Some Alternative Estimation Paradigms
• We briefly mention some alternative estimation methods which actually lead to the same estimator as the OLS estimator in some special cases, but which are more broadly applicable.

• Maximum Likelihood. Suppose we also assume that $y \sim N(X\beta, \sigma^2 I)$. Then the density function of y [conditional on X] is
$$f_{y|X}(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right).$$

• The density function depends on the unknown parameters $\beta, \sigma^2$, which we want to estimate. We therefore switch the emphasis and call the following quantity the log likelihood function for the observed data:
$$\ell(b, s^2 \,|\, y, X) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log s^2 - \frac{1}{2s^2}(y - Xb)'(y - Xb),$$
where b and $s^2$ are unknown parameters.

• The maximum likelihood estimators $\hat\beta_{mle}, \hat\sigma^2_{mle}$ maximize $\ell(b, s^2)$ with respect to b and $s^2$. It is easy to see that
$$\hat\beta_{mle} = \hat\beta, \qquad \hat\sigma^2_{mle} = \frac{1}{n}(y - X\hat\beta_{mle})'(y - X\hat\beta_{mle}).$$
Basically, the criterion function is the least squares criterion apart from an affine transformation involving only $s$.

• Note, however, that if we had a different assumption about the errors than A4, e.g., that they came from a t-distribution, then we would have a different likelihood and a different estimator than $\hat\beta$. In particular, the estimator may not be explicitly defined and may be a nonlinear function of y.

• Method of Moments. Suppose that we define parameters through some population moment conditions; these can arise from an economic optimization problem, see below.

• For example, suppose that we say that $\beta$ is defined as the unique parameter that satisfies the K moment conditions [we need as many moment conditions as parameters]
$$E[x_i(y_i - x_i'\beta)] = 0.$$
Note that this is the natural consequence of our assumption that $E(\varepsilon_i x_i) = 0$.

• Replacing the population average by the sample average, we must find b such that
$$\frac{1}{n}\sum_{i=1}^n x_i(y_i - x_i'b) = 0.$$
The solution to this is of course $\hat\beta = (X'X)^{-1}X'y$, i.e., the MOM estimator is equal to OLS in this case. Thus, for the moment conditions above, we are led to the least squares estimator.
• However, if we chose some other conditions, then a different estimator would result. For example, suppose that we assume that
$$E[x_i(y_i - x_i'\beta)^3] = 0;$$
we would be led to a different estimator: any solution of
$$\frac{1}{n}\sum_{i=1}^n x_i(y_i - x_i'b)^3 = 0.$$
In general, this would be more complicated to analyze.

• We emphasize here that the above estimation methods are all suggested or motivated by our assumptions, but of course we can always carry out the procedure without regard to the underlying model; the procedures only require data, not assumptions.
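The coincidence of OLS, the Gaussian MLE of $\beta$, and the method-of-moments solution can be checked numerically. The following NumPy sketch (invented data) verifies that the sample moment condition holds at the OLS solution and computes the MLE variance estimate $RSS/n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS = Gaussian MLE of beta

# the sample moment condition (1/n) sum_i x_i (y_i - x_i'b) = 0 holds at b = beta_hat
moment = X.T @ (y - X @ beta_hat) / n
print(np.allclose(moment, 0.0))

# the MLE of sigma^2 divides the residual sum of squares by n (no df correction)
sigma2_mle = np.sum((y - X @ beta_hat) ** 2) / n
```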
1.3 Geometry of OLS and Partitioned Regression

• We want to give a geometric interpretation to the OLS procedure.

• The data $y, x_1, \ldots, x_K$ can all be viewed as elements of the vector space $\mathbb{R}^n$. Define the set
$$C(X) = \{\gamma_1 x_1 + \cdots + \gamma_K x_K\} = \{X\gamma : \gamma \in \mathbb{R}^K\} \subset \mathbb{R}^n,$$
otherwise known as the column span of X.

• Then $C(X)$ is a linear subspace of $\mathbb{R}^n$ of dimension K, assuming that the matrix X is of full rank. If it is only of rank $K^*$ with $K^* < K$, then $C(X)$ is still a linear subspace of $\mathbb{R}^n$ but of dimension $K^*$.

• The OLS procedure can equivalently be defined as finding the point in $C(X)$ closest to y, where closeness is measured in terms of squared Euclidean distance, i.e.,
$$d(y, Xb) = \|y - Xb\|^2 = (y - Xb)'(y - Xb)$$
is the distance of y to the point $Xb \in C(X)$.
• This is an old problem in geometry, which is now given a key role in abstract mathematics.

• The projection theorem [Hilbert] says that there is a unique solution to the minimization problem, call it $\hat y$, which is characterized by the fact that
$$\hat u = y - \hat y$$
is orthogonal to $C(X)$.

• Equivalently, we can write uniquely
$$y = \hat y + \hat u,$$
where $\hat y \in C(X)$ and $\hat u \in C^\perp(X)$ [the space $C^\perp(X)$ is called the orthocomplement of $C(X)$ and consists of all vectors orthogonal to $C(X)$]. Essentially, one is dropping a perpendicular, and the procedure should be familiar from high school geometry.

• For example, let n = 3 and
$$X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}.$$
Then $C(X)$ is the set of all vectors in $\mathbb{R}^3$ with third component zero. What is the closest point to $y = (1, 1, 1)'$? It is
$$\hat y = (1, 1, 0)' = X\hat\beta, \qquad \hat u = (0, 0, 1)'.$$
In fact $\hat u$ is orthogonal to $C(X)$, i.e., $\hat u \in C^\perp(X) = \{(0, 0, \gamma)' : \gamma \in \mathbb{R}\}$.
• In general, how do we find $\hat y$? When X is of full rank we can give a simple explicit solution: $\hat y = P_X y$, where the projection matrix
$$P_X = X(X'X)^{-1}X'$$
projects onto $C(X)$.

• Let $\hat u = y - \hat y = M_X y$, where the projection matrix
$$M_X = I - X(X'X)^{-1}X'$$
projects onto $C^\perp(X)$. Thus for any y we can write
$$y = \hat y + \hat u = P_X y + M_X y.$$
The matrices $P_X$ and $M_X$ are symmetric and idempotent, i.e., $P_X' = P_X$ and $P_X^2 = P_X$. After applying $P_X$ once you are already in $C(X)$. This implies that $P_X X = X$ and $M_X X = 0$, so that $P_X M_X y = 0$ for all y.

• Since $\hat y \in C(X)$, we can rewrite it as $\hat y = X\hat\beta$, so that $\hat\beta = (X'X)^{-1}X'y$.

• The space $C(X)$ is invariant to nonsingular linear transformations $X \mapsto XA$, where A is $K \times K$ with $\det A \neq 0$. Let $z \in C(X)$. Then there exists $\gamma \in \mathbb{R}^K$ such that $z = X\gamma$. Therefore $z = XAA^{-1}\gamma = (XA)\delta$, where $\delta = A^{-1}\gamma \in \mathbb{R}^K$, and vice versa.
• Since $C(X)$ is invariant to such transformations, so are $\hat y$ and $\hat u$ (but not $\hat\beta$). For example, rescaling the components of X does not affect the values of $\hat y$ and $\hat u$. Compare the regressions
(1) y on $(x_1, x_2, x_3)$
(2) y on $(x_1 + x_2,\ 2x_2 - x_3,\ 3x_1 - 2x_2 + 5x_3)$,
in which case the transformation is
$$A = \begin{pmatrix} 1 & 0 & 3 \\ 1 & 2 & -2 \\ 0 & -1 & 5 \end{pmatrix},$$
which is of full rank. Therefore (1) and (2) yield the same $\hat y$ and $\hat u$.
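The algebraic properties of $P_X$ and $M_X$, and the invariance of $\hat y$ under $X \mapsto XA$, are easy to confirm numerically. A sketch with randomly generated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 20, 3
X = rng.normal(size=(n, K))
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T  # projects onto C(X)
M = np.eye(n) - P                     # projects onto the orthocomplement of C(X)

print(np.allclose(P, P.T) and np.allclose(P @ P, P))  # symmetric and idempotent
print(np.allclose(P @ M, 0.0))                        # P_X M_X = 0
print(np.allclose(P @ X, X) and np.allclose(M @ X, 0.0))

# invariance: X -> XA with A nonsingular leaves the fitted values unchanged
A = rng.normal(size=(K, K))  # a random square matrix is almost surely nonsingular
XA = X @ A
P_A = XA @ np.linalg.inv(XA.T @ XA) @ XA.T
print(np.allclose(P @ y, P_A @ y))
```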
• Emphasizing $C(X)$ rather than X itself is called the coordinate-free approach. Some aspects of the model and estimates are properties of $C(X)$ alone, so the choice of coordinates is irrelevant.

• When X is not of full rank:
  — the space $C(X)$ is still well defined, as is the projection of y onto $C(X)$;
  — the fitted value $\hat y$ and residual $\hat u$ are uniquely defined in this case;
  — but there is no unique coefficient vector $\hat\beta$;
  — this is the case commonly called multicollinearity.
• We next consider an important application of the projection idea. Partition
$$X = (X_1, X_2), \qquad X_1 : n \times K_1, \quad X_2 : n \times K_2, \quad K_1 + K_2 = K,$$
and suppose we are interested in obtaining the coefficient $\hat\beta_1$ in the projection of y onto $C(X)$.

• A key property of projections is that if $X_1$ and $X_2$ are orthogonal, i.e., if $X_1'X_2 = 0$, then $P_X = P_{X_1} + P_{X_2}$. This can be verified algebraically, but it should also be obvious geometrically. In this case, write
$$\hat y = X\hat\beta = P_X y = P_{X_1}y + P_{X_2}y = X_1\hat\beta_1 + X_2\hat\beta_2.$$
This just says that if $X_1$ and $X_2$ were orthogonal, then we could get $\hat\beta_1$ by regressing y on $X_1$ only, and $\hat\beta_2$ by regressing y on $X_2$ only.

• Very rarely are $X_1$ and $X_2$ orthogonal, but we can construct equivalent regressors that are orthogonal. Suppose we have general $X_1$ and $X_2$, whose dimensions satisfy $K_1 + K_2 = K$. We make the following observations:
  — $(X_1, X_2)$ and $(M_2X_1, X_2)$ span the same space. This follows because $X_1 = M_2X_1 + P_2X_1$, where $C(P_2X_1) \subseteq C(X_2)$. Therefore $C(M_2X_1, X_2) = C(X_1, X_2)$.
  — $M_2X_1$ and $X_2$ are orthogonal.

• This says that if we regress y on $(X_1, X_2)$ or y on $(M_2X_1, X_2)$ we get the same $\hat y$ and $\hat u$, and that if we wanted the coefficients on $M_2X_1$ from the second regression we could in fact just regress y on $M_2X_1$ only.

• What are the coefficients on $M_2X_1$? Recall that
$$\hat y = X_1\hat\beta_1 + X_2\hat\beta_2 = (M_2 + P_2)X_1\hat\beta_1 + X_2\hat\beta_2 = M_2X_1\hat\beta_1 + X_2\left[\hat\beta_2 + (X_2'X_2)^{-1}X_2'X_1\hat\beta_1\right] = M_2X_1\hat\beta_1 + X_2\hat C,$$
where
$$\hat C = \hat\beta_2 + (X_2'X_2)^{-1}X_2'X_1\hat\beta_1.$$

• So the coefficient on $M_2X_1$ is the original $\hat\beta_1$, while that on $X_2$ is some combination of $\hat\beta_1$ and $\hat\beta_2$. Note that $M_2X_1$ are the residuals from a regression of $X_1$ on $X_2$.

• Practical implication. If K is large and we are primarily interested in the first $K_1$ variables, then we can get $\hat\beta_1$ by regressing y [or, equivalently, $M_2y$] on $M_2X_1$ only, i.e.,
$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2y = (X_1'M_2M_2X_1)^{-1}X_1'M_2M_2y.$$
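The partitioned-regression result above can be demonstrated directly. This NumPy sketch (invented data) checks that regressing y on $M_2X_1$ alone reproduces the coefficients on $X_1$ from the full regression:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X1 = rng.normal(size=(n, 2))                            # K1 = 2 variables of interest
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])  # K2 = 2 controls (incl. intercept)
X = np.hstack([X1, X2])
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

beta_full = np.linalg.solve(X.T @ X, X.T @ y)  # full regression

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)  # annihilator of X2
M2X1 = M2 @ X1                                          # residuals of X1 on X2
beta1 = np.linalg.solve(M2X1.T @ M2X1, M2X1.T @ y)      # regress y on M2 X1 only

print(np.allclose(beta_full[:2], beta1))  # same coefficients on X1
```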
This involves inversion of only $K_1 \times K_1$ and $K_2 \times K_2$ matrices, which takes less computing time than inverting a $K \times K$ matrix, especially when K is large [this computation can be as bad as $O(K^3)$].

• Suppose that $X_2 = (1, 1, \ldots, 1)' = i$. Then
$$M_2 = I_n - i(i'i)^{-1}i' = I_n - \frac{ii'}{n}$$
and
$$M_2x_1 = x_1 - \frac{1}{n}\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\sum_{i=1}^n x_{1i} = \begin{pmatrix} x_{11} - \bar x_1 \\ \vdots \\ x_{1n} - \bar x_1 \end{pmatrix}.$$
When the regression includes an intercept, we can first demean the X variables (and the y's) and then run the regression on the demeaned variables.
1.4 Goodness of Fit
• How well does the model explain the data? One possibility is to measure the fit by the residual sum of squares,
$$RSS = \sum_{i=1}^n (y_i - \hat y_i)^2.$$
In general, the smaller the RSS the better. However, the numerical value of RSS depends on the units in which y is measured, so one cannot compare it across models.

• The generally used measure of goodness of fit is the $R^2$. In actuality, there are three alternative definitions in general:
  — one minus the ratio of the residual sum of squares to the total sum of squares,
$$R_1^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2};$$
  — the squared sample correlation between y and $\hat y$,
$$R_2^2 = \frac{\left[\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar{\hat y})\right]^2}{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2};$$
  — the ratio of the explained sum of squares to the total sum of squares,
$$R_3^2 = \frac{ESS}{TSS} = \frac{\sum_{i=1}^n (\hat y_i - \bar{\hat y})^2}{\sum_{i=1}^n (y_i - \bar y)^2}.$$
Here $\bar y = \sum_{i=1}^n y_i/n$ and $\bar{\hat y} = \sum_{i=1}^n \hat y_i/n$.
• Theorem. When an intercept is included, all three measures are the same.

• Proof that $R_1^2 = R_2^2$. Since an intercept is included, we have
$$\sum_{i=1}^n \hat u_i = 0,$$
which implies that $\bar{\hat y} = \bar y$. Therefore,
$$\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y) = \sum_{i=1}^n (\hat y_i - \bar y)^2,$$
because
$$\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) = \sum_{i=1}^n \hat u_i(\hat y_i - \bar y) = 0.$$

• Proof that $R_1^2 = R_3^2$. Similarly,
$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2.$$
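A numerical check of the theorem (a sketch with invented data): with an intercept included, all three $R^2$ definitions agree.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # intercept included

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
tss = np.sum((y - y.mean()) ** 2)

r2_1 = 1.0 - np.sum((y - y_hat) ** 2) / tss       # 1 - RSS/TSS
r2_2 = np.corrcoef(y, y_hat)[0, 1] ** 2           # squared correlation of y and y_hat
r2_3 = np.sum((y_hat - y_hat.mean()) ** 2) / tss  # ESS/TSS

print(np.allclose([r2_1, r2_2], r2_3))
```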
• If an intercept is included, then $0 \leq R^2 \leq 1$. If not, then $0 \leq R_2^2 \leq 1$, but $R_3^2$ could be greater than one, and $R_1^2$ could be less than zero.

• If $y = \alpha + \beta x + u$, then $R^2$ is the squared sample correlation between y and x.

• The $R^2$ is invariant to some changes of units:
  — if $y \mapsto ay + b$ for any constants a, b, then $\hat y_i \mapsto a\hat y_i + b$ and $\bar y \mapsto a\bar y + b$, so $R^2$ is the same in this case;
  — clearly, if $X \mapsto XA$ for a nonsingular matrix A, then $\hat y$ is unchanged, and so is $R^2$.

• $R^2$ always increases with the addition of variables. With $K = n$ we can make $R^2 = 1$.

• Theil's adjusted $R^2$ is defined as follows:
$$\bar R^2 = 1 - \frac{n-1}{n-K}(1 - R^2).$$
This amounts to dividing each sum of squares by the appropriate degrees of freedom, so that
$$1 - \bar R^2 = \frac{\frac{1}{n-K}\sum_{i=1}^n (y_i - \hat y_i)^2}{\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2}.$$
It follows that
$$\frac{\partial \bar R^2}{\partial K} = \underbrace{\frac{n-1}{n-K}\,\frac{\partial R^2}{\partial K}}_{+} - \underbrace{\frac{n-1}{(n-K)^2}(1 - R^2)}_{-}.$$
This measure allows some trade-off between fit and parsimony.
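The monotonicity of the raw $R^2$ in K, which motivates Theil's adjustment, is easy to see in simulation. Here five irrelevant regressors are appended (a sketch; the data and names are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def r2_and_adjusted(X, y):
    n, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    r2 = 1.0 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    return r2, 1.0 - (n - 1) / (n - K) * (1.0 - r2)  # Theil's adjustment

X_small = np.column_stack([np.ones(n), x])
X_big = np.hstack([X_small, rng.normal(size=(n, 5))])  # append 5 irrelevant regressors

r2_s, adj_s = r2_and_adjusted(X_small, y)
r2_b, adj_b = r2_and_adjusted(X_big, y)
print(r2_b >= r2_s)  # the raw R^2 never falls when regressors are added
```

The adjusted measure penalizes the extra regressors, so unlike the raw $R^2$ it need not rise here.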
1.5 Functional Form
• Linearity can often be restrictive. We shall now consider how to generalize the use of the linear model slightly, so as to allow certain types of nonlinearity without fundamentally altering the applicability of the analytical results we have built up. Some examples:
$$\text{Wages} = \alpha + \beta \cdot ed + \gamma \cdot UNION + u$$
$$\text{Wages} = \alpha + \beta \cdot ed + \gamma \cdot ab + \delta \cdot ed \cdot ab + u$$
$$\text{Wages} = \alpha + \beta \cdot ed + \gamma \cdot ex + \delta \cdot ex^2 + \lambda \cdot ex^3 + u$$
$$\log \text{Wages} = \alpha + \beta \cdot ed + \gamma \cdot UNION + u$$
$$\log \frac{fs}{1 - fs} = \alpha + \beta \cdot inc + u.$$

• These are all models linear in the parameters, i.e., we can write $y = X\beta + u$ for some X, some $\beta$, some y.

• Another interesting example is splines, i.e., piecewise linear functions. For example, suppose we have a scalar regressor x which is time, i.e., $x_t = t$, $t = 1, 2, \ldots, T$. Further suppose that
$$y = \begin{cases} \alpha_1 + \beta_1 x + u & \text{if } x \leq t_1^* \\ \alpha_2 + \beta_2 x + u & \text{if } t_1^* \leq x \leq t_2^* \\ \alpha_3 + \beta_3 x + u & \text{if } x \geq t_2^*. \end{cases}$$

• This can be expressed as follows:
$$y = \alpha_1 + \beta_1 x + \gamma_1 D_1 + \delta_1 D_1 x + \gamma_2 D_2 + \delta_2 D_2 x + u,$$
where
$$D_1 = \begin{cases} 1 & \text{if } x \geq t_1^* \\ 0 & \text{else}, \end{cases} \qquad D_2 = \begin{cases} 1 & \text{if } x \geq t_2^* \\ 0 & \text{else}. \end{cases}$$

• How do we impose that the function joins up at the knots? We must have
$$\alpha_1 + \beta_1 t_1^* = \alpha_1 + \gamma_1 + (\beta_1 + \delta_1)t_1^*$$
$$\alpha_1 + \gamma_1 + (\beta_1 + \delta_1)t_2^* = \alpha_1 + \gamma_1 + \gamma_2 + (\beta_1 + \delta_1 + \delta_2)t_2^*,$$
which implies that
$$\gamma_1 = -\delta_1 t_1^* \quad \text{and} \quad \gamma_2 = -\delta_2 t_2^*,$$
two linear restrictions on the parameters, i.e.,
$$y = \alpha_1 + \beta_1 x + (D_1x - D_1t_1^*)\delta_1 + (D_2x - D_2t_2^*)\delta_2 + u.$$
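The restricted spline can be estimated by regressing on the constructed variables $D_j(x - t_j^*)$; continuity at the knots then holds automatically. A NumPy sketch (the knots and coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
T, t1, t2 = 60, 20.0, 40.0
x = np.arange(1.0, T + 1)  # x_t = t
D1 = (x >= t1).astype(float)
D2 = (x >= t2).astype(float)

# regressors imposing the join-up restrictions gamma_j = -delta_j * t_j
Z = np.column_stack([np.ones(T), x, D1 * (x - t1), D2 * (x - t2)])
y = Z @ np.array([1.0, 0.5, -0.8, 1.2]) + 0.1 * rng.normal(size=T)

theta = np.linalg.solve(Z.T @ Z, Z.T @ y)

def fit(v):
    # fitted piecewise-linear function at a scalar point v
    return np.array([1.0, v, max(v - t1, 0.0), max(v - t2, 0.0)]) @ theta

# the fitted line joins up at both knots
print(abs(fit(t1 - 1e-9) - fit(t1 + 1e-9)) < 1e-6)
print(abs(fit(t2 - 1e-9) - fit(t2 + 1e-9)) < 1e-6)
```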
• Some functions are nonlinear in the parameters.

• Box-Cox:
$$y = \alpha + \beta\,\frac{x^\lambda - 1}{\lambda} + u,$$
where
$$\frac{x^\lambda - 1}{\lambda} \to \ln(x) \ \text{as}\ \lambda \to 0, \qquad \frac{x^\lambda - 1}{\lambda} \to x - 1 \ \text{as}\ \lambda \to 1.$$

• Real money demand:
$$y = \beta_1 x_1 + \frac{\beta_2}{x_2 - \gamma} + u.$$
If there exists $\gamma > 0$, then we have a liquidity trap.

• CES production function:
$$Q = \beta_1\left[\beta_2 K^{-\beta_3} + (1 - \beta_2)L^{-\beta_3}\right]^{-\beta_4/\beta_3} + u.$$

Methods for treating these models will be considered below.
Chapter 2 Statistical Properties of the OLS Estimator

• We now investigate the statistical properties of the OLS estimator in both the fixed and random designs. Specifically, we calculate its exact mean and variance. We shall examine later what happens when the sample size increases.

• The first thing to note in connection with $\hat\beta$ is that it is linear in y, i.e., there exists a matrix C not depending on y such that
$$\hat\beta = (X'X)^{-1}X'y = Cy.$$
This property makes a lot of calculations simple.

• We want to evaluate how $\hat\beta$ varies across hypothetical repeated samples. We shall examine both the fixed design and the random design case. The fixed design is the main setting we use in this course; it is simpler to work with and gives the main intuition. The random design approach is given here for completeness; it will become more relevant later in the course.

• Fixed Design. First,
$$E(\hat\beta) = (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta,$$
where this equality holds for all $\beta$. We say that $\hat\beta$ is unbiased.
b
• Furthermore, we shall calculate the K × K covariance matrix of ! ,
b b b b
var(! ) = E {(! " ! )(! " ! )0 }.
b b
This has diagonal elements var(! j ) and o! -diagonals cov(! j , ! k ). We have var((X 0 X )!1 X 0 y) = (X 0 X )!1 X 0varyX (X 0 X )!1 = E {(X 0 X )!1 X 0 ##0 X (X 0 X )!1 } = (X 0 X )!1 X 0" 2 IX (X 0 X )!1 = " 2 (X 0 X )!1 . • Random Design . For this result we need E [#i |xi ] = 0. We Þrst condition on the matrix X ; this results in a Þxed design and the above results hold. Thus, if we are after conditional results, we can stop here. If we want to calculate unconditional mean and variance we must now average over all possible X designs. Thus
b b
E (! ) = E {E (! |X )} = E (! ) = ! . On average we get the true parameter ! . Note that this calculation uses the important property called “The Law of Iterated Expectation”. The most general version of this says that E (Y |I 1 ) = E [E (Y |I 2 )|I 1 ], whenever I 1 ' I 2 for two information sets I 1 , I 2 . • Note that if only E [xi #i ] = 0, then the above calculation may not be valid. For example, suppose that Y i = X i3 , where X i is i.i.d. standard normal. Then ! = 3 minimizes E [(Y i " bX i )2 ]. Now consider the least squares estimator
P b P ! =
n i=1 X i Y i n 2 i=1 X i
=
PP
n 4 i=1 X i . n 2 i=1 X i
You can’t show that this is unbiased, and indeed it isn’t. • As for the variance, we use another important property var[y] =E [var(y|X )] + var[E (y|X )],
27 which is established by repeated application of the law of iterated expectation. We now obtain
b b
var(! ) = E var(! |X ) = " 2 E {(X 0 X )!1 }. This is not quite the same answer as in the Þxed design case, and the interpretation is of course di! erent. • The properties of an individual coe "cient can be obtained from the partitioned regression formula
• The properties of an individual coefficient can be obtained from the partitioned regression formula
$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2y.$$

• In the fixed design,
$$\mathrm{var}[\hat\beta_1] = (X_1'M_2X_1)^{-1}X_1'M_2E(\varepsilon\varepsilon')M_2X_1(X_1'M_2X_1)^{-1} = \sigma^2(X_1'M_2X_1)^{-1}.$$

• In the special case that $X_2 = (1, \ldots, 1)'$, we have
$$\mathrm{var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}.$$
This is the well-known variance of the least squares estimator in the single-regressor-plus-intercept regression.

• We now turn to the distribution of $\hat\beta$. This will be important when we want to conduct hypothesis tests and construct confidence intervals. In order to get the exact distribution we will need to make an additional assumption:
  — (A4) $y \sim N(X\beta, \sigma^2 I)$, or
  — (A4r) $y|X \sim N(X\beta, \sigma^2 I)$.

• Under A4,
$$\hat\beta \sim N(\beta, \sigma^2(X'X)^{-1})$$
in the fixed design case, because
$$\hat\beta = (X'X)^{-1}X'y = \sum_{i=1}^n c_i y_i,$$
i.e., $\hat\beta$ is a linear combination of independent normals.
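Unbiasedness and the variance formula can be checked by simulating repeated samples from a fixed design (a Monte Carlo sketch; all parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma = 40, 1.0
X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])  # fixed design
beta = np.array([1.0, 3.0])
V = sigma**2 * np.linalg.inv(X.T @ X)  # theoretical var(beta_hat)

# repeated samples: X held fixed, y redrawn each replication
reps = 20000
C = np.linalg.inv(X.T @ X) @ X.T
Y = X @ beta + sigma * rng.normal(size=(reps, n))  # each row is one sample
draws = Y @ C.T                                    # beta_hat for each sample

print(np.allclose(draws.mean(axis=0), beta, atol=0.05))  # unbiased
print(np.allclose(np.cov(draws.T), V, atol=0.02))        # var = sigma^2 (X'X)^{-1}
```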
• Under A4r, the conditional distribution of $\hat\beta$ given X is normal with mean $\beta$ and variance $\sigma^2(X'X)^{-1}$. However, the unconditional distribution will not be normal; in fact, it will be a scale mixture of normals, meaning that, in the scalar case for simplicity, its density function is
$$f_{\hat\beta}(z) = \int \frac{1}{\sigma v}\,\phi\!\left(\frac{z - \beta}{\sigma v}\right) g(v)\,dv,$$
where g is the density of $\left(\sum_{i=1}^n x_i^2\right)^{-1/2}$ and $\phi$ is the standard normal density function.

2.1 Optimality
• There are many estimators of $\beta$. Consider the scalar regression $y_i = \beta x_i + \varepsilon_i$. The OLS estimator is $\hat\beta = \sum_{i=1}^n x_iy_i / \sum_{i=1}^n x_i^2$. Also plausible are $\tilde\beta = \bar y/\bar x$ and $\check\beta = n^{-1}\sum_{i=1}^n y_i/x_i$, as well as nonlinear estimators such as the LAD procedure
$$\arg\min_\beta \sum_{i=1}^n |y_i - \beta x_i|.$$

• In fact, $\hat\beta$, $\tilde\beta$, and $\check\beta$ are all linear unbiased. How do we choose between estimators? Computational convenience is an important issue, but the above estimators are all similar in their computational requirements. We now investigate statistical optimality.

• Definition: The mean squared error (hereafter MSE) matrix of a generic estimator $\hat\theta$ of a parameter $\theta \in \mathbb{R}^p$ is
$$E[(\hat\theta - \theta)(\hat\theta - \theta)'] = \underbrace{E[(\hat\theta - E(\hat\theta))(\hat\theta - E(\hat\theta))']}_{\text{variance}} + \underbrace{[E(\hat\theta) - \theta][E(\hat\theta) - \theta]'}_{\text{squared bias}}.$$
• The MSE matrix is generally a function of the true parameter $\theta$. We would like a method that does well for all $\theta$, not just a subset of parameter values. The estimator $\hat\theta = 0$ is an example of a procedure that has MSE equal to zero at $\theta = 0$, and hence does well at this point, but as $\theta$ moves away, the MSE increases quadratically without limit.

• MSE defines a complete ordering when $p = 1$, i.e., one can always rank any two estimators according to MSE. When $p > 1$, this is not so. In the general case we say that $\hat\theta$ is better (according to MSE) than $\tilde\theta$ if
$$B \geq A$$
(i.e., $B - A$ is a positive semidefinite matrix), where B is the MSE matrix of $\tilde\theta$ and A is the MSE matrix of $\hat\theta$.

• For example, suppose that
$$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 2 & 0 \\ 0 & 1/4 \end{pmatrix}.$$
In this case, we cannot rank the estimators. The problem is due to the multivariate nature of the optimality criterion.

• One solution is to take a scalar function of the MSE such as the trace or determinant, which will result in a complete ordering. However, different functions will rank estimators differently [see the example above].

• Also note that no estimator can dominate uniformly across $\theta$ according to MSE, because it would have to beat all constant estimators, each of which has zero MSE at a single point. This is impossible unless there is no randomness.

• One solution is to change the criterion function. For example, we might take $\max_\theta \mathrm{tr}(MSE)$, which takes the most pessimistic view. In this case, we might try to find the estimator that minimizes this criterion; this would be called a minimax estimator. The theory for this class of estimators is very complicated, and in any case the criterion is not so desirable because it is so pessimistic about nature trying to do its worst to us.
30CHAPTER 2. STATISTICAL PROPERTIES OF THE OLS ESTIMATOR • Instead, we reduce the class of allowable estimators. If we restrict attention to unbiased estimators then this rules out estimators like - = 0 because they will be biased. In this case there is some hope of an optimality theory for the class of unbiased estimators.
b
• We will now return to the linear regression model and make the further restriction that the estimators we consider are linear in $y$. That is, we consider the set of all estimators $\tilde\beta$ satisfying
$$\tilde\beta = Ay$$
for some fixed matrix $A$ such that $E(\tilde\beta) = \beta$ for all $\beta$. This latter condition implies that $(AX - I)\beta = 0$ for all $\beta$, which is equivalent to $AX = I$.

• Gauss-Markov Theorem. Assume that A1-A3 hold. The OLS estimator $\hat\beta$ is Best Linear Unbiased (BLUE), i.e.,
$$\operatorname{var}(\hat\beta) \leq \operatorname{var}(\tilde\beta)$$
for any other linear unbiased estimator $\tilde\beta$.

• Proof. $\operatorname{var}(\hat\beta) = \sigma^2(X'X)^{-1}$ and $\operatorname{var}(\tilde\beta) = \sigma^2 AA'$, so, using $AX = I$,
$$\begin{aligned}
\operatorname{var}(\tilde\beta) - \operatorname{var}(\hat\beta) &= \sigma^2\left[AA' - (X'X)^{-1}\right] \\
&= \sigma^2\left[AA' - AX(X'X)^{-1}X'A'\right] \\
&= \sigma^2 A\left[I - X(X'X)^{-1}X'\right]A' \\
&= \sigma^2 AMA' \\
&= \sigma^2 (AM)(AM)' \geq 0,
\end{aligned}$$
since $M$ is symmetric and idempotent.
• Remarks:
— The theorem makes no assumption about the distribution of the errors; it assumes only mean zero and variance matrix $\sigma^2 I$.
— The result only compares linear estimators; it says nothing about, for example, the estimator that minimizes $\sum_{i=1}^n |y_i - x_i'\beta|$.
— The result only compares unbiased estimators [biased estimators can have zero variance]. In fact, although the OLS estimator is admissible with respect to MSE, it is inadmissible with respect to trace mean squared error when the number of regressors is at least three; the Stein estimator is better according to trace mean squared error. Of course, in large samples this is all irrelevant.
— There are extensions to affine estimators $\tilde\beta = a + Ay$ for vectors $a$. There are also equivalent results for the invariant quantity $\hat y$.
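The Gauss-Markov comparison can be illustrated numerically. The following is a minimal sketch (not part of the text, all names invented): any linear unbiased estimator has weight matrix $A = (X'X)^{-1}X' + C$ with $CX = 0$, so we construct such a $C$ and verify that the variance difference is positive semidefinite.

```python
import numpy as np

# A numerical sketch of the Gauss-Markov inequality (illustrative, not from
# the text): any linear unbiased estimator has weights A = (X'X)^{-1}X' + C
# with CX = 0; we build such a C and check the variance difference is PSD.
rng = np.random.default_rng(0)
n, K = 30, 3
X = rng.standard_normal((n, K))
XtX_inv = np.linalg.inv(X.T @ X)
A_ols = XtX_inv @ X.T                      # OLS weight matrix

M = np.eye(n) - X @ XtX_inv @ X.T          # annihilator M_X, so M @ X = 0
C = rng.standard_normal((K, n)) @ M        # any such C satisfies C @ X = 0
A_tilde = A_ols + C                        # another linear unbiased estimator
assert np.allclose(A_tilde @ X, np.eye(K)) # unbiasedness condition AX = I

sigma2 = 2.0
V_ols = sigma2 * XtX_inv
V_tilde = sigma2 * A_tilde @ A_tilde.T
eigvals = np.linalg.eigvalsh(V_tilde - V_ols)
assert eigvals.min() >= -1e-10             # difference is positive semidefinite
```

The key step mirrors the proof: $A_{\text{ols}}C' = 0$ because $X'M = 0$, so the difference reduces to $\sigma^2 CC'$.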
• If we dispense with the linearity restriction and add the model assumption of error normality, we get the following well-known result.

• Cramér-Rao Theorem. Under A1-A4, $\hat\beta$ is Best Unbiased (the statement is for MLEs).

• By making the stronger assumption A4, we get a much stronger conclusion. This allows us to compare, say, LAD estimation with OLS.

• Asymptotically, a very large class of estimators are both unbiased and indeed linear, so that the Gauss-Markov and Cramér-Rao Theorems apply to a very broad class of estimators once the words "for large $n$" are inserted.
Chapter 3

Hypothesis Testing

• In addition to point estimation, we often want to know how good our estimator is and whether it is compatible with certain preconceived 'hypotheses' about the data.

• Suppose that we observe certain data $(y, X)$, and there is a true data distribution denoted by $f$, which is known to lie in a family of models $\mathcal F$. We now suppose there is a further reduction, called a hypothesis, $H_0 \subset \mathcal F$. For example, $H_0$ could be the:
— Prediction of a scientific theory. For example, the interest elasticity of demand for money is zero; the gravitational constant is 9.
— Absence of some structure, e.g., independence of the error term over time, homoskedasticity, etc.
— Pretesting (used as part of the model building process).

• We distinguish between a simple hypothesis (under $H_0$, the data distribution is completely specified) and a composite hypothesis (in which case $H_0$ does not completely determine the distribution, i.e., there are 'nuisance' parameters not specified by $H_0$).

• We also distinguish between single and multiple hypotheses (one or more restrictions on the parameters of $f$).

• We shall also introduce the alternative hypothesis $H_A$, which will be the complement of $H_0$ in $\mathcal F$, i.e., $\mathcal F = H_0 \cup H_A$. That is, the choice of $\mathcal F$ is itself of some significance, since it can restrict the range of values taken by the data distribution. We shall also distinguish between one-sided and two-sided alternatives; when we have a single real-valued parameter this is an easy notion to comprehend.

• Examples
— The theoretical model is the Cobb-Douglas production function $Q = AK^\alpha L^\beta$. Empirical version: take logs and add an error term to give the linear regression $q = a + \alpha k + \beta\ell + \varepsilon$. It is often of interest whether constant returns to scale operate, i.e., we would like to test whether
$$\alpha + \beta = 1$$
is true. We may specify the alternative as $\alpha + \beta < 1$, because we can rule out increasing returns to scale.
— Market efficiency:
$$r_t = \mu + \gamma' I_{t-1} + \varepsilon_t,$$
where $r_t$ is the return on some asset held between periods $t-1$ and $t$, while $I_t$ is public information at time $t$. Theory predicts that $\gamma = 0$; there is no particular reason to restrict the alternative here.
— Structural change:
$$y_t = \alpha + \beta x_t + \gamma D_t + \varepsilon_t, \qquad D_t = \begin{cases} 0, & t < 1974 \\ 1, & t \geq 1974. \end{cases}$$
We would like to test $\gamma = 0$.
3.1 General Notations

• A hypothesis test is a rule [a function of the data] that yields either a reject or an accept outcome.

• There are two types of mistake that any rule can make:
— a Type I error is to reject the null hypothesis when it is true;
— a Type II error is to accept a false hypothesis.

• We would like both the Type I and the Type II error to be as small as possible. Unfortunately, these goals are usually in conflict. The traditional approach is to fix the Type I error and then try to do the best in terms of the Type II error.

• We choose $\alpha \in [0, 1]$, called the size of the test [the magnitude of the Type I error]. Let $T(\text{data})$ be a test statistic, typically scalar valued. Then find an acceptance region $C_\alpha$ of size $\alpha$ such that
$$\Pr[T \notin C_\alpha \mid H_0] = \alpha.$$
The rule is to reject $H_0$ if $T \notin C_\alpha$ and to accept otherwise. The practical problem is how to choose $T$ so that $C_\alpha$ [or, equivalently, the rejection region $R_\alpha$] is easy to find.

• Define also the power of the test:
$$\pi = \Pr[T \notin C_\alpha \mid H_A] = 1 - \text{Type II error}.$$
It is desirable, ceteris paribus, to have a test that maximizes power for any given size.

• Optimal testing. Neyman-Pearson Lemma. Suppose you have a parametric model with parameter $\theta$, and consider a simple null hypothesis against a one-sided alternative: $H_0: \theta = \theta_0$ versus $H_A: \theta > \theta_0$ or $H_A: \theta < \theta_0$. The likelihood ratio test is Uniformly Most Powerful (UMP) provided the parametric model has the monotone likelihood ratio (MLR) property. Examples: one-parameter exponential families, e.g., Normal, Poisson, and Binomial.

• Against two-sided alternatives, UMP tests do not exist.

• Example. $X \sim N(\mu, 1)$; $H_0: \mu = 0$ vs. $H_A: \mu > 0$. In this case the best rejection region is $\{\bar X > z_\alpha/n^{1/2}\}$. For any $\mu > 0$, this test is most powerful for $\mu = 0$ vs. $\mu$, and the region and rule are independent of $\mu$. The two-sided test with rejection region $\{|\bar X| > z_{\alpha/2}/n^{1/2}\}$ is less powerful than $\{\bar X > z_\alpha/n^{1/2}\}$ when $\mu > 0$, and less powerful than $\{\bar X < -z_\alpha/n^{1/2}\}$ when $\mu < 0$.

• Unbiased and invariant tests. Just as in estimation, it can help to reduce the class of tests. An unbiased test satisfies
$$\pi(\theta) \geq \alpha \text{ for all } \theta \in \Theta_1.$$
Clearly the one-sided test is biased, because its power falls below $\alpha$ (indeed, toward zero) when $\mu < 0$. The two-sided normal test above is UMP unbiased. Alternatively, one can eliminate some tests by requiring invariance under a group of transformations.
3.2 Examples

• Hypothesis testing in linear regression: $y \sim N(X\beta, \sigma^2 I)$.
— Single (linear) hypothesis:
$$c'\beta = \gamma \in \mathbb R, \quad \text{e.g., } \beta_2 = 0 \ (t\text{-test}).$$
— Multiple (linear) hypothesis:
$$\underset{q\times K}{R}\,\underset{K\times 1}{\beta} = \underset{q\times 1}{r}, \qquad q \leq K, \quad \text{e.g., } \beta_2 = \beta_3 = \cdots = \beta_K = 0.$$
— Single non-linear hypothesis:
$$\beta_1^2 + \beta_2^2 + \cdots + \beta_K^2 = 1.$$

• Note that these are all composite hypotheses, i.e., there are nuisance parameters like $\sigma^2$ that are not specified by the null hypothesis.
3.3 Test of a Single Linear Hypothesis

• We wish to test the hypothesis $c'\beta = \gamma$, e.g., $\beta_2 = 0$. Suppose that $y \sim N(X\beta, \sigma^2 I)$. Then
$$\frac{c'\hat\beta - \gamma}{\sigma\left(c'(X'X)^{-1}c\right)^{1/2}} \sim N(0, 1).$$
We don't know $\sigma$ and must replace it by an estimate. There are two widely used estimates:
$$\hat\sigma^2_{\text{mle}} = \frac{\hat\varepsilon'\hat\varepsilon}{n}, \qquad s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-K}.$$
The first is the maximum likelihood estimator of $\sigma^2$, as can easily be verified. The second is a modification of the MLE, which happens to be unbiased. Now define the test statistic
$$T = \frac{c'\hat\beta - \gamma}{s\left(c'(X'X)^{-1}c\right)^{1/2}}.$$

• Theorem. Under $H_0$, $T \sim t(n-K)$.

• Proof. We show that:
— (1) $\dfrac{n-K}{\sigma^2}\,s^2 \sim \chi^2_{n-K}$;
— (2) $s^2$ and $c'\hat\beta - \gamma$ are independent.
This establishes the theorem by the defining property of a $t$ random variable.

• Recall that
$$\frac{\varepsilon'\varepsilon}{\sigma^2} = \sum_{i=1}^n\left(\frac{\varepsilon_i}{\sigma}\right)^2 \sim \chi^2_n.$$
But the $\hat\varepsilon$ are residuals that use $K$ parameter estimates. Furthermore,
$$\hat\varepsilon'\hat\varepsilon = \varepsilon'M_X\varepsilon$$
and
$$E[\varepsilon'M_X\varepsilon] = E[\operatorname{tr}(M_X\varepsilon\varepsilon')] = \operatorname{tr}\left(M_X E(\varepsilon\varepsilon')\right) = \sigma^2\operatorname{tr}M_X = \sigma^2(n - \operatorname{tr}P_X) = \sigma^2(n-K),$$
because $\operatorname{tr}(X(X'X)^{-1}X') = \operatorname{tr}(X'X(X'X)^{-1}) = \operatorname{tr}I_K = K$. These calculations show that
$$E[\hat\varepsilon'\hat\varepsilon] = \sigma^2(n-K),$$
which suggests that $\hat\varepsilon'\hat\varepsilon/\sigma^2$ cannot be $\chi^2_n$ [and, incidentally, that $Es^2 = \sigma^2$].

• Note that $M_X$ is a symmetric idempotent matrix, which means that it can be written $M_X = Q\Lambda Q'$, where $QQ' = I$ and $\Lambda$ is a diagonal matrix of eigenvalues, which in this case are either zero ($K$ of them) or one ($n-K$ of them). Furthermore, by a property of the normal distribution, $\varepsilon^* = Q'\varepsilon$ has exactly the same normal distribution as $\varepsilon$ [it has the same mean and variance, which is sufficient to determine a normal distribution]. Therefore,
$$\frac{\hat\varepsilon'\hat\varepsilon}{\sigma^2} = \frac{\varepsilon^{*\prime}\Lambda\varepsilon^*}{\sigma^2} = \sum_{i=1}^{n-K} z_i^2$$
for some i.i.d. standard normal random variables $z_i$, so $\hat\varepsilon'\hat\varepsilon/\sigma^2$ is $\chi^2_{n-K}$ by the definition of a chi-squared random variable.

• Furthermore, under $H_0$,
$$c'\hat\beta - \gamma = c'(X'X)^{-1}X'\varepsilon \qquad \text{and} \qquad \hat\varepsilon = M_X\varepsilon$$
are mutually uncorrelated, since
$$E[M_X\varepsilon\varepsilon'X(X'X)^{-1}c] = \sigma^2 M_X X(X'X)^{-1}c = 0.$$
Under normality, uncorrelatedness is equivalent to independence.

• We can now base a test of $H_0$ on
$$T = \frac{c'\hat\beta - \gamma}{s\left(c'(X'X)^{-1}c\right)^{1/2}},$$
using the $t_{n-K}$ distribution for an exact test under normality. One can test either one-sided or two-sided alternatives, i.e., reject if $|T| \geq t_{n-K}(\alpha/2)$ [two-sided alternative] or if $T \geq t_{n-K}(\alpha)$ [one-sided alternative].

• The above is a general rule and may require some computation in addition to $\hat\beta$. Sometimes one can avoid this: the computer automatically prints out results of the hypotheses $\beta_i = 0$, and one can redesign the null regression suitably. For example, suppose that
$$H_0: \beta_2 + \beta_3 = 1.$$
Substituting the restriction into the regression $y_i = \beta_1 + \beta_2 x_i + \beta_3 z_i + u_i$ gives the restricted regression $y_i - z_i = \beta_1 + \beta_2(x_i - z_i) + u_i$. So we can test whether $\beta_3 = 0$ in the regression
$$y_i - z_i = \beta_1 + \beta_2(x_i - z_i) + \beta_3 z_i + u_i.$$
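This reparameterization trick can be checked numerically. A minimal sketch with simulated data (all variable names are invented): the direct $t$ statistic for $c'\beta = 1$ with $c = (0,1,1)'$ and the $t$ statistic on the $z$ coefficient in the rewritten regression coincide exactly.

```python
import numpy as np

# Illustrative sketch (simulated data, hypothetical names): the
# reparameterization makes the built-in "coefficient = 0" t-test
# reproduce the direct t-test of H0: beta2 + beta3 = 1.
rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
z = rng.standard_normal(n)
y = 1.0 + 0.4 * x + 0.6 * z + rng.standard_normal(n)  # H0 holds: 0.4 + 0.6 = 1

def t_stat(X, y, c, gamma):
    """Direct t statistic for H0: c'beta = gamma."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return (c @ beta - gamma) / np.sqrt(s2 * (c @ XtX_inv @ c))

X = np.column_stack([np.ones(n), x, z])
T_direct = t_stat(X, y, np.array([0.0, 1.0, 1.0]), 1.0)

# Rewritten regression: y - z on (1, x - z, z); test the z coefficient.
Xr = np.column_stack([np.ones(n), x - z, z])
T_repar = t_stat(Xr, y - z, np.array([0.0, 0.0, 1.0]), 0.0)
assert np.isclose(T_direct, T_repar)       # identical test statistics
```

The equality is exact, not approximate: the rewritten design spans the same column space, so the residuals, $s^2$, and the relevant contrast all agree.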
3.4 Test of a Multiple Linear Hypothesis

• We now consider a test of the multiple hypothesis $R\beta = r$. Define the quadratic form
$$F = (R\hat\beta - r)'\left[s^2 R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/q = \frac{(R\hat\beta - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/q}{\left[(n-K)s^2/\sigma^2\right]/(n-K)}.$$

• If $y \sim N(X\beta, \sigma^2 I)$, then
$$F = \frac{\chi^2_q/q}{\chi^2_{n-K}/(n-K)} \sim F(q, n-K)$$
under $H_0$. The rule is: if $F \geq F_\alpha(q, n-K)$, reject $H_0$ at level $\alpha$. Note that we can only test against the two-sided alternative $R\beta \neq r$, because the statistic is based on squared quantities.

• Examples
— The standard $F$-test, which is output by computer packages, is of the hypothesis
$$\beta_2 = 0, \ldots, \beta_K = 0,$$
where the intercept $\beta_1$ is included. In this case $q = K-1$ and $H_0: R\beta = 0$, where
$$R = \left(0_{(K-1)\times 1}\ \vdots\ I_{K-1}\right).$$
The test statistic is compared with critical values from the $F(K-1, n-K)$ distribution.
— Structural change. The null hypothesis is
$$y = X\beta + u.$$
The alternative is
$$y_1 = X_1\beta_1 + u_1 \ (\text{the first } n_1 \text{ observations}), \qquad y_2 = X_2\beta_2 + u_2 \ (\text{the last } n_2 \text{ observations}),$$
where $n = n_1 + n_2$. Let
$$\underset{n\times 1}{y} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad X^* = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}, \quad \underset{2K\times 1}{\beta^*} = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \quad u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.$$
Then we can write the alternative regression as $y = X^*\beta^* + u$. Consider the null hypothesis $H_0: \beta_1 = \beta_2$, and let
$$\underset{K\times 2K}{R} = \left(I_K\ \vdots\ -I_K\right).$$
Compare with $F(K, n-2K)$.

• A confidence interval is just a critical region centred not at $H_0$ but at a function of the parameter estimates. For example,
$$c'\hat\beta \pm t_{\alpha/2}(n-K)\,s\left\{c'(X'X)^{-1}c\right\}^{1/2}$$
is a two-sided confidence interval for the scalar quantity $c'\beta$. One can also construct one-sided confidence intervals and multivariate confidence regions.
3.5 Test of a Multiple Linear Hypothesis Based on Fit

• The idea behind the $F$ test is that under $H_0$, $R\hat\beta - r$ should be stochastically small, but under the alternative hypothesis it will not be.

• An alternative approach is based on fit. If we estimate $\beta$ subject to the restriction $R\beta = r$, then the sum of squared residuals from that regression should be close to that from the unconstrained regression when the null hypothesis is true [but if it is false, the two regressions will have different fitting power].

• To understand this we must investigate the restricted least squares estimation procedure.
— Unrestricted regression:
$$\min_b (y - Xb)'(y - Xb) \ \Rightarrow\ \hat\beta, \quad \hat u = y - X\hat\beta, \quad Q = \hat u'\hat u.$$
— Restricted regression:
$$\min_b (y - Xb)'(y - Xb) \text{ subject to } Rb = r \ \Rightarrow\ \beta^*, \quad u^* = y - X\beta^*, \quad Q^* = u^{*\prime}u^*.$$

• The idea is that under $H_0$, $Q^* \approx Q$, but under the alternative the two quantities differ. The following theorem makes this precise.

• Theorem. Under $H_0$,
$$F = \frac{Q^* - Q}{Q}\cdot\frac{n-K}{q} \sim F(q, n-K).$$

• Proof. We show that
$$Q^* - Q = (R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r).$$
Then, since $s^2 = Q/(n-K)$, the result is established.

• To solve the restricted least squares problem we use the Lagrangean method. We know that $\beta^*$ and $\lambda^*$ solve the first order conditions of the Lagrangean
$$L(b, \lambda) = \frac{1}{2}(y - Xb)'(y - Xb) + \lambda'(Rb - r).$$
The first order conditions are
$$-X'y + X'X\beta^* + R'\lambda^* = 0 \qquad (1)$$
$$R\beta^* = r. \qquad (2)$$
Now, from (1),
$$R'\lambda^* = X'y - X'X\beta^* = X'u^*,$$
which implies that
$$(X'X)^{-1}R'\lambda^* = (X'X)^{-1}X'y - (X'X)^{-1}X'X\beta^* = \hat\beta - \beta^*$$
and
$$R(X'X)^{-1}R'\lambda^* = R\hat\beta - R\beta^* = R\hat\beta - r.$$
Therefore,
$$\lambda^* = \left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)$$
and
$$\beta^* = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r).$$
This gives the restricted least squares estimator in terms of the restrictions and the unrestricted least squares estimator. From this relation we can derive the statistical properties of the estimator $\beta^*$.

• We now return to the testing question. First, write $\beta^* = \hat\beta + (\beta^* - \hat\beta)$, so that
$$\begin{aligned}
(y - X\beta^*)'(y - X\beta^*) &= \left[y - X\hat\beta - X(\beta^* - \hat\beta)\right]'\left[y - X\hat\beta - X(\beta^* - \hat\beta)\right] \\
&= (y - X\hat\beta)'(y - X\hat\beta) + (\hat\beta - \beta^*)'X'X(\hat\beta - \beta^*) - 2(y - X\hat\beta)'X(\beta^* - \hat\beta) \\
&= \hat u'\hat u + (\hat\beta - \beta^*)'X'X(\hat\beta - \beta^*),
\end{aligned}$$
using the orthogonality property $X'(y - X\hat\beta) = 0$ of the unrestricted least squares estimator. Therefore,
$$Q^* - Q = (\hat\beta - \beta^*)'X'X(\hat\beta - \beta^*).$$
Substituting our formulae for $\hat\beta - \beta^*$ and $\lambda^*$ obtained above and cancelling, we get
$$Q^* - Q = (R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r),$$
as required.

• An intermediate representation is
$$Q^* - Q = \lambda^{*\prime}R(X'X)^{-1}R'\lambda^*.$$
This brings out the role of the Lagrange multipliers in defining the test statistic, and leads to the name 'Lagrange multiplier test'.

• Importance of the result: the fit version was easier to apply in the old days, before fast computers, because one can just run two separate regressions and use their sums of squared residuals. Special cases:
— Zero restrictions:
$$\beta_2 = \cdots = \beta_K = 0.$$
Then the restricted regression is easy. In this case $q = K-1$. Note that the $R^2$ can be used to do an $F$-test of this hypothesis. We have
$$R^2 = 1 - \frac{Q}{Q^*} = \frac{Q^* - Q}{Q^*},$$
which implies that
$$F = \frac{R^2/(K-1)}{(1-R^2)/(n-K)}.$$
— Structural change. Allow the coefficients to differ in the two periods. Partition
$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\ \begin{matrix} \}\,n_1 \\ \}\,n_2 \end{matrix}, \qquad \begin{aligned} y_1 &= X_1\beta_1 + u_1 \\ y_2 &= X_2\beta_2 + u_2, \end{aligned}$$
or
$$y = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + u.$$
The null is of no structural change, i.e., $H_0: \beta_1 = \beta_2$, with $R = (I\ \vdots\ -I)$.

• Consider the more general linear restrictions
$$\beta_1 + \beta_2 - 3\beta_4 = 1, \qquad \beta_6 + \beta_1 = 2.$$
These are harder to work with. Nevertheless, one can always reparameterize to obtain the restricted model as a simple regression. Partition $X$, $\beta$, and $R$ as
$$X = (\underset{n\times(K-q)}{X_1}, \underset{n\times q}{X_2}), \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad R = (\underset{q\times(K-q)}{R_1}, \underset{q\times q}{R_2}),$$
where $X_1\beta_1 + X_2\beta_2 = X\beta$ and $R_1\beta_1 + R_2\beta_2 = r$, and where $R_2$ is of full rank and invertible.

• Therefore,
$$\beta_2 = R_2^{-1}(r - R_1\beta_1)$$
and
$$X\beta = X_1\beta_1 + X_2R_2^{-1}(r - R_1\beta_1) = (X_1 - X_2R_2^{-1}R_1)\beta_1 + X_2R_2^{-1}r,$$
so that
$$y - X_2R_2^{-1}r = (X_1 - X_2R_2^{-1}R_1)\beta_1 + u.$$

• In other words, we can regress
$$y^* = y - X_2R_2^{-1}r \qquad \text{on} \qquad X_1^* = X_1 - X_2R_2^{-1}R_1$$
to get $\beta_1^*$, and then define
$$\beta_2^* = R_2^{-1}(r - R_1\beta_1^*).$$
We then define $u^* = y - X_1\beta_1^* - X_2\beta_2^*$ and $Q^*$ accordingly.
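The restricted least squares formula and the key identity $Q^* - Q = (R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)$ can be verified directly. A minimal sketch on simulated data (names and numbers invented):

```python
import numpy as np

# Sketch (simulated data): restricted least squares via the Lagrangean
# formula, checking Q* - Q = (Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r).
rng = np.random.default_rng(2)
n, K = 100, 4
X = rng.standard_normal((n, K))
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.standard_normal(n)

R = np.array([[0.0, 1.0, 1.0, 0.0]])       # one restriction: beta2 + beta3 = 0
r = np.array([0.0])
q = R.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                      # unrestricted OLS
Q = np.sum((y - X @ b) ** 2)

W = np.linalg.inv(R @ XtX_inv @ R.T)
b_star = b - XtX_inv @ R.T @ W @ (R @ b - r)   # restricted estimator
Q_star = np.sum((y - X @ b_star) ** 2)

assert np.allclose(R @ b_star, r)          # restriction holds exactly
rhs = (R @ b - r) @ W @ (R @ b - r)
assert np.isclose(Q_star - Q, rhs)         # the identity in the proof
F = (Q_star - Q) / Q * (n - K) / q         # fit-based F statistic
```

Both equalities hold exactly (up to floating point), since they are algebraic identities rather than asymptotic approximations.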
3.6 Examples of F-Tests, t vs. F

• Chow tests: structural change with intercepts. The unrestricted model is
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} i_1 & 0 & x_1 & 0 \\ 0 & i_2 & 0 & x_2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix},$$
and let $\theta = (\alpha_1, \alpha_2, \beta_1, \beta_2)$. Different slopes and intercepts are allowed.

• The first null hypothesis is that the slopes are the same, i.e., $H_0: \beta_1 = \beta_2 = \beta$. The restricted regression is
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} i_1 & 0 & x_1 \\ 0 & i_2 & x_2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.$$
The test statistic is
$$F = \frac{(u^{*\prime}u^* - \hat u'\hat u)/\dim(\beta_1)}{\hat u'\hat u/(n - \dim(\theta))},$$
which is compared with the quantiles of the $F(\dim(\beta_1), n - \dim(\theta))$ distribution.

• The second null hypothesis is that the intercepts are the same, i.e., $H_0: \alpha_1 = \alpha_2 = \alpha$. The restricted regression, with parameters $(\alpha, \beta_1, \beta_2)$, is
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} i_1 & x_1 & 0 \\ i_2 & 0 & x_2 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.$$
Note that the unrestricted model can be rewritten using dummy variables:
$$y_i = \alpha + \beta x_i + \gamma D_i + \delta x_i D_i + u_i, \qquad D_i = \begin{cases} 1 & \text{in period 2} \\ 0 & \text{else.} \end{cases}$$
Then in period 1,
$$y_i = \alpha + \beta x_i + u_i,$$
while in period 2,
$$y_i = (\alpha + \gamma) + (\beta + \delta)x_i + u_i.$$
The null hypothesis of equal intercepts is that $\gamma = 0$.

• But now suppose that $n_2 < K$. The restricted regression is fine, but the unrestricted regression runs into problems in the second period because $n_2$ is too small; in fact $\hat u_2 \equiv 0$. In this case we must simply acknowledge that the degrees of freedom lost are $n_2$, not $K$. Thus
$$F = \frac{(Q^* - Q)/n_2}{Q/(n_1 - K)} \sim F(n_2, n_1 - K)$$
is a valid test in this case.
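The Chow-type comparison of restricted and unrestricted fits can be sketched in a few lines, using the dummy-variable form of the unrestricted model. Simulated data with $H_0$ true; all names are invented:

```python
import numpy as np

# Sketch (simulated data, H0 true): fit-based Chow F test of equal
# intercepts and slopes across two subperiods, via the dummy-variable form.
rng = np.random.default_rng(3)
n1, n2 = 60, 40
n = n1 + n2
x = rng.standard_normal(n)
y = 1.0 + 0.5 * x + rng.standard_normal(n)        # same coefficients throughout

D = np.r_[np.zeros(n1), np.ones(n2)]              # period-2 dummy
ones = np.ones(n)

def ssr(Xmat, y):
    b, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    return np.sum((y - Xmat @ b) ** 2)

Q_star = ssr(np.column_stack([ones, x]), y)             # restricted (pooled)
Q = ssr(np.column_stack([ones, x, D, D * x]), y)        # unrestricted
q, K_unr = 2, 4
F = ((Q_star - Q) / q) / (Q / (n - K_unr))              # compare with F(2, n - 4)
assert Q_star >= Q
assert F >= 0
```

Under $H_0$ the statistic would be compared with $F(2, n-4)$ critical values; here $q = 2$ because the restrictions are $\gamma = 0$ and $\delta = 0$.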
3.7 Likelihood Based Testing

• We have considered several different approaches, which all led to the $F$ test in linear regression. We now consider a general class of test statistics based on the likelihood function. In principle these apply to any parametric model, but at this stage we shall just consider their application to linear regression.

• The likelihood is denoted $L(y, X; \theta)$, where $(y, X)$ are the observed data and $\theta$ is a vector of unknown parameters. The maximum likelihood estimator can be determined from $L(y, X; \theta)$, as we have already discussed. This quantity is also useful for testing.

• Consider again the linear restrictions $H_0: R\theta = r$.
— The unrestricted maximum likelihood estimator of $\theta$ is denoted by $\hat\theta$;
— the restricted MLE is denoted by $\theta^*$ [this maximizes $L$ subject to the restrictions $R\theta - r = 0$].

• Now define the following test statistics:
$$LR: \quad 2\log\frac{L(\hat\theta)}{L(\theta^*)} = 2\{\log L(\hat\theta) - \log L(\theta^*)\}$$
$$Wald: \quad (R\hat\theta - r)'\left[R\,H(\hat\theta)^{-1}R'\right]^{-1}(R\hat\theta - r)$$
$$LM: \quad \frac{\partial\log L}{\partial\theta}(\theta^*)'\,H(\theta^*)^{-1}\,\frac{\partial\log L}{\partial\theta}(\theta^*),$$
where
$$H(\theta) = -\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}(\theta).$$
— The Wald test only requires computation of the unrestricted estimator;
— the Lagrange multiplier test only requires computation of the restricted estimator;
— the likelihood ratio requires computation of both.
— There are circumstances where the restricted estimator is easier to compute, and situations where the unrestricted estimator is easier to compute. These computational differences are what has motivated the use of either the Wald or the LM test.
— When it comes to nonlinear restrictions $g(\theta) = 0$, the LR test has the advantage that it is invariant to the parameterization, while the Wald test is affected by the way in which the restrictions are expressed.

• In the linear regression case $\theta = (\beta, \sigma^2)$, and the restrictions apply only to $\beta$, so that $R\beta = r$. Therefore we can replace the derivatives with respect to $\theta$ by derivatives with respect to $\beta$ only.

• The log-likelihood is
$$\log L(\theta) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}u(\beta)'u(\beta)$$
and its derivatives are
$$\begin{aligned}
\frac{\partial\log L}{\partial\beta} &= \frac{1}{\sigma^2}X'u(\beta) \\
\frac{\partial\log L}{\partial\sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u(\beta)'u(\beta) \\
\frac{\partial^2\log L}{\partial\beta\,\partial\beta'} &= -\frac{1}{\sigma^2}X'X \\
\frac{\partial^2\log L}{\partial(\sigma^2)^2} &= \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}u(\beta)'u(\beta) \\
\frac{\partial^2\log L}{\partial\beta\,\partial\sigma^2} &= -\frac{1}{\sigma^4}X'u(\beta).
\end{aligned}$$

• The Wald test is
$$W = (R\hat\beta - r)'\left[\hat\sigma^2 R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r) = \frac{Q^* - Q}{Q/n},$$
where $\hat\sigma^2 = Q/n$ is the MLE of $\sigma^2$.
— The Wald statistic is the same as the $F$ statistic apart from the use of $\hat\sigma^2$ instead of $s^2$ and a multiplicative factor $q$. In fact,
$$W = qF\,\frac{n}{n-K},$$
which is approximately equal to $qF$ when the sample size is large.

• The Lagrange Multiplier (or Score, or Rao) test statistic is
$$LM = \frac{u^{*\prime}X}{\sigma^{*2}}\left\{\frac{X'X}{\sigma^{*2}}\right\}^{-1}\frac{X'u^*}{\sigma^{*2}},$$
where $\sigma^{*2} = Q^*/n$.
— Recall that $X'u^* = R'\lambda^*$. Therefore,
$$LM = \frac{\lambda^{*\prime}R(X'X)^{-1}R'\lambda^*}{\sigma^{*2}},$$
where $\lambda^*$ is the vector of Lagrange multipliers evaluated at the optimum.
— Furthermore, we can write the score test as
$$LM = \frac{Q^* - Q}{Q^*/n} = n\left(1 - \frac{Q}{Q^*}\right).$$
When the restrictions are the standard zero ones, the test statistic is $n$ times the $R^2$ from the unrestricted regression.

• The likelihood ratio. We have
$$\log L(\hat\beta, \hat\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat\sigma^2 - \frac{1}{2\hat\sigma^2}\hat u'\hat u = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat\sigma^2 - \frac{n}{2}$$
and
$$\log L(\beta^*, \sigma^{*2}) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^{*2} - \frac{1}{2\sigma^{*2}}u^{*\prime}u^* = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^{*2} - \frac{n}{2},$$
because $\hat\sigma^2 = \hat u'\hat u/n$ and $\sigma^{*2} = u^{*\prime}u^*/n$. Therefore,
$$LR = 2\log\frac{L(\hat\beta, \hat\sigma^2)}{L(\beta^*, \sigma^{*2})} = n\left[\log\frac{Q^*}{n} - \log\frac{Q}{n}\right] = n[\log Q^* - \log Q].$$

• Note that $W$, $LM$, and $LR$ are all monotonic functions of $F$; in fact,
$$W = F\,\frac{qn}{n-K}, \qquad LM = \frac{W}{1 + W/n}, \qquad LR = n\log\left(1 + \frac{W}{n}\right).$$
If we knew the exact distribution of any one of them, we could obtain the exact distributions of the others, and the test result would be the same.

• However, in practice one uses asymptotic critical values, which can lead to differences in outcomes. We have
$$LM \leq LR \leq W,$$
so that the Wald test will reject more frequently than the LR and LM tests, supposing that the same critical values are used.

• Also, $qF \leq W$.
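The algebraic relations among $W$, $LM$, $LR$, and $F$ can be confirmed numerically. A minimal sketch on simulated data with zero restrictions (all names invented):

```python
import numpy as np

# Sketch (simulated data, zero restrictions): verify numerically that W, LM,
# and LR are the stated monotone functions of F and that LM <= LR <= W.
rng = np.random.default_rng(4)
n, K = 80, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, K - 1))])
y = X @ np.array([1.0, 0.3, 0.0]) + rng.standard_normal(n)

def ssr(Xmat):
    b, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    return np.sum((y - Xmat @ b) ** 2)

Q = ssr(X)                               # unrestricted
Q_star = ssr(X[:, :1])                   # restricted: all slopes zero
q = K - 1

F = ((Q_star - Q) / q) / (Q / (n - K))
W = n * (Q_star - Q) / Q
LM = n * (Q_star - Q) / Q_star
LR = n * np.log(Q_star / Q)

assert np.isclose(W, q * F * n / (n - K))
assert np.isclose(LM, W / (1 + W / n))
assert np.isclose(LR, n * np.log(1 + W / n))
assert LM <= LR <= W
```

The ordering follows from $x/(1+x) \leq \log(1+x) \leq x$ applied at $x = W/n$, which is why it holds in every sample, not just on average.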
Chapter 4

Further Topics in Estimation

4.1 Omission of Relevant Variables

• Suppose that
$$y = X_1\beta_1 + X_2\beta_2 + u,$$
where the error term obeys the usual conditions.

• Suppose, however, that we regress $y$ on $X_1$ only. Then
$$\begin{aligned}
\hat\beta_1 &= (X_1'X_1)^{-1}X_1'y \\
&= (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + u) \\
&= \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'u,
\end{aligned}$$
so that
$$E(\hat\beta_1) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 = \beta_1 + \beta_{12}, \qquad \text{where } \beta_{12} = (X_1'X_1)^{-1}X_1'X_2\beta_2.$$
In general $\hat\beta_1$ is biased and inconsistent; the direction and magnitude of the bias depend on $\beta_2$ and on $X_1'X_2$.

• Example. Regressing wages on education gives a positive effect, but we are omitting ability. If ability has a positive effect on wages and is positively correlated with education, this would explain some of the positive effect. Similarly, a regression of wages on race/gender (discrimination) omits experience/education.
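The bias formula can be illustrated by simulation. A minimal sketch of the wages/ability story (all names and numbers invented): the short-regression coefficient decomposes exactly into the true coefficient, the bias term $(X_1'X_1)^{-1}X_1'X_2\beta_2$, and a sampling-noise term.

```python
import numpy as np

# Sketch of the wages/ability story (illustrative numbers): the short
# regression estimates beta_educ plus the omitted-variables bias term
# (X1'X1)^{-1} X1' X2 beta2, which is positive here.
rng = np.random.default_rng(5)
n = 5000
ability = rng.standard_normal(n)
educ = 0.8 * ability + rng.standard_normal(n)     # positively correlated
beta_educ, beta_abil = 1.0, 0.5
u = rng.standard_normal(n)
wage = beta_educ * educ + beta_abil * ability + u

X1 = educ.reshape(-1, 1)                          # short regression: omit ability
b1 = np.linalg.lstsq(X1, wage, rcond=None)[0][0]
bias = np.linalg.lstsq(X1, beta_abil * ability, rcond=None)[0][0]
noise = np.linalg.lstsq(X1, u, rcond=None)[0][0]
assert np.isclose(b1, beta_educ + bias + noise)   # exact decomposition
assert b1 > beta_educ                             # upward bias dominates here
```

The decomposition is exact because least squares is linear in the dependent variable; only the magnitude of the bias term depends on the simulated correlation.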
• What about the variance? In a fixed design, the variance is $\sigma^2(X_1'X_1)^{-1}$, which is smaller than when $X_2$ is included. Therefore, if MSE is the criterion, one may actually prefer this procedure, at least in finite samples.

• The estimated variance is $s^2(X_1'X_1)^{-1}$, where
$$s^2 = \frac{y'M_1y}{n-K_1} = \frac{(X_2\beta_2 + u)'M_1(X_2\beta_2 + u)}{n-K_1},$$
which has expectation
$$E(s^2) = \sigma^2 + \frac{\beta_2'X_2'M_1X_2\beta_2}{n-K_1} \geq \sigma^2,$$
since $M_1$ is a positive semidefinite matrix.
— Therefore, the estimated variance of $\hat\beta_1$ is upwardly biased.

• If $X_1'X_2 = 0$, then $\hat\beta_1$ is unbiased, but the standard errors are still biased, with expectation
$$\sigma^2 + \frac{\beta_2'X_2'X_2\beta_2}{n-K_1}.$$
In this special case, the $t$-ratio is downward biased.

• More generally, the $t$-ratio could be upward or downward biased, depending of course on the direction of the bias of $\hat\beta_1$.

• Some common examples of omitted variables:
— seasonality;
— dynamics;
— nonlinearity.
• In practice we might suspect that there are always going to be omitted variables. The question is: is the magnitude large and the direction unambiguous? To address this question, we first look at the consequences of including too many variables in the regression.

4.2 Inclusion of Irrelevant Variables

• Suppose now that $y = X_1\beta_1 + u$, where $u$ obeys the usual conditions.

• However, we regress $y$ on both $X_1$ and $X_2$. Then
$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2y = \beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2u.$$

• Therefore,
$$E(\hat\beta_1) = \beta_1 \text{ for all } \beta_1, \qquad \operatorname{var}(\hat\beta_1) = \sigma^2(X_1'M_2X_1)^{-1}.$$

• Compare this with the variance from regressing $y$ on $X_1$ alone, which is only $\sigma^2(X_1'X_1)^{-1}$. Now
$$X_1'X_1 - X_1'M_2X_1 = X_1'P_2X_1 \geq 0,$$
which implies that
$$(X_1'X_1)^{-1} - (X_1'M_2X_1)^{-1} \leq 0.$$
As far as variance is concerned, we are always better off with the smaller model.

• We can generalize this discussion to the case where we have linear restrictions $R\beta = r$, in which case the restricted estimator is
$$\beta^* = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r).$$
If we estimate by restricted least squares we get a smaller variance, but if the restriction is not true, then there is a bias.

• There is clearly a trade-off between bias and variance.

• The above arguments suggest that including irrelevant variables never leads to bias, but this is not correct. We relied above on the assumption that the included regressors are all fixed, and therefore that the error term is uncorrelated with them. Clearly, if one of the included right-hand-side variables were, say, $y$ itself, then you would definitely get a biased estimate of the coefficients on the remaining variables.
4.3
Model Selection
• Let M be a collection of linear regression models obtained from a given set of K regressors X = (X 1 , . . . , XK ), e.g., X, X 1 , ( X 2 , X 27 ),etc. Suppose that the true model lies in M. There are a total of (2K " 1) di! erent subsets of X, i.e., models. • Let K j be the number of explanatory variables in a given regression. The following criteria can be used for selecting the ‘best’ regression: 2
R j = 1 "
n"1 n " 1 u j u j (1 " R j2 ) = 1 " , n " K j n " K j u0 u
b b µ ¶ b b b b b b
u j0 u j P C j = n " K j
1+
K j n
b b
u j0 u j 2K j AIC j = ln + n n u j0 u j K j log n BI C j = ln + . n n The Þrst criterion should be maximized, while the others should be 2 minimized. Note that maximizing R j is equivalent to minimizing the unbiased variance estimate u j u j /(n " K j ). • It has been shown that all these methods have the property that the selected model is larger than or equal to the true model with probability tending to one; only BI C j correctly selects the true model with probability tending to one. • M may be large and computing 2 K " 1 regressions infeasible.
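The criteria above are straightforward to compute. A minimal sketch on simulated data (names and numbers invented), comparing a true model with one that adds an irrelevant regressor:

```python
import numpy as np

# Sketch (simulated data): the selection criteria above for a small and a
# larger candidate model; x2 is an irrelevant regressor.
rng = np.random.default_rng(6)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = 1.0 + 0.8 * x1 + rng.standard_normal(n)

def criteria(Xmat):
    b, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    ssr = np.sum((y - Xmat @ b) ** 2)
    Kj = Xmat.shape[1]
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr / tss
    r2bar = 1 - (n - 1) / (n - Kj) * (1 - r2)
    pc = ssr / (n - Kj) * (1 + Kj / n)
    aic = np.log(ssr / n) + 2 * Kj / n
    bic = np.log(ssr / n) + Kj * np.log(n) / n
    return r2bar, pc, aic, bic

ones = np.ones(n)
small = criteria(np.column_stack([ones, x1]))         # true model, K_j = 2
big = criteria(np.column_stack([ones, x1, x2]))       # adds x2, K_j = 3
# BIC penalizes the extra regressor more than AIC here, since log(n) > 2
assert big[3] - small[3] > big[2] - small[2]
```

The final assertion is deterministic: the log-SSR terms cancel in the difference, leaving only the penalties, and $\log n > 2$ whenever $n > e^2 \approx 7.4$. This is the mechanism behind BIC's consistency property mentioned above.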
• The true model may not be in $\mathcal M$, but the procedure is guaranteed to find a best model (data mining).

• Other criteria are important, especially for nonexperimental data:
— Consistency with economic theory: are the elasticities the right sign? Does demand slope down?
— Consistency with the data: e.g., if the dependent variable is a food share $s \in [0, 1]$, then ideally we don't want a model that predicts outside this range.
— Residuals should be approximately random, i.e., pass diagnostic checks for serial correlation, heteroskedasticity, nonlinearity, etc.
— How well the model performs out-of-sample (often used in time series analysis).
— Correlation is not causation.

• An alternative strategy is to choose a large initial model and perform a sequence of $t$-tests to eliminate redundant variables. Finally, we give a well-known result that links the regression $t$ test and the $R^2$:

• $\bar R^2$ falls (rises) when the deleted variable has $|t| > 1$ ($|t| < 1$).
4.4 Multicollinearity

• Exact multicollinearity: $X'X$ is singular, i.e., there is an exact linear relationship between the variables in $X$. In this case we cannot define the least squares estimate
$$\hat\beta = (X'X)^{-1}X'y.$$
Solution: find a minimal (not unique) basis $X^*$ for the column space $C(X)$ and do least squares on it.

• Example: seasonal dummies
$$D_q = \begin{cases} 1 & \text{if Quarter } q \\ 0 & \text{else,} \end{cases} \qquad q = 1, 2, 3, 4.$$
Define the regressor matrix $X = (i, D_1, D_2, D_3, D_4)$, where $i$ is the column of ones. Then, for all observations,
$$\text{Col}_2 + \text{Col}_3 + \text{Col}_4 + \text{Col}_5 = \text{Col}_1.$$

• Solution:
— drop $D_4$ and run
$$y = \alpha + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + u;$$
— or drop the intercept and run
$$y = \gamma_1 D_1 + \gamma_2 D_2 + \gamma_3 D_3 + \gamma_4 D_4 + u.$$
Both give the same $\hat y$ and $\hat u$, but different parameters. Intuitively, the same vector space is generated by both sets of regressors.

• 'Approximate multicollinearity', i.e., $\det(X'X) \approx 0$. Informally, if the columns of $X$ are highly mutually correlated, then it is hard to get at their separate effects. This is really a misnomer and shouldn't be treated as a separate subject. Arthur Goldberger, in his econometrics text, illustrated the point by including a section on 'micronumerosity', a supposed problem where one has too few observations. The consequence is that the variance of the parameter estimates is large, which is precisely the symptom of 'approximate multicollinearity'.
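The two fixes for the dummy-variable trap can be compared directly. A minimal sketch on simulated quarterly data (names invented): both parameterizations give identical fitted values, and the coefficients map into each other as $\gamma_q = \alpha + \beta_q$ for $q = 1, 2, 3$ and $\gamma_4 = \alpha$.

```python
import numpy as np

# Sketch: the two ways out of the dummy-variable trap give identical fitted
# values; only the parameterization differs (simulated quarterly data).
rng = np.random.default_rng(7)
n = 120
quarter = np.tile([1, 2, 3, 4], n // 4)
D = np.column_stack([(quarter == q).astype(float) for q in (1, 2, 3, 4)])
y = D @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.standard_normal(n)
ones = np.ones(n)

X_a = np.column_stack([ones, D[:, :3]])     # drop D4, keep intercept
X_b = D                                     # keep all dummies, drop intercept

b_a, *_ = np.linalg.lstsq(X_a, y, rcond=None)
b_b, *_ = np.linalg.lstsq(X_b, y, rcond=None)
assert np.allclose(X_a @ b_a, X_b @ b_b)    # same y-hat (and residuals)
# Parameter mapping: gamma_q = alpha + beta_q (q = 1..3), gamma_4 = alpha
assert np.isclose(b_b[3], b_a[0])
assert np.allclose(b_b[:3], b_a[0] + b_a[1:])
```

Both designs are full rank and span the same four-dimensional space, which is why the fits agree exactly.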
4.5 Influential Observations

• At times one may suspect that some observations are having a large impact on the regression results. This could be a real influence, i.e., just part of the way the data were generated, or it could be because some observations have been misrecorded, say with an extra zero added on by a careless clerk.

• How do we detect influential observations? Delete one observation at a time and see what changes. Define the leave-one-out estimator and residual
$$\hat\beta(i) = \left[X(i)'X(i)\right]^{-1}X(i)'y(i), \qquad \hat u_j(i) = y_j - x_j'\hat\beta(i),$$
where
$$y(i) = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n)'$$
and similarly for $X(i)$. We shall say that observation $(x_i, y_i)$ is influential if $\hat u_i(i)$ is large.

• Note that
$$\hat u_i(i) = u_i - x_i'(\hat\beta(i) - \beta),$$
so that
$$E[\hat u_i(i)] = 0, \qquad \operatorname{var}[\hat u_i(i)] = \sigma^2\left[1 + x_i'\left(X(i)'X(i)\right)^{-1}x_i\right].$$
Then examine the standardized residuals
$$T_i = \frac{\hat u_i(i)}{s\left(1 + x_i'\left(X(i)'X(i)\right)^{-1}x_i\right)^{1/2}}.$$

• Large values of $T_i$, in comparison with the standard normal, are evidence of extreme observations or outliers. Unfortunately, we do not learn whether this is because the error distribution has a different shape from the normal, e.g., a $t$-distribution, or whether the observation has been misrecorded by some blundering clerk.
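The procedure can be sketched by brute force. Simulated data with one planted outlier (all names invented); the text leaves the choice of $s$ implicit, and this sketch computes it from the leave-one-out sample, which is one reasonable reading:

```python
import numpy as np

# Sketch (simulated data with one planted outlier): brute-force leave-one-out
# residuals and standardized statistics T_i; s is computed from the
# leave-one-out sample (an assumption, the text does not pin this down).
rng = np.random.default_rng(8)
n = 50
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.standard_normal(n)
y[0] += 10.0                                  # the 'careless clerk' observation

K = X.shape[1]
T = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    XtX_inv = np.linalg.inv(Xi.T @ Xi)
    b_i = XtX_inv @ Xi.T @ yi                 # beta-hat(i)
    u_ii = y[i] - X[i] @ b_i                  # leave-one-out residual
    s2 = np.sum((yi - Xi @ b_i) ** 2) / (n - 1 - K)
    T[i] = u_ii / np.sqrt(s2 * (1 + X[i] @ XtX_inv @ X[i]))

assert np.argmax(np.abs(T)) == 0              # the outlier stands out
```

In practice one would not refit $n$ times: standard updating formulas express $\hat u_i(i)$ in terms of the full-sample residual and leverage, but the brute-force version matches the definitions above most transparently.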
4.6 Missing Observations

• In surveys, respondents are not a representative sample of the full population. For example, suppose we have no information on people with $y > \$250{,}000$ or $y \leq \$5{,}000$. In this case $n^{-1}\sum_{i=1}^n y_i$ is biased and inconsistent as an estimate of the population mean.

• In regression, parameter estimates are biased if selection is:
— on the dependent variable (or on the error term);
— non-random. For example, there is no bias [although precision is affected] if, in a regression of income on education, we have missing data when $edu \geq 5$ years.

• We look at the 'ignorable' case, where the process of missingness is unrelated to the effect of interest.

• Missing $y$:
$$\begin{array}{ccc} y_A & X_A & n_A \\ ? & X_B & n_B. \end{array}$$

• What do we do? One solution is to impute values of the missing variable. In this case we might let
$$\hat y_B = X_B\hat\beta_A, \qquad \text{where } \hat\beta_A = (X_A'X_A)^{-1}X_A'y_A.$$
We can then recompute the least squares estimate of $\beta$ using 'all the data':
$$\hat\beta = (X'X)^{-1}X'\begin{pmatrix} y_A \\ \hat y_B \end{pmatrix}, \qquad X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}.$$
However, some simple algebra reveals that there is no new information in $\hat y_B$, and in fact $\hat\beta = \hat\beta_A$.

• To see this, start from
$$(X'X)^{-1}X'\begin{pmatrix} y_A \\ X_B\hat\beta_A \end{pmatrix} = (X_A'X_A)^{-1}X_A'y_A$$
and pre-multiply both sides by $X'X = X_A'X_A + X_B'X_B$. We have
$$X'\begin{pmatrix} y_A \\ X_B\hat\beta_A \end{pmatrix} = X_A'y_A + X_B'X_B\hat\beta_A = X_A'y_A + X_B'X_B(X_A'X_A)^{-1}X_A'y_A$$
and
$$X'X(X_A'X_A)^{-1}X_A'y_A = X_A'y_A + X_B'X_B(X_A'X_A)^{-1}X_A'y_A.$$
Therefore, this imputation method has not really added anything; it is not possible to improve estimation in this case.

• Now suppose instead that we have some missing $X$. For example, $X = (x, z)$ and $x_B$ is missing, i.e., we observe $(x_A, z_A, y_A)$ and $(z_B, y_B)$. The model for the complete data set is
$$y = \beta x + \gamma z + u, \qquad \operatorname{var}(u) = \sigma_u^2,$$
and suppose also that
$$x = \delta z + \epsilon,$$
where $\epsilon$ is an i.i.d. mean zero error term with $\operatorname{var}(\epsilon) = \sigma_\epsilon^2$.

• There are a number of ways of trying to use the information in sample B. First, predict $x_B$ by regressing $x_A$ on $z_A$:
$$\hat x_B = z_B(z_A'z_A)^{-1}z_A'x_A.$$
Then regress $y$ on
$$\hat X = \begin{pmatrix} x_A & z_A \\ \hat x_B & z_B \end{pmatrix}.$$

• The second approach is to write
$$\begin{aligned}
y_A &= \beta x_A + \gamma z_A + u_A \\
x_A &= \delta z_A + \epsilon_A \\
y_B &= (\gamma + \beta\delta)z_B + u_B + \beta\epsilon_B,
\end{aligned}$$
where we have substituted out $x_B$, which we don't observe. Now we can estimate $\beta$, $\gamma$, and $\delta$ from the A observations, denoting these estimates by $\hat\beta_A$, $\hat\gamma_A$, and $\hat\delta_A$. Then we have a new regression
$$y_B - \hat\beta_A\hat\delta_A z_B = \gamma z_B + e_B,$$
for some error term $e_B$ that includes $u_B + \beta\epsilon_B$ plus the estimation error in $\hat\beta_A\hat\delta_A$. This regression can be estimated jointly with
$$y_A - \hat\beta_A x_A = \gamma z_A + e_A.$$

• This sometimes improves matters, but sometimes does not! The answer depends on the relationship between $x$ and $z$. In any case, the effect $\beta$ of $x$ is not better estimated; the effect of $z$ may be improved. Griliches (1986) shows that the (asymptotic) relative efficiency of this approach to the least squares estimator that just uses the A sample is
$$(1-\lambda)\left(1 + \lambda\beta^2\frac{\sigma_\epsilon^2}{\sigma_u^2}\right),$$
where $\lambda$ is the fraction of the sample that is missing. Efficiency will be improved by this method when
$$\beta^2\frac{\sigma_\epsilon^2}{\sigma_u^2} < \frac{1}{1-\lambda},$$
i.e., the unpredictable part of x from z is not too large relative to the overall noise in the y equation. • Another approach. Let - = ( + !) . Then clearly, we can estimate from the B data by OLS say, call this -B . Then let ( B = -B " ! A ) A . Now consider the class of estimators
b
( (&) = & ( A + (1 " &)( B ,
b b
b
b b b b
as w varies. In Homework 2 we showed that the best choice of w is
  w_opt = (σ²_B − σ_AB) / (σ²_A + σ²_B − 2σ_AB),
where in our case σ²_A, σ²_B are the asymptotic variances of the two estimators and σ_AB is their asymptotic covariance. Intuitively, unless either σ²_A = σ_AB or σ²_B = σ_AB, we should be able to improve matters.
• What about the likelihood approach? Suppose for convenience that z is a fixed variable; then the log likelihood function of the observed data is
  Σ_A log f(y_A, x_A | z_A) + Σ_B log f(y_B | z_B).
Suppose that u and ε are normally distributed and mutually independent. Then
  (y_A, x_A)' ~ N( ( (γ + βδ) z_A , δ z_A )' , [ σ²_u + β²σ²_ε   βσ²_ε ; βσ²_ε   σ²_ε ] ),
  y_B ~ N( (γ + βδ) z_B , σ²_u + β²σ²_ε ),
which follows from the relations x = δz + ε and y = (γ + βδ)z + u + βε. There are five unknown parameters, θ = (γ, β, δ, σ²_u, σ²_ε), and the likelihood follows from this.
• The MLE is going to be quite complicated here, because the error variances depend on the mean parameter β, but it is going to be more efficient than the simple least squares estimator that only uses the A data.
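The variance-minimizing weight for combining the two estimators of γ, stated above, can be verified with a grid check. A sketch with purely illustrative variance numbers:

```python
import numpy as np

# Sketch: for two correlated estimators with variances sA2, sB2 and covariance
# sAB, check on a grid that w_opt = (sB2 - sAB)/(sA2 + sB2 - 2*sAB) minimizes
# the variance of w*g_A + (1-w)*g_B. The numbers are illustrative.
sA2, sB2, sAB = 2.0, 1.5, 0.5

def comb_var(w):
    return w**2 * sA2 + (1 - w)**2 * sB2 + 2 * w * (1 - w) * sAB

w_opt = (sB2 - sAB) / (sA2 + sB2 - 2 * sAB)
grid = np.linspace(-1, 2, 3001)
w_best = grid[np.argmin(comb_var(grid))]
print(w_opt, w_best)  # the grid minimizer agrees with the formula
```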
Chapter 5 Asymptotics

5.1 Types of Asymptotic Convergence
• Exact distribution theory is limited to very special cases [normal i.i.d. errors, linear estimators] or involves very difficult calculations. This is too restrictive for applications. By making approximations based on large sample sizes, we can obtain a distribution theory that is applicable in a much wider range of circumstances.
• Asymptotic theory involves generalizing the usual notions of convergence for real sequences to allow for random variables. We say that a real sequence x_n converges to a limit x_∞, denoted lim_{n→∞} x_n = x_∞, if for all ε > 0 there exists an n₀ such that |x_n − x_∞| < ε for all n ≥ n₀.
• Definition: We say that a sequence of random variables {X_n}_{n=1}^∞ converges in probability to a random variable X, denoted X_n →^P X or plim_{n→∞} X_n = X, if for all δ > 0,
  lim_{n→∞} Pr[ |X_n − X| > δ ] = 0.
X could be a constant or a random variable.
• Definition: We say that {X_n}_{n=1}^∞ converges almost surely (or with probability one) to a random variable X, denoted X_n →^{a.s.} X, if
  Pr[ lim_{n→∞} X_n = X ] = 1.
• Definition: We say that {X_n}_{n=1}^∞ converges in distribution to a random variable X, denoted X_n →^D X, if
  lim_{n→∞} Pr[X_n ≤ x] = Pr[X ≤ x]
for all x at which the distribution function of X is continuous. Specifically, we often have results of the form
  n^{1/2} (θ̂ − θ) →^D N(0, σ²).
• Definition: We say that {X_n}_{n=1}^∞ converges in mean square to a random variable X, denoted X_n →^{m.s.} X, if
  lim_{n→∞} E[ |X_n − X|² ] = 0.
- This presumes, of course, that E[X_n²] < ∞ and E[X²] < ∞.
- When X is a constant,
  E[ |X_n − X|² ] = E[ |X_n − EX_n|² ] + |EX_n − X|² = var(X_n) + |EX_n − X|²,
and it is necessary and sufficient that EX_n → X and var(X_n) → 0.
• Mean square convergence implies convergence in probability. This follows from the Chebyshev inequality:
  Pr[ |X_n − X| > δ ] ≤ E[ |X_n − X|² ] / δ².
• Note that convergence in probability is stronger than convergence in distribution, but the two are equivalent when X is a constant (i.e., not random). Almost sure convergence implies convergence in probability, but there is no necessary relationship between almost sure convergence and convergence in mean square. There are examples where convergence in distribution does not imply convergence in probability.
5.2
Laws of Large Numbers and Central Limit Theorems
• (Kolmogorov Law of Large Numbers) Suppose that X₁, ..., X_n are independent and identically distributed (i.i.d.). Then a necessary and sufficient condition for
  (1/n) Σ_{i=1}^n X_i →^{a.s.} μ ≡ E(X₁)
is that E(|X_i|) < ∞.
• (Lindeberg-Lévy Central Limit Theorem) Suppose that X₁, ..., X_n are i.i.d. with E(X_i) = μ and var(X_i) = σ². Then
  n^{-1/2} Σ_{i=1}^n (X_i − μ)/σ →^D N(0, 1).
• These results are important because many estimators and test statistics can be reduced to sample averages or functions thereof. There are now many generalizations of these results for data that are not i.i.d., e.g., heterogeneous or dependent weighted sums. We give one example.
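A quick Monte Carlo illustration of the Lindeberg-Lévy CLT (simulated exponential data; the sample sizes are arbitrary):

```python
import numpy as np

# Monte Carlo illustration of the Lindeberg-Levy CLT: standardized means of
# i.i.d. exponential draws (mu = sigma = 1) are approximately N(0, 1).
rng = np.random.default_rng(1)
n, reps = 200, 20000
draws = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (draws.mean(axis=1) - 1.0) / 1.0

print(round(z.mean(), 2), round(z.std(), 2))   # close to 0 and 1
print(round(np.mean(np.abs(z) > 1.96), 3))     # close to 0.05
```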
• (Lindeberg-Feller) Let X₁, ..., X_n be independent random variables with E(X_i) = 0 and var(X_i) = σ²_i, and write s_n² = Σ_{j=1}^n σ²_j. Suppose also that Lindeberg's condition holds: for all ε > 0,
  (1/s_n²) Σ_{i=1}^n E[ X_i² 1( |X_i| > ε s_n ) ] → 0.
Then
  (1/s_n) Σ_{i=1}^n X_i →^D N(0, 1).
• A sufficient condition for the Lindeberg condition is the Lyapunov condition with third moments: (1/s_n³) Σ_{i=1}^n E[|X_i|³] → 0.
5.3
Additional Results
• Mann—Wald (continuous mapping) theorem.
- If X_n →^D X and g is continuous, then g(X_n) →^D g(X).
- If X_n →^P c, then g(X_n) →^P g(c).
• Slutsky theorem. If X_n →^D X and Y_n →^P c, then:
- X_n + Y_n →^D X + c;
- X_n Y_n →^D cX; and
- X_n / Y_n →^D X/c, provided c ≠ 0.
• Vector random variables. Consider the vector sequence X_n = (X_{n1}, ..., X_{nk})'. We have the result that
  ‖X_n − X‖ →^P 0, where ‖x‖ = (x'x)^{1/2} is the Euclidean norm,
if and only if
  |X_{nj} − X_j| →^P 0 for all j = 1, ..., k.
- The 'if' part is no surprise and follows from the continuous mapping theorem. The 'only if' part follows because if ‖X_n − X‖ < δ, then |X_{nj} − X_j| < δ for each j.
• Cramér-Wold device. A vector X_n converges in distribution to a normal vector X if and only if a'X_n converges in distribution to a'X for every vector a.
5.4
Applications to OLS
• We are now able to establish some results about the large sample properties of the least squares estimator. We start with the i.i.d. random design case, because the result is very simple.
• If we assume that:
- (x_i, ε_i) are i.i.d. with E(x_i ε_i) = 0;
- 0 < E[x_i x_i'] < ∞ and E[‖x_i ε_i‖] < ∞;
then
  β̂ →^P β.
• The proof comes from applying laws of large numbers to the numerator and denominator of
  β̂ − β = ( (1/n) Σ_{i=1}^n x_i x_i' )^{-1} (1/n) Σ_{i=1}^n x_i ε_i.
These regularity conditions are often regarded as unnecessarily strong and unsuited to the fixed design.
• We next consider the 'bare minimum' condition that works in the fixed design case and is perhaps more general, since it allows, for example, trending variables.
• Theorem. Suppose that A0-A2 hold and that
  λ_min(X'X) → ∞ as n → ∞.   (*)
Then β̂ →^P β.
• Proof. First, E(β̂) = β for all β. Then var(β̂) = σ² (X'X)^{-1}, where
  ‖(X'X)^{-1}‖ = λ_max((X'X)^{-1}) = 1/λ_min(X'X),
and provided (*) is true, var(β̂) → 0. Mean square convergence then implies convergence in probability.
• Suppose that x_i = i^α for some α. Then
  var(β̂) = σ² / Σ_{j=1}^n j^{2α} = { O(n^{-(2α+1)}) if α ≠ −1/2;  O(1/log n) if α = −1/2 }.
Therefore, consistency holds if and only if α ≥ −1/2.
• If we have a random design, then the conditions and the conclusion should be interpreted as holding with probability one in the conditional distribution given X. Under the above random design assumptions, (*) holds with probability one.
5.5
Asymptotic Distribution of OLS
• We first state the result for the simplest random design case.
• Suppose that:
- (x_i, ε_i) are i.i.d. with ε_i independent of x_i;
- E(ε_i²) = σ²;
- 0 < E[x_i x_i'] < ∞.
Then
  n^{1/2} (β̂ − β) →^D N( 0, σ² {E[x_i x_i']}^{-1} ).
• The proof uses the Mann—Wald and Slutsky theorems.
• We next consider the fixed design case [where the errors are still i.i.d.]. In this case, it suffices to have a vector central limit theorem for the weighted i.i.d. sequence
  β̂ − β = ( Σ_{i=1}^n x_i x_i' )^{-1} Σ_{i=1}^n x_i ε_i = Σ_{i=1}^n w_i ε_i,
for some weights w_i depending only on the X data. That is, the source of the heterogeneity is the fixed regressors.
• A sufficient condition for the scalar standardized random variable
  T_n = Σ_{i=1}^n w_i ε_i / ( σ² Σ_{i=1}^n w_i² )^{1/2}
to converge to a standard normal random variable is the following condition:
  max_{1≤i≤n} w_i² / Σ_{i=1}^n w_i² → 0.
This is a so-called negligibility requirement, which means that no one of the weights dominates every other term.
• Therefore,
  ( Σ_{i=1}^n x_i x_i' / σ² )^{1/2} (β̂ − β) →^D N(0, I),
provided the following negligibility condition holds:
  max_{1≤i≤n} x_i (X'X)^{-1} x_i' → 0 as n → ∞.
Actually, it suffices for the diagonal elements of this matrix to converge to zero. This condition is usually satisfied.
• If also X'X/n → M > 0, then
  n^{1/2} (β̂ − β) →^D N(0, σ² M^{-1}).
• Suppose k = 1; then the negligibility condition is
  max_{1≤i≤n} x_i² / Σ_{j=1}^n x_j² → 0.
For example, if x_i = i,
  max_{1≤i≤n} i² / Σ_{j=1}^n j² = n² / O(n³) → 0.
In this case, even though the largest element is increasing with the sample size, many other elements are increasing just as fast.
• An example where the CLT would fail is
  x_i = 1 if i < n, x_i = n if i = n.
In this case, the negligibility condition fails, and the distribution of the least squares estimator would be largely determined by the last observation.
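The contrast between a design satisfying negligibility and one failing it can be computed directly. A sketch:

```python
import numpy as np

# Sketch: compute max_i x_i^2 / sum_j x_j^2 for the trending design x_i = i
# (goes to zero) and for the pathological design x_i = 1 except x_n = n
# (stays bounded away from zero, so the CLT fails).
for n in (10, 100, 1000):
    x_trend = np.arange(1, n + 1, dtype=float)
    ratio_trend = x_trend.max()**2 / np.sum(x_trend**2)
    x_bad = np.ones(n)
    x_bad[-1] = n
    ratio_bad = x_bad[-1]**2 / np.sum(x_bad**2)
    print(n, round(ratio_trend, 4), round(ratio_bad, 4))
```

For x_i = i the ratio behaves like 3/n, while for the pathological design it tends to one: the last observation dominates.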
5.6
Order Notation
• In the sequel we shall use the order notation: X_n = o_p(δ_n) if
  X_n / δ_n →^P 0,
and X_n = O_p(δ_n) if X_n/δ_n is stochastically bounded, i.e., if for every ε > 0 there exists a K < ∞ such that
  limsup_{n→∞} Pr[ |X_n/δ_n| > K ] < ε.
The latter means that X_n is of no larger order than δ_n, while the former is stronger and says that X_n is of smaller order than δ_n. These concepts correspond to the o(·) and O(·) used in standard real analysis.
• The order symbols obey the following algebra, which is really just the Slutsky theorem:
  O_p(1) o_p(1) = o_p(1)
  O_p(a_n) O_p(b_n) = O_p(a_n b_n)
  O_p(a_n) + O_p(b_n) = O_p(max{a_n, b_n}).
5.7
Standard Errors and Test Statistics in Linear Regression
• We first consider the standard error. We have
  s² = û'û/(n − k)
     = (1/(n − k)) { u'u − u'X (X'X)^{-1} X'u }
     = (n/(n − k)) u'u/n − (1/(n − k)) (u'X/n^{1/2}) (X'X/n)^{-1} (X'u/n^{1/2}).
• Theorem. Suppose that the u_i are i.i.d. with finite fourth moment, and that the regressors are from a fixed design and satisfy X'X/n → M, where M is a positive definite matrix. Then
  n^{1/2} (s² − σ²) →^D N( 0, var[u² − σ²] ).
• Proof. Note that
  var( X'u/n^{1/2} ) = σ² X'X/n,
which stays bounded by assumption, so that X'u/n^{1/2} = O_p(1). Therefore the second term in s² is O_p(n^{-1}). Furthermore, u'u/n converges in probability to σ² by the law of large numbers. Therefore,
  s² = [1 + o_p(1)] σ² − (1/(n − k)) O_p(1) = σ² + o_p(1).
• What about the asymptotic distribution of s²?
  n^{1/2} (s² − σ²) = [1 + o_p(1)] n^{-1/2} Σ_{i=1}^n (u_i² − σ²) − (n^{1/2}/(n − k)) O_p(1)
                    = n^{-1/2} Σ_{i=1}^n (u_i² − σ²) + o_p(1)
                    →^D N( 0, var[u² − σ²] ),
provided the second moment of (u_i² − σ²) exists, which it does under our assumptions. When the errors are normally distributed, var[u² − σ²] = 2σ⁴.
• Now what about the t statistic? Under H₀: c'β = 0,
  t = n^{1/2} c'β̂ / ( s (c'(X'X/n)^{-1} c)^{1/2} )
    = n^{1/2} c'β̂ / ( σ (c'M^{-1}c)^{1/2} ) + o_p(1)
    →^D N(0, σ² c'M^{-1}c) / ( σ (c'M^{-1}c)^{1/2} ) ≡ N(0, 1).
• As for the Wald statistic,
  W = n (Rβ̂ − r)' [ s² R (X'X/n)^{-1} R' ]^{-1} (Rβ̂ − r).
Theorem. Suppose that R is of full rank q, that the u_i are i.i.d. with finite fourth moment, and that the regressors are from a fixed design and satisfy X'X/n → M, where M is a positive definite matrix. Then, under H₀: Rβ = r,
  W →^D Z' [σ² R M^{-1} R']^{-1} Z = χ²_q, where Z ~ N(0, σ² R M^{-1} R').
5.8
The delta method
• Theorem. Suppose that
  n^{1/2} (θ̂ − θ) →^D N(0, Σ)
and that f is a continuously differentiable function. Then
  n^{1/2} ( f(θ̂) − f(θ) ) →^D N( 0, (∂f/∂θ)' Σ (∂f/∂θ) ).
• Proof (scalar case). By the mean value theorem,
  f(θ̂) = f(θ) + (θ̂ − θ) f'(θ*),
i.e.,
  n^{1/2} ( f(θ̂) − f(θ) ) = f'(θ*) · n^{1/2} (θ̂ − θ).
• Furthermore, θ̂ →^P θ implies θ* →^P θ (since θ* lies between θ̂ and θ), which implies
  f'(θ*) →^P f'(θ),
where 0 ≠ |f'(θ)| < ∞.
• Therefore,
  n^{1/2} ( f(θ̂) − f(θ) ) = [ f'(θ) + o_p(1) ] n^{1/2} (θ̂ − θ),
and the result now follows.
• Example 1. Let f(β) = e^β (scalar); what is the distribution of e^{β̂}?
  n^{1/2} ( e^{β̂} − e^β ) = e^β n^{1/2} (β̂ − β) + o_p(1) →^D N( 0, e^{2β} σ² M^{-1} ).
• Example 2. Suppose that
  y = β₁ + β₂x₂ + β₃x₃ + u.
What about β̂₂/β̂₃? We have
  n^{1/2} ( β̂₂/β̂₃ − β₂/β₃ ) →^D N( 0, σ² (∂f/∂β)' M^{-1} (∂f/∂β) ),
where
  ∂f/∂β = ( 0 , 1/β₃ , −β₂/β₃² )',
so that the limiting variance is
  σ² ( 1/β₃ , −β₂/β₃² ) [ M^{22}  M^{23} ; M^{32}  M^{33} ] ( 1/β₃ , −β₂/β₃² )',
where M^{jk} denotes the (j, k) element of M^{-1}.
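Example 2 can be implemented numerically. A sketch with simulated data (the coefficient values are illustrative):

```python
import numpy as np

# Sketch (simulated data): delta-method standard error for the ratio b2/b3 in
# y = b1 + b2*x2 + b3*x3 + u, using the gradient (0, 1/b3, -b2/b3**2)'.
rng = np.random.default_rng(2)
n = 500
x2, x3 = rng.normal(size=n), rng.normal(size=n)
b1, b2, b3 = 1.0, 2.0, -1.0
y = b1 + b2 * x2 + b3 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])
bhat = np.linalg.lstsq(X, y, rcond=None)[0]
s2 = np.sum((y - X @ bhat)**2) / (n - 3)
V = s2 * np.linalg.inv(X.T @ X)            # estimated var(bhat)
grad = np.array([0.0, 1 / bhat[2], -bhat[1] / bhat[2]**2])
ratio = bhat[1] / bhat[2]
se_ratio = np.sqrt(grad @ V @ grad)
print(round(ratio, 2), round(se_ratio, 3))  # estimate near -2, small s.e.
```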
Chapter 6 Errors in Variables

• Measurement error is a widespread problem in practice, since much economic data is poorly measured. This is an important problem that has been investigated a great deal over the years.
• One interpretation of the linear model is that:
- there is some unobservable y* satisfying y* = Xβ;
- we observe y* subject to error, y = y* + ε, where ε is a mean zero stochastic error term satisfying ε ⊥ y* [or, more fundamentally, ε ⊥ X].
• Combining these two equations, y = Xβ + ε, where ε has the properties of the usual linear regression error term. It is clear that we treat X and y asymmetrically; X is assumed to have been measured perfectly.
• What about assuming instead that y = X*β + ε, where X = X* + U? We might assume that X is stochastic but X* is fixed, or that both are random. The usual strong assumption is that U, ε ⊥ X* in any case, and that U and ε are mutually independent. Clearly a variety of assumptions can be made here, and the results depend critically on what is assumed.
• Together these equations imply that
  y = Xβ + ε − Uβ = Xβ + v, where v = ε − Uβ
is correlated with X, because X depends on U and v depends on U.
• In this case, the least squares estimator has an obvious bias. We have
  β̂ = (X'X)^{-1} X'y = β + (X'X)^{-1} X'v = β + (X'X)^{-1} X'ε − (X'X)^{-1} X'U β.
Take expectations [note that X is now a random variable, although X* may not be]:
  E(β̂) = β − E{ (X'X)^{-1} X' E[U|X] } β = β − E{ (X'X)^{-1} X' [X − X*] } β.
In general this is not equal to β, but it is difficult to calculate the bias exactly. Instead, it is better to work with an asymptotic approximation and obtain an asymptotic bias.
• The denominator of β̂ satisfies
  X'X/n = X*'X*/n + 2 X*'U/n + U'U/n.
• We shall suppose that
  X*'X*/n →^P Q*,  X*'U/n →^P 0,  U'ε/n →^P 0,  X*'ε/n →^P 0,  U'U/n →^P Σ_UU,
which would be justified by the law of large numbers under some assumptions on U, ε, X*. Therefore,
  X'X/n →^P Q* + Σ_UU.
• The numerator of β̂ satisfies
  X'ε/n →^P 0
and
  X'U/n = X*'U/n + U'U/n →^P Σ_UU
by similar reasoning.
• Therefore,
  β̂ →^P β − [Q* + Σ_UU]^{-1} Σ_UU β = [Q* + Σ_UU]^{-1} Q* β ≡ Cβ.
• In the scalar case, with q = Q*,
  C = q/(q + σ²_u) = 1/(1 + σ²_u/q),
where σ²_u/q is the noise-to-signal ratio:
- when noise/signal = 0, β̂ is unbiased;
- as noise/signal increases, |bias| increases and β̂ shrinks towards zero.
• In the vector case,
  ‖ plim_{n→∞} β̂ ‖ ≤ ‖β‖,
but it is not necessarily the case that each element is shrunk towards zero.
• Suppose that K > 1, but only one regressor is measured with error, i.e.,
  Σ_UU = [ σ²_u  0 ; 0  0 ].
In this case, all elements of β̂ are biased; the coefficient estimate on the mismeasured regressor is shrunk towards zero.
• The downward bias result is specific to the strong-assumptions case. For example, suppose that (X_i*, U_i, ε_i) are normally distributed but that U_i and ε_i are mutually correlated with covariance σ_{uε}. Then
  β̂ →^P β q/(q + σ²_u) + σ_{uε}/(q + σ²_u),
and if σ_{uε} is large enough, the bias can even be upward.
• If X* is trending, then measurement error may produce no bias. For example, suppose that x*_t = t and x_t = x*_t + U_t. Now
  X*'X* = Σ_{t=1}^T t² = O(T³), while U'U = Σ_{t=1}^T U_t² = O_p(T).
Therefore,
  X'X/T³ →^P lim_{T→∞} X*'X*/T³ = 1/3 > 0,
while
  X'U/T³ = X*'U/T³ + U'U/T³ →^P 0 and X'ε/T³ →^P 0.
Therefore,
  β̂ →^P β.
This is because the signal here is very strong and swamps the noise.
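The attenuation factor C = 1/(1 + σ²_u/q) is easy to reproduce by simulation. A sketch with a scalar regressor and illustrative parameter values:

```python
import numpy as np

# Sketch: simulate classical measurement error and compare the OLS slope with
# the attenuation prediction C*beta, C = 1/(1 + var_u/q). No intercept needed
# since everything is mean zero.
rng = np.random.default_rng(3)
n = 200_000
q, var_u, beta = 1.0, 0.5, 2.0
x_star = rng.normal(scale=np.sqrt(q), size=n)
x = x_star + rng.normal(scale=np.sqrt(var_u), size=n)  # mismeasured regressor
y = beta * x_star + rng.normal(size=n)

b_ols = (x @ y) / (x @ x)
C = 1 / (1 + var_u / q)
print(round(b_ols, 2), round(C * beta, 2))  # both near C*beta, not beta
```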
6.1
Solutions to EIV
• Assume knowledge of the signal-to-noise ratio q/σ²_u and adjust β̂ appropriately. This is hard to justify nowadays, because we are rarely willing to specify this information.
• Orthogonal regression.
• Instrumental variables. Let Z (n × k) be instruments; that is, we have
  Z'X/n →^P Q_ZX (nonsingular) and Z'v/n →^P 0,
or equivalently Z'ε/n →^P 0 and Z'U/n →^P 0.
• Then define the instrumental variables estimator (IVE)
  β̂_IV = (Z'X)^{-1} Z'y.
• We have
  β̂_IV = β + (Z'X/n)^{-1} Z'v/n →^P β,
using the above assumptions.
• Suppose that the v_i are i.i.d. with mean zero and variance σ²_v, and that in fact
  Z'v/n^{1/2} →^D N(0, σ²_v Q_ZZ).
Then we can conclude that
  n^{1/2} (β̂_IV − β) = (Z'X/n)^{-1} Z'v/n^{1/2} →^D N( 0, σ²_v Q_ZX^{-1} Q_ZZ Q_ZX^{-1}' ).
• Where do the instruments come from?
- Suppose that measurement error affects the cardinal outcome but not ordinality, i.e., x_i < x_j ⇔ x*_i < x*_j. Then take as z_i the rank of x_i.
- A slightly weaker restriction is to suppose that measurement error does not affect whether a variable is below or above the median, although it could affect other ranks. In this case,
  z_i = 1 if x_i > median(x), z_i = 0 if x_i < median(x)
would be the natural instrument.
- Method of grouping, Wald (1940). The estimator is ȳ₁/x̄₁, a ratio of group means.
- Time series examples: the z are lagged variables.
- Specific examples: month of birth.
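The median-split instrument can be tried out in a simulation. A sketch in which, as assumed above, the measurement error does not affect the ranking used to build z (so the instrument is computed from the true regressor):

```python
import numpy as np

# Sketch: IV for the errors-in-variables model with a median-split instrument.
# Here z records whether the *true* x is above its median, so z is independent
# of the measurement error; all parameter values are illustrative.
rng = np.random.default_rng(4)
n = 200_000
beta = 2.0
x_star = rng.normal(size=n)
x = x_star + rng.normal(scale=0.7, size=n)       # classical measurement error
y = beta * x_star + rng.normal(size=n)
z = (x_star > np.median(x_star)).astype(float)   # assumed error-free ranking

b_ols = np.cov(x, y)[0, 1] / np.var(x)
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(round(b_ols, 2), round(b_iv, 2))  # OLS attenuated, IV near beta
```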
6.2
Other Types of Measurement Error
• Discrete covariates. Suppose that the covariate is discrete; then the above model of measurement error is logically impossible. Suppose instead that
  X_i = X*_i with probability π, X_i = 1 − X*_i with probability 1 − π.
We can write this as X_i = X*_i + U_i, but U_i is not independent of X*_i.
• Magic numbers. Suppose that there is rounding of numbers, so that X*_i is continuous while X_i is the closest integer to X*_i.
6.3
Durbin-Wu-Hausman Test
• We now consider a well-known test for the presence of measurement error, called the Durbin-Wu-Hausman test. Actually, the test is applicable more generally.
• Suppose that our null hypothesis is H₀: no measurement error. This is equivalent to σ²_U = 0, which may be a difficult test to contemplate by our existing methods.
• Instead, consider the test statistic
  H = (β̂_OLS − β̂_IV)' V̂^{-1} (β̂_OLS − β̂_IV),
and reject the null hypothesis for large values of this statistic.
- The idea is that β̂_OLS and β̂_IV are both consistent under H₀, but under H_A, β̂_OLS is inconsistent. Therefore, there should be a discrepancy that can be picked up under the alternative.
• What is the null asymptotic variance? We have
  β̂_OLS − β̂_IV = { (X'X)^{-1} X' − (Z'X)^{-1} Z' } v ≡ Av,
with variance V = σ²_v AA'.
• In fact, AA' simplifies:
  AA' = (Z'X)^{-1} Z'Z (X'Z)^{-1} − (Z'X)^{-1} Z'X (X'X)^{-1} − (X'X)^{-1} X'Z (X'Z)^{-1} + (X'X)^{-1}
      = (Z'X)^{-1} Z'Z (X'Z)^{-1} − (X'X)^{-1} ≥ 0,
where the inequality follows by the Gauss-Markov theorem.
84
CHAPTER 6. ERRORS IN VARIABLES
• So we use
b ©
ª X b b
V = s2' (Z 0 X )!1 Z 0 Z (Z 0X )!1 " (X 0 X )!1 = s2' (Z 0 X )!1 Z 0 M X Z (X 0 Z )!1 , where
2
s' =
1 n"k
where
n
/ 2i ,
i=1
/ i = y i " X ! IV .
• Thus
b
b
2 0 0 V !1 = s! ' X Z (Z M X Z )
• Under H 0 ,
!1
D
H "( .2K ,
and the rule is to reject for large values of H .
Z 0 X.
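A sketch of the Hausman statistic on simulated data with a valid instrument and no measurement error (so H₀ holds and H should look like a single chi-squared draw); all design choices here are illustrative:

```python
import numpy as np

# Sketch of the Hausman statistic H = d' Vhat^{-1} d with d = b_OLS - b_IV and
# Vhat = s_v^2 [ (Z'X)^{-1} Z'Z (X'Z)^{-1} - (X'X)^{-1} ], one regressor.
rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=(n, 1))
z = x + rng.normal(size=(n, 1))          # a valid (if noisy) instrument
y = x @ np.array([1.5]) + rng.normal(size=n)

b_ols = np.linalg.solve(x.T @ x, x.T @ y)
b_iv = np.linalg.solve(z.T @ x, z.T @ y)
v = y - x @ b_iv
s2 = (v @ v) / (n - 1)
ZX_inv = np.linalg.inv(z.T @ x)
V = s2 * (ZX_inv @ (z.T @ z) @ ZX_inv.T - np.linalg.inv(x.T @ x))
d = b_ols - b_iv
H = float(d @ np.linalg.solve(V, d))
print(round(H, 2))  # compare with a chi-squared(1) critical value, e.g. 3.84
```

The bracketed matrix is positive semi-definite in finite samples (it equals AA'), so H is non-negative by construction.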
Chapter 7 Heteroskedasticity

• We made the assumption that Var(y) = σ²I in the context of the linear regression model. This contains two material parts:
- the off-diagonals are zero (independence), and
- the diagonals are the same.
• Here we extend to the case where
  Var(y) = diag(σ²₁, ..., σ²_n) ≡ Ω,
i.e., the data are heterogeneous.
• We look at the effects of this on estimation and testing inside the linear (and nonlinear) regression model, where E(y) = Xβ. In practice, many data are heterogeneous.
7.1
Effects of Heteroskedasticity
• Consider the OLS estimator
  β̂ = (X'X)^{-1} X'y.
• In the new circumstances, this is unbiased, because
  E(β̂) = β for all β.
• However,
  Var(β̂) = (X'X)^{-1} X'ΩX (X'X)^{-1} = ( Σ_{i=1}^n x_i x_i' )^{-1} ( Σ_{i=1}^n x_i x_i' σ²_i ) ( Σ_{i=1}^n x_i x_i' )^{-1} ≠ σ² (X'X)^{-1}.
• As the sample size increases,
  Var(β̂) → 0,
so the OLS estimator is still consistent.
• The main problem, then, is with the variance:
- least squares standard errors are estimating the wrong quantity. We have
  s² = (1/(n − k)) û'û = (1/n) Σ_{i=1}^n û_i² + o_p(1) →^P σ̄² ≡ lim_{n→∞} (1/n) Σ_{i=1}^n σ²_i,
but
  (1/n) Σ_{i=1}^n x_i x_i' σ²_i ≠ σ̄² · (1/n) Σ_{i=1}^n x_i x_i'
in general.
• OLS is inefficient. Why? Transform:
  y* = Ω^{-1/2} y = Ω^{-1/2} X β + Ω^{-1/2} u = X* β + u*,
where the u* are homogeneous. Therefore,
  β̂* = (X*'X*)^{-1} X*'y* = (X'Ω^{-1}X)^{-1} X'Ω^{-1}y
is efficient by Gauss-Markov. So
  β̂_GLS = (X'Ω^{-1}X)^{-1} X'Ω^{-1}y
is the efficient estimator here; this is not equal to
  β̂_OLS = (X'X)^{-1} X'y
unless Ω = σ²I (or some more complicated conditions are satisfied).
• One can show directly that
  (X'Ω^{-1}X)^{-1} ≤ (X'X)^{-1} X'ΩX (X'X)^{-1}.
In some special cases OLS = GLS, but in general they are different. What to do?
7.2
Plan A: Eicker-White
• Use OLS but correct the standard errors: accept the inefficiency, but have correct tests, etc.
• How do we do this? We can't estimate σ²_i, i = 1, ..., n, because there are n of them. However, this is not necessary; instead, we must estimate the sample average (1/n) Σ x_i x_i' σ²_i. We estimate
  V = ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i x_i' σ²_i ) ( (1/n) Σ x_i x_i' )^{-1}
by
  V̂ = ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i x_i' û_i² ) ( (1/n) Σ x_i x_i' )^{-1}.
Then, under regularity conditions,
  (1/n) Σ x_i x_i' (û_i² − σ²_i) →^P 0,
which shows that
  V̂ − V →^P 0.
• Typically one finds that White's standard errors [obtained from the diagonal elements of V̂/n] are larger than the OLS standard errors.
• Finally, one can construct test statistics which are robust to heteroskedasticity; thus
  n (Rβ̂ − r)' [ R V̂ R' ]^{-1} (Rβ̂ − r) →^D χ²_J.
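A sketch of the Eicker-White sandwich estimator on simulated heteroskedastic data (the variance design is illustrative):

```python
import numpy as np

# Sketch: heteroskedasticity-robust (Eicker-White, HC0) standard errors versus
# the usual OLS formula, with var(u_i) proportional to x_i^2 by construction.
rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=n) * np.abs(x)          # heteroskedastic errors
y = 1.0 + 2.0 * x + u

bhat = np.linalg.lstsq(X, y, rcond=None)[0]
uhat = y - X @ bhat
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * uhat[:, None]**2)         # sum of x_i x_i' * uhat_i^2
V_white = XtX_inv @ meat @ XtX_inv          # sandwich estimator
s2 = uhat @ uhat / (n - 2)
V_ols = s2 * XtX_inv
se_white = np.sqrt(np.diag(V_white))
se_ols = np.sqrt(np.diag(V_ols))
print(se_ols.round(4), se_white.round(4))   # robust slope s.e. larger here
```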
7.3
Plan B: Model Heteroskedasticity
• Sometimes models are suggested by the data. Suppose the original observations are by individual, but we then aggregate up to the household level. Homogeneity at the individual level implies heterogeneity at the household level, i.e., with
  u_i = (1/n_i) Σ_{j=1}^{n_i} u_{ij},
we have
  Var(u_i) = (1/n_i) Var(u_{ij}) = σ²/n_i.
Here, the variance is inversely proportional to household size. This is an easy case, since apart from a single constant σ², each σ²_i is known.
• General strategy. Suppose that
  Ω = Ω(θ);
for example, σ²_i = e^{γ'x_i} or σ²_i = γx_i² for some parameters collected in θ.
• Suppose we have a normal error and collect the parameters as (β, θ). Then
  L(β, θ) = −(1/2) ln|Ω(θ)| − (1/2) (y − Xβ)' Ω(θ)^{-1} (y − Xβ)
          = −(1/2) Σ_{i=1}^n ln σ²_i(θ) − (1/2) Σ_{i=1}^n (y_i − x_i'β)² σ_i^{-2}(θ).
In this case,
  ∂L/∂β = Σ_{i=1}^n (y_i − x_i'β) x_i / σ²_i(θ)
  ∂L/∂θ = (1/2) Σ_{i=1}^n (∂ ln σ²_i/∂θ)(θ) [ (y_i − x_i'β)² / σ²_i(θ) − 1 ].
• The estimators (β̂_MLE, θ̂_MLE) solve this pair of equations, which are nonlinear in general.
• Note that the equation for β is conditionally linear; that is, suppose that we have a solution θ̂_MLE. Then
  β̂_MLE = ( Σ_{i=1}^n x_i x_i' σ_i^{-2}(θ̂_MLE) )^{-1} Σ_{i=1}^n x_i y_i σ_i^{-2}(θ̂_MLE).
Iterate: start with β̂_OLS, which is consistent; this gives us an estimate of θ, which we then use in the GLS definition. See below for a proper treatment of nonlinear estimators.
• Example. Suppose that
  σ²_i = 1/(θ x_i'x_i)
for some positive constant θ. In this case,
  ∂L/∂θ = n/(2θ) − (1/2) Σ_{i=1}^n û_i²(β) x_i'x_i.
Therefore, we have the closed form solution
  θ̂ = 1 / ( (1/n) Σ_{i=1}^n û_i² x_i'x_i ), where û_i = y_i − x_i'β̂_MLE.
7.4
Properties of the Procedure
• Firstly, under general conditions not requiring y to be normally distributed,
  n^{1/2} ( (β̂_MLE − β)' , (θ̂_MLE − θ)' )' →^D N(0, Φ)
for some Φ.
• If y is normal, then Φ = I^{-1}, the inverse of the information matrix
  I = [ lim_{n→∞} n^{-1} X'Ω^{-1}X   0 ; 0   ? ].
In this case, β̂_MLE is asymptotically equivalent to
  β̂_GLS = (X'Ω^{-1}X)^{-1} X'Ω^{-1}y.
We say that β̂_ML is asymptotically Gauss-Markov efficient, BLAUE.
• Often people use ad hoc estimates θ̂_AH of θ and construct
  β̂_FGLS = ( X'Ω^{-1}(θ̂_AH) X )^{-1} X'Ω^{-1}(θ̂_AH) y.
Provided θ̂_AH →^P θ and some additional conditions hold, this procedure is also asymptotically equivalent to β̂_GLS.
7.5
Testing for Heteroskedasticity
• The likelihood framework has been widely employed to suggest tests of heteroskedasticity. Suppose that
  σ²_i(γ) = α e^{γ x_i},
and test H₀: γ = 0 vs. H₁: γ ≠ 0.
• The LM tests are simplest to implement here, because we only have to estimate under the homoskedastic null. We have
  ∂L/∂γ = (1/2) Σ_{i=1}^n x_i ( û_i²/α̂ − 1 ).
Under normality,
  Var( u_i²/α − 1 ) = 2.
Therefore,
  LM = (1/2) ( Σ_{i=1}^n (û_i²/α̂ − 1) x_i )' ( Σ_{i=1}^n x_i x_i' )^{-1} ( Σ_{i=1}^n (û_i²/α̂ − 1) x_i ),
where
  α̂ = (1/n) Σ_{i=1}^n û_i²,
and the û_i are the OLS residuals from the restricted regression.
• Under H₀,
  LM →^D χ²₁.
Reject for large values of LM.
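The LM statistic above can be computed in a few lines. A sketch for a scalar variance covariate under the null (the design is illustrative):

```python
import numpy as np

# Sketch: the LM statistic for heteroskedasticity with a scalar variance
# covariate x, computed under H0 (homoskedastic simulated errors), so the
# statistic should be an ordinary chi-squared(1)-like draw.
rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)      # H0 true

uhat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
alpha = np.mean(uhat**2)
g = uhat**2 / alpha - 1.0                   # score components
score = x @ g                               # sum_i (uhat_i^2/alpha - 1) x_i
LM = 0.5 * score**2 / (x @ x)
print(round(LM, 2))  # compare with the chi-squared(1) critical value 3.84
```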
Chapter 8 Nonlinear Regression Models

• Suppose that
  y_i = g(x_i, β) + ε_i, i = 1, 2, ..., n,
where the ε_i are i.i.d. mean zero with variance σ².
• In this case, how do we estimate β? The main criterion we shall consider is nonlinear least squares (NLLS), which is of course the MLE when y ~ N(g, σ²I). In this case, one chooses β to minimize
  S_n(β) = (1/n) Σ_{i=1}^n [ y_i − g(x_i, β) ]²
over some parameter set B. Let
  β̂_NLLS = argmin_{β ∈ B} S_n(β).
• If B is compact and g is continuous, then the minimizer exists, but it is not necessarily unique. More generally, one cannot even guarantee existence of a solution.
• We usually try to solve a first order condition, which would be appropriate for finding interior minima in differentiable cases. In general, the first order conditions do not have a closed form solution. If there are multiple solutions to the first order condition, then one can end up with different answers depending on the way the algorithm is implemented. Statistical properties are also an issue: β̂_NLLS is a nonlinear function of y, so we cannot easily calculate its mean and variance.
• If S_n is globally convex, then there exists a unique minimum for all n, regardless of the parameter space. Linear regression has a globally convex criterion (it is a quadratic function). Some nonlinear models are also known to have this property.
8.1
Computation
• In one dimension with a bounded parameter space B, the method of line search is effective. This involves dividing B into a grid of, perhaps equally spaced, points, computing the criterion at each point, and then settling on the minimum. There can be further refinements: you further subdivide the grid around the minimum, etc. Unfortunately, this method is not so useful in higher dimensions d because of the 'curse of dimensionality': the number of grid points required to achieve a given accuracy increases exponentially in d.
• 'Concentration' or 'profiling' can sometimes help: some aspects of the problem may be linear, e.g.,
  g(x, λ) = β (x^λ − 1)/λ.
If λ were known, we would estimate β by
  β̂(λ) = [ X(λ)'X(λ) ]^{-1} X(λ)'y, where X(λ) = ( (x₁^λ − 1)/λ , ... , (x_n^λ − 1)/λ )'.
Then write
  S_n( β̂(λ), λ ) = (1/n) Σ_{i=1}^n [ y_i − β̂(λ) (x_i^λ − 1)/λ ]²,
which is the concentrated criterion function. Now find λ̂ to minimize this, e.g., by line search on [0, 1].
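A sketch of the concentration idea for this Box-Cox-type example: an inner closed-form OLS step for β given λ, and an outer line search over λ (simulated data; all values illustrative):

```python
import numpy as np

# Sketch of concentration/profiling for g(x, lam) = beta*(x**lam - 1)/lam:
# inner OLS for beta given lam, outer grid search over lam on (0, 1].
rng = np.random.default_rng(8)
n = 500
x = rng.uniform(0.5, 3.0, size=n)
lam0, beta0 = 0.5, 2.0
y = beta0 * (x**lam0 - 1) / lam0 + 0.1 * rng.normal(size=n)

def profiled_S(lam):
    xl = (x**lam - 1) / lam
    beta = (xl @ y) / (xl @ xl)       # closed-form beta_hat(lam)
    return np.mean((y - beta * xl)**2)

grid = np.linspace(0.01, 1.0, 100)
lam_hat = grid[np.argmin([profiled_S(l) for l in grid])]
print(round(lam_hat, 2))  # close to the true lambda of 0.5
```

The outer problem is one-dimensional regardless of how many linear parameters are concentrated out, which is exactly why profiling sidesteps the curse of dimensionality here.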
• Derivative-based methods. We are trying to find a root of
  ∂S_n(β̂_NLLS)/∂β = 0.
• We can evaluate S_n, ∂S_n/∂β, and ∂²S_n/∂β∂β' for any β. Suppose we take an initial guess β₁ and then modify it: in which direction, and by how far?
- If ∂S_n(β₁)/∂β > 0, then we are to the right of the minimum and should move left.
- We fit a line tangent to the curve ∂S_n/∂β at the point β₁ and find where that line intersects zero.
• The tangent at β₁ has slope ∂²S_n(β₁)/∂β² and constant term ∂S_n(β₁)/∂β − ∂²S_n(β₁)/∂β² · β₁.
• Therefore,
  0 = ∂²S_n(β₁)/∂β² · β₂ + ∂S_n(β₁)/∂β − ∂²S_n(β₁)/∂β² · β₁,
which implies that
  β₂ = β₁ − [ ∂²S_n(β₁)/∂β² ]^{-1} ∂S_n(β₁)/∂β.
Repeat until convergence. This is Newton's method.
• In practice, stopping criteria of the form
  |β_{r+1} − β_r| < ε or |S_n(β_{r+1}) − S_n(β_r)| < ε,
for some small tolerance ε, are used to terminate the algorithm.
• In k dimensions,
  β₂ = β₁ − [ ∂²S_n(β₁)/∂β∂β' ]^{-1} ∂S_n(β₁)/∂β.
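Newton's method for a one-parameter NLLS problem can be sketched as follows (the model g(x, β) = e^{βx} and all tuning values are illustrative):

```python
import numpy as np

# Sketch: Newton's method for one-parameter NLLS with g(x, b) = exp(b*x),
# using analytic first and second derivatives of S_n(b).
rng = np.random.default_rng(9)
n = 400
x = rng.uniform(-1, 1, size=n)
b0 = 0.7
y = np.exp(b0 * x) + 0.05 * rng.normal(size=n)

def grad_hess(b):
    g = np.exp(b * x)
    r = y - g                    # residuals
    dg = x * g                   # dg/db
    d2g = x**2 * g               # d2g/db2
    grad = -2 * np.mean(r * dg)
    hess = 2 * np.mean(dg**2) - 2 * np.mean(r * d2g)
    return grad, hess

b = 0.0                          # starting value
for _ in range(25):
    grad, hess = grad_hess(b)
    step = grad / hess
    b -= step
    if abs(step) < 1e-10:        # stopping rule from the text
        break
print(round(b, 2))  # close to the true value 0.7
```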
• There are some modifications of this that sometimes work better. Outer product of the scores (OPE):
  β₂ = β₁ − [ Σ_{i=1}^n (∂S_i/∂β)(β₁) (∂S_i/∂β')(β₁) ]^{-1} Σ_{i=1}^n (∂S_i/∂β)(β₁).
• Variable step length λ:
  β₂(λ) = β₁ − λ [ ∂²S_n(β₁)/∂β∂β' ]^{-1} ∂S_n(β₁)/∂β,
and choose λ to minimize S_n(β₂(λ)).
• There are some issues with all the derivative-based methods:
- If there are multiple local minima, one needs to try different starting values and check that they always converge to the same value.
- When the criterion function is not globally convex, one can have overshooting, and the process may not converge. The variable step length method can improve this.
- If the criterion is flat near the minimum, then the algorithm may take a very long time to converge. The precise outcome depends on which convergence criterion is used; if you use the change in the criterion function, then the chosen parameter value may actually be far from the true minimum.
- If the minimum is at a boundary point, then the derivative-based methods will not converge.
- In some problems, the analytic derivatives are difficult or time consuming to compute, and people substitute numerical derivatives, computed by an approximation. These can raise further problems of stability and accuracy.
8.2
Consistency of NLLS
• Theorem. Suppose that:
(1) the parameter space B is a compact subset of R^K;
(2) S_n(β) is continuous in β for all possible data;
(3) S_n(β) converges in probability to a non-random function S(β) uniformly in β ∈ B, i.e.,
  sup_{β ∈ B} |S_n(β) − S(β)| →^P 0;
(4) the function S(β) is uniquely minimized at β = β₀.
Then β̂ →^P β₀.
• The proof is in Amemiya (1986, Theorem 4.1.1). We just show why (3) and (4) are plausible. Substituting for y_i, we have
  S_n(β) = (1/n) Σ_{i=1}^n [ ε_i + g(x_i, β₀) − g(x_i, β) ]²
         = (1/n) Σ ε_i² + (1/n) Σ [g(x_i, β) − g(x_i, β₀)]² − (2/n) Σ ε_i [g(x_i, β) − g(x_i, β₀)].
• With i.i.d. data, by the law of large numbers,
  (1/n) Σ ε_i² →^P σ²,
  (1/n) Σ ε_i [g(x_i, β) − g(x_i, β₀)] →^P 0 for all β,
and, for all β,
  (1/n) Σ [g(x_i, β) − g(x_i, β₀)]² →^P E( [g(x_i, β) − g(x_i, β₀)]² ).
Therefore,
  S_n(β) →^P σ² + E( [g(x_i, β) − g(x_i, β₀)]² ) ≡ S(β).
• We need the convergence in probability to hold uniformly over a compact set containing β₀ (or over B), which requires a domination condition like sup_{β ∈ B} |S_n(β)| ≤ Y with E(Y) < ∞.
• Now S(β₀) = σ² and S(β) ≥ σ² for all β. So, in the limit, β₀ minimizes S. We need S(β) to be uniquely minimized at β₀ (the identification condition).
• An example where (4) is satisfied is where g is linear, i.e., g(x_i, β) = β'x_i. Then
  S(β) = σ² + (β − β₀)' E[x_i x_i'] (β − β₀),
which is a quadratic function of β. (3) also holds in this case under mild conditions on x_i.
8.3 Asymptotic Distribution of NLLS

• Theorem. Suppose that:

1. $\hat\theta$ is such that
$$\frac{\partial S_n}{\partial\theta}(\hat\theta) = 0$$
and satisfies $\hat\theta \stackrel{P}{\to} \theta_0$, where $\theta_0$ is an interior point of $B$;

2. $\partial^2 S_n(\theta)/\partial\theta\partial\theta'$ exists and is continuous in an open convex neighbourhood of $\theta_0$;

3. $\partial^2 S_n(\theta)/\partial\theta\partial\theta'$ converges in probability to a finite nonsingular matrix $A(\theta)$ uniformly in $\theta$ over any shrinking neighbourhood of $\theta_0$;

4. for some finite matrix $B$,
$$n^{1/2}\frac{\partial S_n}{\partial\theta}(\theta_0) \stackrel{D}{\to} N(0, B).$$

- Then
$$n^{1/2}(\hat\theta - \theta_0) \stackrel{D}{\to} N(0, V), \quad\text{where } V = A^{-1}BA^{-1} \text{ and } A = A(\theta_0).$$

• Proof. We have
$$0 = n^{1/2}\frac{\partial S_n}{\partial\theta}(\hat\theta) = n^{1/2}\frac{\partial S_n}{\partial\theta}(\theta_0) + \frac{\partial^2 S_n}{\partial\theta\partial\theta'}(\theta^*)\,n^{1/2}(\hat\theta - \theta_0),$$
where $\theta^*$ lies between $\theta_0$ and $\hat\theta$ by the multivariate mean value theorem. Applying assumptions (1)-(3), we get
$$n^{1/2}(\hat\theta - \theta_0) = -A^{-1}n^{1/2}\frac{\partial S_n}{\partial\theta}(\theta_0) + o_P(1).$$
Finally, applying assumption (4) gives the desired result.

• We now investigate the sort of conditions needed to satisfy the assumptions of the theorem. In our case,
$$\frac{\partial S_n}{\partial\theta}(\theta_0) = -\frac{2}{n}\sum_{i=1}^n\left[y_i - g(x_i,\theta_0)\right]\frac{\partial g}{\partial\theta}(x_i,\theta_0) = -\frac{2}{n}\sum_{i=1}^n\varepsilon_i\frac{\partial g}{\partial\theta}(x_i,\theta_0).$$

• Suppose that $(x_i,\varepsilon_i)$ are i.i.d. with $E(\varepsilon_i|x_i) = 0$ with probability one. In this case, provided
$$E\left[\left\|\varepsilon_i^2\frac{\partial g}{\partial\theta}(x_i,\theta_0)\frac{\partial g}{\partial\theta'}(x_i,\theta_0)\right\|\right] < \infty,$$
we can apply the standard central limit theorem to obtain
$$n^{1/2}\frac{\partial S_n}{\partial\theta}(\theta_0) \stackrel{D}{\to} N\left(0,\, 4E\left[\varepsilon_i^2\frac{\partial g}{\partial\theta}(x_i,\theta_0)\frac{\partial g}{\partial\theta'}(x_i,\theta_0)\right]\right).$$
• What about (3)?
$$\frac{\partial^2 S_n}{\partial\theta\partial\theta'}(\theta) = -\frac{2}{n}\sum_{i=1}^n\left[y_i - g(x_i,\theta)\right]\frac{\partial^2 g}{\partial\theta\partial\theta'}(x_i,\theta) + \frac{2}{n}\sum_{i=1}^n\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\theta),$$
which in the special case $\theta = \theta_0$ is
$$\frac{\partial^2 S_n}{\partial\theta\partial\theta'}(\theta_0) = -\frac{2}{n}\sum_{i=1}^n\varepsilon_i\frac{\partial^2 g}{\partial\theta\partial\theta'}(x_i,\theta_0) + \frac{2}{n}\sum_{i=1}^n\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\theta_0).$$

• Provided
$$E\left[\left\|\varepsilon_i\frac{\partial^2 g}{\partial\theta\partial\theta'}(x_i,\theta_0)\right\|\right] < \infty \quad\text{and}\quad E\left[\left\|\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\theta_0)\right\|\right] < \infty,$$
we can apply the law of large numbers to obtain
$$\frac{1}{n}\sum_{i=1}^n\varepsilon_i\frac{\partial^2 g}{\partial\theta\partial\theta'}(x_i,\theta_0) \stackrel{P}{\to} 0, \qquad \frac{1}{n}\sum_{i=1}^n\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\theta_0) \stackrel{P}{\to} E\left[\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\theta_0)\right].$$
• These conditions need to be strengthened a little to obtain uniformity over the neighbourhood of $\theta_0$. For example, suppose that we have additional smoothness and
$$\frac{\partial^2 g}{\partial\theta^2}(x_i,\theta^*) = \frac{\partial^2 g}{\partial\theta^2}(x_i,\theta_0) + (\theta^* - \theta_0)\frac{\partial^3 g}{\partial\theta^3}(x_i,\theta^{**})$$
for some intermediate point $\theta^{**}$. Then, provided
$$\sup_{\theta\in B}\left\|\frac{\partial^3 g}{\partial\theta^3}(x,\theta^{**})\right\| \le D(x)$$
for some function $D$ for which $ED(X) < \infty$, the uniform convergence in condition (3) will be satisfied.
• Similar results can be shown in the fixed design case, but we need to use the CLT and LLN for weighted sums of i.i.d. random variables.

• Note that when the $\varepsilon_i$ are i.i.d. and independent of $x_i$, we have
$$B = 4\sigma^2E\left[\frac{\partial g}{\partial\theta}(x_i,\theta_0)\frac{\partial g}{\partial\theta'}(x_i,\theta_0)\right],$$
the sandwich collapses, and the asymptotic distribution is
$$n^{1/2}(\hat\theta - \theta_0) \stackrel{D}{\to} N\left(0,\, \sigma^2\left(E\left[\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\theta_0)\right]\right)^{-1}\right).$$

• Standard errors. Let
$$\hat A = \frac{1}{n}\sum_{i=1}^n\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\hat\theta), \qquad \hat B = \frac{1}{n}\sum_{i=1}^n\frac{\partial g}{\partial\theta}\frac{\partial g}{\partial\theta'}(x_i,\hat\theta)\,\hat\varepsilon_i^2,$$
where $\hat\varepsilon_i = y_i - g(x_i,\hat\theta)$. Then $\hat V = \hat A^{-1}\hat B\hat A^{-1} \stackrel{P}{\to} V$.
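The standard-error construction can be sketched numerically. The model $g(x,\theta) = \exp(\theta x)$, the grid minimizer, and all the numbers below are hypothetical illustrations; in the scalar case the sandwich $\hat A^{-1}\hat B\hat A^{-1}$ reduces to $\hat B/\hat A^2$.

```python
import numpy as np

# Sketch of NLLS sandwich standard errors se = sqrt(A^{-1} B A^{-1} / n)
# for the hypothetical model g(x, theta) = exp(theta * x).
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0.0, 1.0, n)
theta0 = 0.7
y = np.exp(theta0 * x) + 0.1 * rng.standard_normal(n)

# crude grid minimizer of S_n(theta) = (1/n) sum (y_i - g(x_i, theta))^2
grid = np.linspace(0.0, 1.5, 3001)
Sn = [np.mean((y - np.exp(t * x)) ** 2) for t in grid]
theta_hat = grid[int(np.argmin(Sn))]

dg = x * np.exp(theta_hat * x)          # dg/dtheta evaluated at theta-hat
ehat = y - np.exp(theta_hat * x)        # NLLS residuals
A_hat = np.mean(dg * dg)                # (1/n) sum dg dg'
B_hat = np.mean(dg * dg * ehat ** 2)    # (1/n) sum dg dg' e-hat^2
V_hat = B_hat / A_hat ** 2              # A^{-1} B A^{-1} in the scalar case
se = np.sqrt(V_hat / n)                 # standard error of theta-hat
print(theta_hat, se)
```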
8.4 Likelihood and Efficiency

• These results generalize to the likelihood framework for i.i.d. data:
$$\ell(\text{data},\theta) = \sum_{i=1}^n\ell_i(\theta).$$
Let $\hat\theta$ maximize $\ell(\text{data},\theta)$.

• Then, under regularity conditions,
$$\hat\theta \stackrel{P}{\to} \theta_0 \quad\text{and}\quad n^{1/2}(\hat\theta - \theta_0) \stackrel{D}{\to} N\left(0, I^{-1}(\theta_0)\right),$$
where the information matrix is
$$I(\theta_0) = E\left[\frac{\partial\ell_i}{\partial\theta}(\theta_0)\frac{\partial\ell_i}{\partial\theta'}(\theta_0)\right] = -E\left[\frac{\partial^2\ell_i}{\partial\theta\partial\theta'}(\theta_0)\right].$$
This last equation is called the information matrix equality.

• Asymptotic Cramér-Rao Theorem. The MLE is asymptotically "efficient" amongst the class of all asymptotically normal estimates (stronger than Gauss-Markov).
Chapter 9

Generalized Method of Moments

• We suppose that there is i.i.d. data $\{Z_i\}_{i=1}^n$ from some population.

• It is known that there exists a unique $\theta_0$ such that $E[g(\theta_0, Z_i)] = 0$ for some $q\times 1$ vector of known functions $g(\theta_0,\cdot)$.

- For example, $g$ could be the first order condition from OLS or, more generally, maximum likelihood, e.g., $g(\beta, Z_i) = x_i(y_i - x_i'\beta)$.

- Conditional moment specification. Suppose in fact we know, for some given function $\psi$, that
$$E[\psi(\theta_0, Z_i)|X_i] = 0,$$
where $X_i$ can be a subset of $Z_i$. Then this implies the unconditional moment given above when you take $g(\theta_0, Z_i) = \psi(\theta_0, Z_i)\otimes h(X_i)$ for any function $h$ of the 'instruments' $X_i$. This sort of specification arises a lot in economic models, which is what really motivates this approach.

- The functions $g$ can be nonlinear in $\theta$ and $Z$.

- The distribution of $Z_i$ is unspecified apart from the $q$ moments.

• For any $\theta$, let
$$G_n(\theta) = \frac{1}{n}\sum_{i=1}^n g(\theta, Z_i).$$

• There are several cases:

- $p > q$: the unidentified case;
- $p = q$: the exactly identified case;
- $p < q$: the overidentified case.

• In the exactly identified case, we define our estimator as any solution $\hat\theta$ to the equations $G_n(\hat\theta) = 0$. Since we have $p$ equations in $p$ unknowns, we can expect a solution to exist under some regularity conditions. However, the equations are nonlinear and have to be solved by numerical methods.

• When $p < q$, we cannot simultaneously solve all equations, and the most we can hope to do is to make them close to zero.

• Let $Q_n(\theta) = G_n(\theta)'W_nG_n(\theta)$, where $W_n$ is a $q\times q$ positive definite weighting matrix, for example $W_n = I_{q\times q}$. Then let $\hat\theta_{GMM}$ minimize $Q_n(\theta)$ over $\theta \in \Theta \subset \mathbb{R}^p$.

• This defines a large class of estimators, one for each weighting matrix $W_n$.

• It is generally a nonlinear optimization problem like nonlinear least squares; various techniques are available for finding the minimizer.

• GMM is a general estimation method that includes both OLS and more general MLE as special cases!
• Thus, consider the sample log likelihood
$$\sum_{i=1}^n\ell(Z_i,\theta),$$
where $\exp(\ell)$ is the density function of $Z_i$. The MLE maximizes the log likelihood function, or equivalently finds the parameter value that solves the score equations:
$$\sum_{i=1}^n\frac{\partial\ell}{\partial\theta}(Z_i,\theta) = 0.$$

• This is exactly identified GMM with
$$g(\theta, Z_i) = \frac{\partial\ell}{\partial\theta}(Z_i,\theta).$$

• What is different is really the model specification part, that is, the specification of models through conditional moment restrictions.
9.1 Asymptotic Properties in the iid case

• We now turn to the asymptotic properties. Under some regularity conditions we have
$$\hat\theta_{GMM} \stackrel{P}{\to} \theta_0.$$
Namely, we need the criterion function to converge uniformly to a function that is uniquely minimized by $\theta_0$.

• Under further regularity conditions, we can establish
$$n^{1/2}\left(\hat\theta_{GMM} - \theta_0\right) \stackrel{D}{\to} N\left(0,\, (\Gamma'W\Gamma)^{-1}\Gamma'W\Omega W\Gamma(\Gamma'W\Gamma)^{-1}\right),$$
where:
$$\Omega = \Omega(\theta_0) = \operatorname{Var}\left[n^{1/2}G_n(\theta_0)\right], \qquad \Gamma = \operatorname{plim}_{n\to\infty}\frac{\partial G_n}{\partial\theta}(\theta_0), \qquad W = \operatorname{plim}_{n\to\infty}W_n > 0.$$
• Special case of the exactly identified case: the weights are irrelevant, and
$$n^{1/2}(\hat\theta - \theta_0) \stackrel{D}{\to} N\left(0,\, \Gamma^{-1}\Omega\Gamma^{-1\prime}\right).$$

• What is the optimal choice of $W$ in the overidentified case?

- In fact, $W_n$ should be an estimate of $\Omega^{-1}$.

- In the iid case we take
$$\tilde\Omega = \tilde\Omega(\tilde\theta) = \frac{1}{n}\sum_{i=1}^n g(\tilde\theta, Z_i)g(\tilde\theta, Z_i)',$$
where $\tilde\theta$ is a preliminary estimate of $\theta_0$ obtained using some arbitrary weighting matrix, e.g., $I_q$.

• In sum, then, the full procedure is:

- $\tilde\theta = \arg\min G_n(\theta)'G_n(\theta)$;
- $\hat\theta_{opt} = \hat\theta_{GMM} = \arg\min G_n(\theta)'\tilde\Omega^{-1}G_n(\theta)$.

• The asymptotic distribution is now normal with mean zero and variance $(\Gamma'\Omega^{-1}\Gamma)^{-1}$:
$$n^{1/2}(\hat\theta - \theta_0) \stackrel{D}{\to} N\left(0,\, (\Gamma'\Omega^{-1}\Gamma)^{-1}\right).$$

• This estimator is efficient in the sense that it has minimum asymptotic variance among all GMM estimators.
• We can estimate the asymptotic variance of $\hat\theta$ by
$$\hat V = \left(\hat\Gamma'\hat\Omega^{-1}\hat\Gamma\right)^{-1},$$
where
$$\hat\Omega = \hat\Omega(\hat\theta) = \frac{1}{n}\sum_{i=1}^n g(\hat\theta, Z_i)g(\hat\theta, Z_i)' \quad\text{and}\quad \hat\Gamma = \frac{\partial G_n}{\partial\theta}(\hat\theta)$$
are consistent estimates of $\Omega$ and $\Gamma$.
9.2 Test Statistics

• t-test. Consider the null hypothesis $c'\theta = \gamma$ for some vector $c$ and scalar $\gamma$. Then
$$\frac{n^{1/2}\left[c'\hat\theta - \gamma\right]}{(c'\hat Vc)^{1/2}} \stackrel{D}{\to} N(0, 1)$$
under the null hypothesis. One can do one-sided and two-sided tests.

• Consider the null hypothesis $R\theta = r$, where $r$ is of dimension $m$. Then
$$n(R\hat\theta - r)'\left[R\hat VR'\right]^{-1}(R\hat\theta - r) \stackrel{D}{\to} \chi^2_m.$$

• Reject for large values.
• Nonlinear restrictions. Suppose that
$$n^{1/2}(\hat\theta - \theta) \stackrel{D}{\to} N(0, V(\theta))$$
for some variance $V$.

• By a Taylor series expansion,
$$f(\hat\theta) \simeq f(\theta) + F(\theta)(\hat\theta - \theta), \quad\text{where } F(\theta) = \frac{\partial f(\theta)}{\partial\theta'}.$$

• Therefore,
$$n^{1/2}\left(f(\hat\theta) - f(\theta)\right) \stackrel{D}{\to} N\left(0,\, F(\theta)V(\theta)F(\theta)'\right).$$

• This is called the delta method. If $f$ is linear, then this is obvious.
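The delta method is easy to verify by simulation. Here $\hat\theta$ is a sample mean with $n^{1/2}(\hat\theta-\theta)\to N(0,\sigma^2)$ and $f(\theta)=\theta^2$, so the predicted limiting variance is $F V F' = (2\theta)^2\sigma^2$; the parameter values and replication counts are hypothetical.

```python
import numpy as np

# Numerical check of the delta method for f(theta) = theta^2.
rng = np.random.default_rng(3)
theta, sigma, n, reps = 2.0, 1.0, 400, 10000

# reps independent sample means, each from n i.i.d. N(theta, sigma^2) draws
th_hat = theta + sigma * rng.standard_normal((reps, n)).mean(axis=1)
stat = np.sqrt(n) * (th_hat ** 2 - theta ** 2)   # n^{1/2}(f(theta-hat) - f(theta))

delta_var = (2 * theta) ** 2 * sigma ** 2        # delta-method prediction: 16
print(stat.var(), delta_var)                     # the two should be close
```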
• Application to hypothesis testing. Consider the null hypothesis $f(\theta) = 0$ for some $m$-vector nonlinear function $f$.

• Let $\hat f = f(\hat\theta)$ and $\hat F = F(\hat\theta)$. Then
$$n\hat f'\left[\hat F\hat V\hat F'\right]^{-1}\hat f \stackrel{D}{\to} \chi^2_m$$
under $H_0$.

9.3 Examples

• Linear regression: $y = X\beta + u$ with some error vector $u$.

• Suppose also that it is known that, for some unique $\beta_0$, we have $E[x_iu_i(\beta_0)] = 0$. There are $K$ conditions and $K$ parameters, and this is an exactly identified case.

- In this case, there exists a unique $\hat\beta$, the OLS estimator in fact, that satisfies the empirical conditions
$$\frac{1}{n}X'(y - X\hat\beta) = 0.$$

• Suppose now that
$$E[x_iu_i(\beta_0)] \ne 0,$$
i.e., the errors are correlated with the regressors. This could be because:

- Omitted variables. There are variables in $u$ that should be in $X$.
- The included $X$ variables have been measured with error.
- Simultaneous equations. Consider demand and supply:
$$Q^S = S(P; w, r, t), \qquad Q^D = D(P; P^*, y).$$
In equilibrium, $Q^S = Q^D$ determines $Q, P$ given $w, r, t, P^*$, and $y$. The econometric model is
$$\ln Q = \alpha + \beta\ln P + \gamma w + \delta r + \eta t + e \quad\text{(supply)},$$
$$\ln Q = \alpha' + \beta'\ln P + \mu P^* + \nu y + u \quad\text{(demand)}.$$
The parameters of interest are the price elasticities $\beta, \beta'$, the cross-price effect $\mu$, and the income effect $\nu$. This is a simultaneous system: $P, Q$ are endogenous variables; $w, r, t, P^*$ and $y$ are exogenous variables. Because $P$ and $Q$ are simultaneously determined, we expect
$$\operatorname{Cov}(P, e) \ne 0 \ne \operatorname{Cov}(Q, u):$$
$Q$ depends on $P$ and $P$ depends on $u$, so $Q$ depends on $u$; likewise $P$ depends on $Q$ and $Q$ depends on $e$, so $P$ depends on $e$. Simultaneity means we can't usually use OLS to estimate the parameters.

• Suppose, however, that there exist some instruments $z_i \in \mathbb{R}^J$ such that
$$E[z_iu_i(\beta_0)] = 0. \tag{9.1}$$

• Suppose that there are many instruments, i.e., $J > K$. In this case we can't solve uniquely for $\hat\beta_{IV}$, because there are too many equations, which can't all be satisfied simultaneously.

• Now take
$$G_n(\beta) = \frac{1}{n}Z'(y - X\beta) = \frac{1}{n}\sum_{i=1}^n z_i(y_i - x_i'\beta).$$
• A GMM estimator can be defined as any minimizer of
$$Q_n(\beta) = (y - X\beta)'ZW_nZ'(y - X\beta)$$
for some $J\times J$ weighting matrix $W_n$. What is the estimator?

• We shall suppose that $W_n$ is a symmetric matrix, and define the real symmetric matrix $A = ZW_nZ'$ and its square root $A^{1/2}$. Letting
$$y^* = A^{1/2}y \quad\text{and}\quad X^* = A^{1/2}X,$$
we see that $Q_n(\beta) = (y^* - X^*\beta)'(y^* - X^*\beta)$, with solution
$$\hat\beta_{GMM} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'AX)^{-1}X'Ay = (X'ZW_nZ'X)^{-1}X'ZW_nZ'y.$$

• The question is, what is the best choice of $W_n$? Suppose that $u$ has variance matrix $\sigma^2I$ independent of $Z$, and that $Z$ is a fixed variable. Then
$$\operatorname{var}\left[n^{1/2}G_n(\beta_0)\right] = \operatorname{var}\left[\frac{Z'u}{n^{1/2}}\right] = \sigma^2\frac{Z'Z}{n}.$$
Therefore, the optimal weighting is to take
$$W_n \propto (Z'Z)^{-1},$$
in which case
$$\hat\beta_{GMM} = (X'P_ZX)^{-1}X'P_Zy, \qquad P_Z = Z(Z'Z)^{-1}Z',$$
i.e., it is the two-stage least squares estimator.

• Suppose instead that $u_i$ is heteroskedastic; then the optimal weighting uses
$$W_n = \left[\frac{1}{n}\sum_{i=1}^n z_iz_i'\hat u_i^2\right]^{-1}.$$
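The two-stage least squares formula $(X'P_ZX)^{-1}X'P_Zy$ is easy to check numerically. The data-generating process below, with one endogenous regressor and two instruments ($J = 2 > K = 1$), is a hypothetical illustration.

```python
import numpy as np

# 2SLS versus OLS on simulated data with an endogenous regressor.
rng = np.random.default_rng(4)
n = 1000
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
v = rng.standard_normal(n)
u = 0.8 * v + 0.6 * rng.standard_normal(n)   # error correlated with x through v
x = z1 + 0.5 * z2 + v                        # endogenous regressor
beta0 = 1.5
y = beta0 * x + u

X = x.reshape(-1, 1)
Z = np.column_stack([z1, z2])
PZ = Z @ np.linalg.inv(Z.T @ Z) @ Z.T        # projection onto the instruments
beta_2sls = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)[0]
beta_ols = (x @ y) / (x @ x)                 # inconsistent under endogeneity
print(beta_2sls, beta_ols)                   # 2SLS near 1.5; OLS biased upward
```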
9.4 Time Series Case

• We next suppose that the data is stationary and mixing.

• CONDITIONAL MOMENT RESTRICTIONS. We suppose that, for some $m\times 1$ vector of known functions $\psi$, with probability one,
$$E[\psi(\theta_0, Y_t)|\mathcal{F}_t] = 0,$$
where $\theta_0 \in \mathbb{R}^p$ is the true parameter value and $\mathcal{F}_t$ is some information set containing perhaps contemporaneous regressors and lagged variables. Many economic models fall into this framework, for example Euler equations. In finance applications, $\psi$ could be some excess return, and the efficient markets hypothesis guarantees that this is unforecastable given certain sorts of information.

• Examples.

- Static time series regression:
$$y_t = \beta'x_t + \varepsilon_t, \quad\text{where } E(\varepsilon_t|x_t) = 0.$$
In this case, the error term $\varepsilon_t$ can be serially correlated.

- Time series regression:
$$y_t = \gamma y_{t-1} + \varepsilon_t, \quad\text{where } E(\varepsilon_t|y_{t-1}, y_{t-2}, \ldots) = 0.$$
In this case, the error term is serially uncorrelated.

- Same model, but instead suppose only that $E(\varepsilon_t|y_{t-1}) = 0$. This is strictly weaker than the earlier assumption.

- Same model, but instead suppose that $E(\varepsilon_t|x_t, y_{t-1}) = 0$. This is strictly weaker than the earlier assumption.

• Estimation now proceeds by forming some UNCONDITIONAL MOMENT RESTRICTIONS using valid instruments, i.e., variables from $\mathcal{F}^*_t \subseteq \mathcal{F}_t$. Thus, let
$$g(\theta, Z_t) = \psi(\theta, Y_t)\otimes X_t,$$
where $X_t \in \mathcal{F}_t$ and $Z_t = (Y_t, X_t)$. We suppose that $g$ is of dimension $q$ with $q \ge p$. Then
$$E[g(\theta, Z_t)] = 0 \iff \theta = \theta_0.$$
We then form the sample moment condition
$$G_T(\theta) = \frac{1}{T}\sum_{t=1}^T g(\theta, Z_t).$$
• If $q = p$, the estimator solves $G_T(\hat\theta) = 0$. If $q > p$, let
$$Q_T(\theta) = G_T(\theta)'W_TG_T(\theta),$$
where $W_T$ is a $q\times q$ positive definite weighting matrix, for example $W_T = I_{q\times q}$. Then let $\hat\theta_{GMM}$ minimize $Q_T(\theta)$ over $\theta \in \Theta \subset \mathbb{R}^p$.

• In the regression case, $E(\varepsilon_t|x_t) = 0$ means that $E(\varepsilon_t\cdot h(x_t)) = 0$ for any measurable function $h$. Therefore, take
$$g(\theta, Z_t) = h(x_t)\cdot(y_t - \beta'x_t).$$
In the autoregression case, $E(\varepsilon_t|y_{t-1},\ldots) = 0$ means that $E(\varepsilon_t\cdot h(y_{t-1},\ldots)) = 0$ for any measurable function $h$. Therefore, take
$$g(\theta, Z_t) = h(y_{t-1},\ldots)\cdot(y_t - \gamma y_{t-1}).$$
In this case there are many functions that work.
9.5 Asymptotics

• As before, we have
$$T^{1/2}\left(\hat\theta_{GMM} - \theta_0\right) \stackrel{D}{\to} N\left(0,\, (\Gamma'W\Gamma)^{-1}\Gamma'W\Omega W\Gamma(\Gamma'W\Gamma)^{-1}\right),$$
where:
$$\Omega = \Omega(\theta_0) = \operatorname{Var}\left[T^{1/2}G_T(\theta_0)\right], \qquad \Gamma = \operatorname{plim}_{T\to\infty}\frac{\partial G_T}{\partial\theta}(\theta_0), \qquad W = \operatorname{plim}_{T\to\infty}W_T > 0.$$
Now, however,
$$\Omega(\theta_0) = \lim_{T\to\infty}\operatorname{var}\left[T^{1/2}G_T(\theta_0)\right] = \lim_{T\to\infty}\operatorname{var}\left[T^{-1/2}\sum_{t=1}^T g(\theta_0, Z_t)\right] = \lim_{T\to\infty}\frac{1}{T}E\left[\sum_{t=1}^T\sum_{s=1}^T g(\theta_0, Z_t)g(\theta_0, Z_s)'\right].$$

• In the special case where $g(\theta, Z_t)$ is a martingale difference sequence with respect to past information, i.e., $E[g(\theta, Z_t)|\mathcal{F}_{t-1}] = 0$, where $Z_t \in \mathcal{F}_t$, then
$$\Omega(\theta_0) = \lim_{T\to\infty}E\left[\frac{1}{T}\sum_{t=1}^T g(\theta_0, Z_t)g(\theta_0, Z_t)'\right].$$

• In general, though, you have to take account of the covariance terms. If the vector time series $U_t = g(\theta_0, Z_t)$ is stationary, then
$$\Omega(\theta_0) = \Lambda_0 + \sum_{k=1}^{\infty}\left(\Lambda_k + \Lambda_k'\right),$$
where
$$\Lambda_k = E\left[g(\theta_0, Z_t)g(\theta_0, Z_{t-k})'\right], \qquad \Lambda_k' = E\left[g(\theta_0, Z_{t-k})g(\theta_0, Z_t)'\right]$$
is the covariance function of $U_t$.

• For standard errors and optimal estimation we need an estimator of $\Omega$. The Newey-West estimator is
$$\hat\Omega_T = \sum_{t,s:\,|t-s|\le n(T)}w(|t-s|)\,g(\tilde\theta, Z_t)g(\tilde\theta, Z_s)',$$
where
$$w(j) = 1 - \frac{j}{n+1},$$
and where $\tilde\theta$ is a preliminary estimate of $\theta_0$ obtained using some arbitrary weighting matrix, e.g., $I_q$. The Bartlett weighting ensures a positive definite covariance matrix estimate. Provided $n = n(T) \to \infty$ but $n(T)/T \to 0$ at some rate,
$$\hat\Omega_T \stackrel{P}{\to} \Omega.$$

• This is used to construct standard errors.

• The optimal choice of $W$ should be an estimate of $\Omega^{-1}$. We take $W_T = \hat\Omega_T^{-1}$.
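A scalar sketch of the Newey-West estimator follows. For a stationary AR(1) series $u_t = \phi u_{t-1} + \varepsilon_t$, the true long-run variance is $\sigma^2/(1-\phi)^2 = 4$ for $\phi = 0.5$, $\sigma = 1$; the series, seed, and lag truncation are hypothetical choices.

```python
import numpy as np

# Newey-West long-run variance with Bartlett weights w(j) = 1 - j/(n+1).
rng = np.random.default_rng(5)
T, phi = 20000, 0.5
u = np.zeros(T)
for t in range(1, T):
    u[t] = phi * u[t - 1] + rng.standard_normal()

def newey_west(g, n_lag):
    g = g - g.mean()
    omega = np.mean(g * g)                  # Gamma_0
    for j in range(1, n_lag + 1):
        w = 1.0 - j / (n_lag + 1.0)         # Bartlett weight
        gamma_j = np.mean(g[j:] * g[:-j])   # sample autocovariance at lag j
        omega += 2.0 * w * gamma_j          # adds Gamma_j + Gamma_j'
    return omega

omega_hat = newey_west(u, n_lag=50)
print(omega_hat)    # close to the true long-run variance 4
```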
9.6 Example

• Hansen and Singleton, Econometrica (1982). One of the most influential econometric papers of the 1980s. Intertemporal consumption/investment decision:

- $c_t$ is consumption;
- $u(\cdot)$ is utility, with $u_c > 0$, $u_{cc} < 0$;
- $1 + r_{i,t+1}$, $i = 1,\ldots,m$, is the gross return on asset $i$ at time $t+1$.

• The representative agent solves the following optimization problem:
$$\max_{\{c_t, w_t\}_{t=0}^{\infty}}\sum_{\tau=0}^{\infty}\beta^{\tau}E\left[u(c_{t+\tau})|I_t\right],$$
where

- $w_t$ is a vector of portfolio weights;
- $\beta$ is the discount rate, with $0 < \beta < 1$;
- $I_t$ is the information available to the agent at time $t$.

• We assume that there is a unique interior solution; this is characterized by the following condition:
$$u'(c_t) = \beta E\left[(1 + r_{i,t+1})u'(c_{t+1})|I_t\right], \quad\text{for } i = 1,\ldots,m.$$

• Now suppose that
$$u(c_t) = \begin{cases}\dfrac{c_t^{1-\gamma} - 1}{1-\gamma} & \text{if } \gamma > 0,\ \gamma \ne 1,\\[1ex] \log c_t & \text{if } \gamma = 1.\end{cases}$$
Here, $\gamma$ is the coefficient of relative risk aversion.

• In this case, the first order condition is
$$c_t^{-\gamma} = \beta E\left[(1 + r_{i,t+1})c_{t+1}^{-\gamma}|I_t\right] \quad\text{for } i = 1,\ldots,m.$$

• This implies that
$$E\left[1 - \beta(1 + r_{i,t+1})\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}\,\Big|\, I_t^*\right] = 0$$
for $i = 1,\ldots,m$, where $I_t^* \subseteq I_t$ and $I_t^*$ is the econometrician's information set.

• We want to estimate the parameters $\theta_{p\times 1} = (\beta, \gamma)$ and test whether the theory is valid, given a dataset consisting of $\{c_t, r_{i,t+1}, I_t^*\}_{t=1}^T$.

• Define the $q\times 1$ vector
$$g(\theta, x_t) = \begin{pmatrix}\vdots\\ \left[1 - \beta(1 + r_{i,t+1})\left(\dfrac{c_{t+1}}{c_t}\right)^{-\gamma}\right]z_{jt}\\ \vdots\end{pmatrix},$$
where $z_t \in I_t^* \subseteq \mathbb{R}^J$ are 'instruments', $q = mJ$, and $x_t = (z_t, c_t, c_{t+1}, r_{1,t+1},\ldots,r_{m,t+1})'$.

• Typically, $z_t$ is chosen to be lagged variables, and these are numerous, so that $q \ge p$.

• The model assumption is that $E[g(\theta_0, x_t)] = 0$ for some unique $\theta_0$.

• This is a nonlinear function of $\gamma$.

• Exercise. Show how to consistently estimate $\Gamma$ and $\Omega$ in this case.
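The stacking of the $q = mJ$ moment conditions can be sketched in code. The consumption, return, and instrument series below are synthetic placeholders, present only to show the construction of $g(\theta, x_t)$ and $G_T(\theta)$.

```python
import numpy as np

# Hansen-Singleton moment vector: for each asset i and instrument j, stack
# [1 - beta (1 + r_{i,t+1}) (c_{t+1}/c_t)^{-gamma}] * z_{jt}.
rng = np.random.default_rng(6)
T, m, J = 200, 2, 3
c = np.exp(0.01 + 0.02 * rng.standard_normal(T + 1)).cumprod()      # consumption levels
r = 0.05 + 0.1 * rng.standard_normal((T, m))                        # asset net returns
z = np.column_stack([np.ones(T), rng.standard_normal((T, J - 1))])  # instruments in I_t*

def g(theta, c, r, z):
    beta, gamma = theta
    growth = (c[1:] / c[:-1]) ** (-gamma)            # (c_{t+1}/c_t)^{-gamma}
    psi = 1.0 - beta * (1.0 + r) * growth[:, None]   # T x m pricing errors
    # Kronecker-type stacking: every pricing error times every instrument
    return np.einsum('ti,tj->tij', psi, z).reshape(T, m * J)

G_T = g((0.97, 2.0), c, r, z).mean(axis=0)   # sample moment vector, q = mJ = 6
print(G_T.shape)
```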
Chapter 10

Time Series

10.1 Some Fundamental Properties

• We start with univariate time series $\{y_t\}_{t=1}^T$. There are two main features:

- stationarity/nonstationarity;
- dependence.

• We first define stationarity.

• Strong Stationarity. The stochastic process $y$ is said to be strongly stationary if the vectors $(y_t,\ldots,y_{t+r})$ and $(y_{t+s},\ldots,y_{t+s+r})$ have the same distribution for all $t, s, r$.

• Weak Stationarity. The stochastic process $y$ is said to be weakly stationary if the vectors $(y_t,\ldots,y_{t+r})$ and $(y_{t+s},\ldots,y_{t+s+r})$ have the same mean and variance for all $t, s, r$.

• Most of what we know is restricted to stationary series, but in the last 20 years there have been major advances in the theory of nonstationary time series; see below. In Gaussian [i.e., linear] time series processes, strong and weak stationarity coincide.

• Dependence. One measure of dependence is given by the covariogram [or correlogram]:
$$\gamma_s = \operatorname{cov}(y_t, y_{t-s}); \qquad \rho_s = \frac{\gamma_s}{\gamma_0}.$$

• Note that stationarity was used here in order to assert that these moments depend only on the gap $s$, and not on calendar time $t$ as well.

• For i.i.d. series, $\gamma_s = 0$ for all $s \ne 0$, while for positively (negatively) dependent series, $\gamma_s > (<)\,0$. Economic data often appear to come from positively dependent series.

• Mixing (covariance): $\gamma_s \to 0$ as $s \to \infty$.

• This just says that the dependence [as measured by the covariance] on the past shrinks with the horizon. This is an important property possessed by many models.

• ARMA Models. The following is a very general class of models, called ARMA($p,q$):
$$y_t = \mu + \phi_1y_{t-1} + \ldots + \phi_py_{t-p} + \varepsilon_t - \theta_1\varepsilon_{t-1} - \ldots - \theta_q\varepsilon_{t-q},$$
where $\varepsilon_t$ is i.i.d. with mean zero and variance $\sigma^2$.

- We shall for convenience usually assume that $\mu = 0$.
- We also assume for convenience that this model holds for $t = 0, \pm1, \ldots$.

• It is convenient to write this model using lag polynomial notation. We define the lag operator $L$ by
$$Ly_t = y_{t-1},$$
so that we can now write $A(L)y_t = B(L)\varepsilon_t$, where the lag polynomials are
$$A(L) = 1 - \phi_1L - \ldots - \phi_pL^p, \qquad B(L) = 1 - \theta_1L - \ldots - \theta_qL^q.$$
The reason for this is to save space, and to emphasize the mathematical connection with the theory of polynomials.

• Special case AR(1). Suppose that $y_t = \phi y_{t-1} + \varepsilon_t$. Here, $A(L) = 1 - \phi L$.

- We assume $|\phi| < 1$, which is necessary and sufficient for $y_t$ to be a stationary process.

- Now write $y_{t-1} = \phi y_{t-2} + \varepsilon_{t-1}$. Continuing, we obtain
$$y_t = \varepsilon_t + \phi\varepsilon_{t-1} + \phi^2y_{t-2} = \varepsilon_t + \phi\varepsilon_{t-1} + \phi^2\varepsilon_{t-2} + \ldots = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j},$$
which is called the MA($\infty$) representation of the time series;

- this shows that $y_t$ depends on all the past shocks.

• Now we calculate the moments of $y_t$ using the stationarity property. We have $E(y_t) = \phi E(y_{t-1})$, which can be phrased as
$$\mu = \phi\mu \implies \mu = 0, \quad\text{where } \mu = E(y_t) = E(y_{t-1}).$$
• Furthermore, $\operatorname{var}(y_t) = \phi^2\operatorname{var}(y_{t-1}) + \sigma^2$, which implies that
$$\gamma_0 = \frac{\sigma^2}{1-\phi^2}, \quad\text{where } \gamma_0 = \operatorname{var}(y_t) = \operatorname{var}(y_{t-1}).$$
This last calculation of course requires $|\phi| < 1$, which we are assuming for stationarity.

• Finally, $\operatorname{cov}(y_t, y_{t-1}) = E(y_ty_{t-1}) = \phi E(y_{t-1}^2) + 0$, which implies that
$$\gamma_1 = \phi\frac{\sigma^2}{1-\phi^2},$$
while
$$\operatorname{cov}(y_t, y_{t-2}) = E(y_ty_{t-2}) = \phi E(y_{t-1}y_{t-2}) = \phi^2\frac{\sigma^2}{1-\phi^2}.$$

• In general,
$$\gamma_s = \phi^s\frac{\sigma^2}{1-\phi^2}; \qquad \rho_s = \phi^s.$$
The correlation function decays geometrically towards zero.

• Exercise: calculate the correlogram for an AR(2).

• Moving Average MA(1). Suppose that $y_t = \varepsilon_t - \theta\varepsilon_{t-1}$, where as before the $\varepsilon_t$ are i.i.d. mean zero with variance $\sigma^2$.

- In this case, $E(y_t) = 0$ and $\operatorname{var}(y_t) = \sigma^2(1 + \theta^2)$.
- Furthermore,
$$\operatorname{cov}(y_t, y_{t-1}) = E\left\{(\varepsilon_t - \theta\varepsilon_{t-1})(\varepsilon_{t-1} - \theta\varepsilon_{t-2})\right\} = -\theta E(\varepsilon_{t-1}^2) = -\theta\sigma^2.$$
Therefore,
$$\rho_1 = \frac{-\theta}{1+\theta^2}, \qquad \rho_j = 0, \quad j = 2, \ldots$$

- This is a 1-dependent series; MA($q$) is a $q$-dependent series.

- Note that the process is automatically stationary for any value of $\theta$.

- If $|\theta| < 1$, we say that the process is invertible, and we can write
$$\sum_{j=0}^{\infty}\theta^jy_{t-j} = \varepsilon_t.$$
• In the general ARMA($p,q$) case, we can write $A(L)y_t = B(L)\varepsilon_t$.

- The stationarity condition for an ARMA($p,q$) process is just that the roots of the autoregressive polynomial $1 - \phi_1z - \ldots - \phi_pz^p$ lie outside the unit circle.

- Likewise, the condition for invertibility is that the roots of the moving average polynomial $1 - \theta_1z - \ldots - \theta_qz^q$ lie outside the unit circle.

- Assuming these conditions are satisfied, we can write this process in two different ways:
$$\frac{A(L)}{B(L)}y_t = \sum_{j=0}^{\infty}\pi_jy_{t-j} = \varepsilon_t.$$
This is called the AR($\infty$) representation, and expresses $y$ in terms of its own past. Or
$$y_t = \frac{B(L)}{A(L)}\varepsilon_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}.$$
This is called the MA($\infty$) representation, and expresses $y$ in terms of the past history of the random shocks.
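The correlograms derived above are easy to verify by simulation: for an AR(1), $\rho_s = \phi^s$; for an MA(1), $\rho_1 = -\theta/(1+\theta^2)$ and $\rho_s = 0$ for $s \ge 2$. The parameter values and sample size below are hypothetical.

```python
import numpy as np

# Simulation check of AR(1) and MA(1) autocorrelations.
rng = np.random.default_rng(7)
T, phi, theta = 100000, 0.6, 0.5
eps = rng.standard_normal(T)

y_ar = np.zeros(T)
for t in range(1, T):
    y_ar[t] = phi * y_ar[t - 1] + eps[t]
y_ma = eps[1:] - theta * eps[:-1]

def rho(y, s):
    # sample autocorrelation at lag s
    y = y - y.mean()
    return np.mean(y[s:] * y[:-s]) / np.mean(y * y)

print(rho(y_ar, 1), rho(y_ar, 2))   # about phi = 0.6 and phi^2 = 0.36
print(rho(y_ma, 1), rho(y_ma, 2))   # about -theta/(1+theta^2) = -0.4 and 0
```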
10.2 Estimation

In this section we discuss estimation of the autocovariance function of a stationary time series, as well as of the parameters of an ARMA model.

• Autocovariance. Replace population quantities by sample ones:
$$\hat\gamma_s = \frac{1}{T-s}\sum_{t=s+1}^T(y_t - \bar y)(y_{t-s} - \bar y), \qquad \hat\rho_s = \frac{\hat\gamma_s}{\hat\gamma_0}.$$
These sample quantities are often used to describe the actual series' properties. They are consistent and asymptotically normal.

• Box-Jenkins analysis: 'identification' of the process by looking at the correlogram. In practice, it is hard to identify any but the simplest processes, but the covariance function still has many uses.

• Estimation of ARMA parameters. One can 'invert' the autocovariance/autocorrelation function to compute an estimate. For example, in the AR(1) case, the parameter $\phi$ is precisely the first order autocorrelation. In the MA(1) case, one can show that the parameter $\theta$ satisfies a quadratic equation in which the coefficients are the autocorrelation function at the first two lags.

• A popular estimation method is maximum likelihood under normality. Suppose that
!# $ "# ... %& $ N (0, " I ), 1
2
#T
then
!y $ "# ... %& $ N (0, 1
$)
yT
for some matrix
$.
- for an AR(1) process
*1 " +, = 1 " ( 2
$
( ( 2 · · · ( T !1
...
...
2
...
1
./ ,
- for an M A(1) process $
*1 + = " (1 + - ) , 2
2
!#
1+#2
...
0
0 ./ .
1
• For general ARMA then, the log likelihood function is ` =
"T
1 1 log2% " log |$| " y0 $!1 y. 2 2 2
Maximize with respect to all the parameters , . • Distribution theory. The MLE is consistent and asymptotically normal provided the process is stationary and invertible. 1
b
D
¡ ¢
!1 T 2 (, " ,) ( N 0, I ** ,
where I ** is the information matrix.
• In practice, $|\Sigma|$ and $\Sigma^{-1}$ can be tough to find. We seek a helpful approach to computing the likelihood, and an approximation to it which is even easier to work with.

• The prediction error decomposition is just a factorization of the joint density into the product of a conditional density and a marginal density, $f(x, z) = f(x|z)f(z)$. We use this repeatedly and take logs to give
$$\ell(y_1,\ldots,y_T;\theta) = \sum_{t=p+1}^T\ell(y_t|y_{t-1},\ldots,y_1) + \ell(y_1,\ldots,y_p).$$

• This writes the log likelihood in terms of conditional distributions and a single marginal distribution. In AR cases, the distribution of $y_t|y_{t-1},\ldots,y_1$ is easy to find:
$$y_t|y_{t-1},\ldots,y_1 \sim N\left(\phi_1y_{t-1} + \ldots + \phi_py_{t-p},\ \sigma^2\right).$$

• In the AR(1) case,
$$\ell_{t|t-1} = -\frac{1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y_t - \phi y_{t-1})^2.$$
Also, $y_1 \sim N\left(0, \sigma^2/(1-\phi^2)\right)$, i.e.,
$$\ell(y_1) = -\frac{1}{2}\log\frac{\sigma^2}{1-\phi^2} - \frac{1-\phi^2}{2\sigma^2}y_1^2.$$
Therefore, the full likelihood in the AR(1) case is
$$\ell = -\frac{T-1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^T(y_t - \phi y_{t-1})^2 - \frac{1}{2}\log\frac{\sigma^2}{1-\phi^2} - \frac{1-\phi^2}{2\sigma^2}y_1^2.$$

• Often it is argued that $\ell(y_1)$ is small relative to $\sum_{t=2}^T\ell(y_t|y_{t-1},\ldots,y_1)$, in which case we use
$$-\frac{T-1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^T(y_t - \phi y_{t-1})^2.$$

• This criterion is equivalent to the least squares criterion, and has unique maximum
$$\hat\phi = \frac{\sum_{t=2}^Ty_ty_{t-1}}{\sum_{t=2}^Ty_{t-1}^2}.$$
This estimator is just the OLS of $y_t$ on $y_{t-1}$ [but using the reduced sample]. One can also interpret this as a GMM estimator with moment condition $E[y_{t-1}(y_t - \phi y_{t-1})] = 0$.

• The full MLE will be slightly different from the approximate MLE. In terms of asymptotic properties, the difference is negligible.

- However, in finite samples there can be significant differences.
- Also, the MLE imposes that $|\phi|$ be less than one: as $\phi \to \pm1$, $\ell \to -\infty$. The OLS estimate, however, can be on either side of the unit circle.
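The approximate (conditional) MLE for the AR(1) is just this OLS formula on the reduced sample; a quick simulated check (with hypothetical $\phi$ and $T$) follows.

```python
import numpy as np

# Conditional MLE / OLS for an AR(1): phi-hat = sum y_t y_{t-1} / sum y_{t-1}^2.
rng = np.random.default_rng(8)
T, phi = 5000, 0.8
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.standard_normal()

phi_hat = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])   # OLS of y_t on y_{t-1}, t = 2..T
print(phi_hat)   # close to the true value 0.8
```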
10.3 Forecasting

• Let the sample be $\{y_1,\ldots,y_T\}$. Suppose that
$$y_t = \phi y_{t-1} + \varepsilon_t, \qquad |\phi| < 1,$$
where we first assume that $\phi$ is known.

• We want to forecast $y_{T+1}, y_{T+2}, \ldots, y_{T+r}$ given the sample information. We have $y_{T+1} = \phi y_T + \varepsilon_{T+1}$. Therefore, forecast $y_{T+1}$ by
$$\hat y_{T+1|T} = E[y_{T+1}|\text{sample}] = \phi y_T.$$

• The forecast error is $\varepsilon_{T+1}$, which is mean zero and has variance $\sigma^2$.

• What about forecasting $r$ periods ahead?
$$y_{T+r} = \phi^ry_T + \phi^{r-1}\varepsilon_{T+1} + \ldots + \varepsilon_{T+r}.$$
Therefore, let
$$\hat y_{T+r|T} = \phi^ry_T$$
be our forecast.

• The forecast error $y_{T+r} - \hat y_{T+r|T}$ has mean zero and variance
$$\sigma^2\left(1 + \phi^2 + \ldots + \phi^{2r-2}\right).$$

• Asymptotically, the forecast reverts to the unconditional mean, and the forecast variance reverts to the unconditional variance.

• In practice, we must use an estimate of $\phi$, so that
$$\hat y_{T+r|T} = \hat\phi^ry_T,$$
where $\hat\phi$ is estimated from sample data. If $\phi$ is estimated well, then this will not make much difference.

• Forecast interval:
$$\hat y_{T+r|T} \pm 1.96\cdot SD, \qquad SD = \left[\sigma^2\left(1 + \phi^2 + \ldots + \phi^{2r-2}\right)\right]^{1/2}.$$
This is to be interpreted like a confidence interval. Again, we must replace the unknown parameters by consistent estimates.

• This theory generalizes naturally to AR(2) and higher order AR processes, in which case the forecast is a linear combination of the most recent observations. The question is, how to forecast for an MA(1) process?
$$y_t = \varepsilon_t - \theta\varepsilon_{t-1} = (1 - \theta L)\varepsilon_t.$$
We must use the AR($\infty$) representation
$$\frac{y_t}{1-\theta L} = y_t + \theta y_{t-1} + \ldots = \varepsilon_t.$$
This means that the forecast for MA processes is very complicated and depends on all the sample $y_1,\ldots,y_T$.
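The $r$-step AR(1) forecast and its interval can be sketched directly from these formulas; $\phi$, $\sigma$, and $y_T$ below are hypothetical and treated as known.

```python
import numpy as np

# r-step AR(1) forecast y_{T+r|T} = phi^r y_T with 95% forecast interval.
phi, sigma, yT = 0.8, 1.0, 2.5   # hypothetical known parameter values

def forecast(yT, phi, sigma, r):
    point = phi ** r * yT
    var = sigma ** 2 * sum(phi ** (2 * j) for j in range(r))   # 1 + phi^2 + ... + phi^{2r-2}
    sd = var ** 0.5
    return point, (point - 1.96 * sd, point + 1.96 * sd)

point1, ci1 = forecast(yT, phi, sigma, 1)
point5, ci5 = forecast(yT, phi, sigma, 5)
print(point1, ci1)   # 1-step: point forecast 2.0, interval about (0.04, 3.96)
print(point5, ci5)   # 5-step: nearer the unconditional mean 0, wider interval
```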
10.4 Autocorrelation and Regression

• Regression models with correlated disturbances:
$$y_t = \beta'x_t + u_t,$$
where $x_t$ is exogenous, i.e., is determined outside the system; fixed regressors are an example. There are a number of different variations on this theme: strongly exogenous and weakly exogenous. A weakly exogenous process could include lagged dependent variables. We will for now assume strong exogeneity.

• We also suppose that $E(u_tu_s) \ne 0$ for some $s \ne t$.

• As an example, suppose that
$$\ln GNP = \beta_1 + \beta_2\,\text{time} + u_t.$$
We expect the deviation from trend, $u_t$, to be positively autocorrelated, reflecting the business cycle, i.e., not i.i.d.: a recession quarter tends to be followed by a recession quarter.

• We can write the model in matrix form:
$$y = X\beta + u, \qquad E(uu') = \Sigma = \begin{pmatrix}\gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{T-1}\\ \gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{T-2}\\ \vdots & & \ddots & & \vdots\\ \gamma_{T-1} & \gamma_{T-2} & \cdots & \gamma_1 & \gamma_0\end{pmatrix}.$$

• The consequences for estimation and testing of $\beta$ are the same as with heteroskedasticity: OLS is consistent and unbiased, but inefficient, while the standard errors are wrong.

• Specifically,
$$\operatorname{var}(\hat\beta) = (X'X)^{-1}X'\Sigma X(X'X)^{-1},$$
where
$$X'\Sigma X = \sum_{t=1}^T\sum_{s=1}^Tx_tx_s'\gamma_{|t-s|}.$$
CHAPT CHAPTER ER 10. TIME TIME SERIES SERIES
• A naive implementation of the White strategy is going to fail here, i.e.,
!u "" = X " #
b bb b b b $%% X X b b %& b 2 1
b
3 T
0
u1 u2 · · · u1uT u21 ...
T
T
xt xt0 ut us
X =
t=1 s=1
2 uT
is inconsisten inconsistent. t. This is basically basically becau b ecause se there are too many random random variables in the sample matrix, in fact order T 2 , whereas in the independent but heterogeneous case there were only order T T terms. • The correct approach is to use some downweighting that concentrates weight weight on a smaller fraction fraction of their terms. Bartlett/White Bartlett/White/Newe /Newey/W y/West est SE’s: Replace by sample equivalents and use weights w( j) j ) = 1 " so that
b X X 3 T =
j , n+1
X t X s0 w (|t " s|) ut us .
b b
t,s: t,s:|t!s|&n(T ) T )
This also ensures a positive deÞnite covariance covariance matrix estimate. estimate. Provides consistent standard errors. • An alternative strategy is to parameterize ut by, say, an ARMA process and do maximum likelihood 1 1 0 ! ) $(-)!1 (y " X ! ! ) . ` = " ln |$(-)| " (y (y " X ! 2 2 • E"cient estimate of ! (under Gaussianity) is a sort of GLS
³ ´ b b b 0
!1
! ! M L = X $(-) X
b
!1
X 0 $(-)!1 y,
- . This will be asymptotically e "cient when the where - is the MLE of chosen parametric model is correct.
129
10.5. TESTING FOR AUTOCORRELATION
10.5 Testing for Autocorrelation

• Suppose that we observe $u_t$, which is generated from an AR(1) process
$$u_t = \rho u_{t-1} + \varepsilon_t,$$
where the $\varepsilon_t$ are i.i.d.

• The null hypothesis is that $u_t$ is i.i.d., i.e., $H_0: \rho = 0$ vs. $H_A: \rho \ne 0$. This is used as (a) a general diagnostic, and (b) a test of efficient markets.

• General strategy: use LR, Wald or LM tests to detect departures.

• The LM test is easiest; this is based on
$$LM = T\left(\frac{\sum_t\hat u_t\hat u_{t-1}}{\sum_t\hat u_{t-1}^2}\right)^2 = Tr_1^2 \stackrel{D}{\to} \chi^2_1,$$
where the $\hat u_t$ are the OLS residuals. Therefore, we reject the null hypothesis when $LM$ is large relative to the critical value from $\chi^2_1$.

• This approach is limited to two-sided alternatives. We can, however, also use the signed version, $T^{1/2}r_1$, which satisfies
$$T^{1/2}r_1 \stackrel{D}{\to} N(0, 1)$$
under the null hypothesis.

• The Durbin-Watson statistic is
$$d = \frac{\sum_{t=2}^T(\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^T\hat u_t^2}.$$
This is printed out by many regression packages.

• Using the approximation
$$d \approx 2(1 - r_1),$$
we have [under the null hypothesis]
$$T^{1/2}\left(1 - \frac{d}{2}\right) \stackrel{D}{\to} N(0, 1).$$
CHAPTER 10. TIME SERIES
• Generalization (test against AR( p)). Suppose that ut = * 1 ut!1 + . . . + * p ut! p + #t, where #t are i.i.d. The null hypothesis is that u t is i.i.d., i.e., H 0 : * 1 = . . . = * p = 0 vs. H A some * j = 6 0. • Box-Pierce Q P
Q = T
X
D
r j2 ( .2P .
j=1
10.6
Dynamic Regression Models
• We have looked at pure time series models with dynamic response and at static regression models. In practice, we may want to consider models that have both features. • Distributed lag
q
yt = ' +
X
! j X t! j + ut ,
j=0
[could have q = /], where for now iid
ut $ 0,
"2 .
Captures the idea of dynamic response: a! ect on y of change in x may take several periods to work through. • Temporary change. Suppose that xt ( xt + ! but that future xs are una! ected, then yt ( yt + ! 0 ! yt+1 ( yt + ! 1 ! etc.
• Permanent change. Suppose that x_s \to x_s + \delta for all s \geq t. Then

y_t \to y_t + \beta_0 \delta, \quad y_{t+1} \to y_{t+1} + (\beta_0 + \beta_1) \delta, \quad etc.

- The impact effect is \beta_0 \delta.
- The long-run effect is \delta \sum_{s=0}^{\infty} \beta_s.

• When q is large (infinite) there are too many free parameters \beta_j, which makes estimation difficult and imprecise. To reduce the dimensionality it is appropriate to place restrictions on \beta_j.

- The polynomial lag:

\beta_j = a_0 + a_1 j + \ldots + a_p j^p if j \leq p, and \beta_j = 0 else.

- The geometric lag:

\beta_j = \beta \lambda^j, \quad j = 0, 1, \ldots

for some 0 < \lambda < 1. This implies that

y_t = \alpha + \beta \sum_{j=0}^{\infty} \lambda^j x_{t-j} + u_t
    = \alpha + \beta \sum_{j=0}^{\infty} (\lambda^j L^j) x_t + u_t
    = \alpha + \beta \frac{1}{1 - \lambda L} x_t + u_t.

Therefore,

(1 - \lambda L) y_t = \alpha (1 - \lambda L) + \beta x_t + (1 - \lambda L) u_t,

which is the same as

y_t = \alpha (1 - \lambda) + \lambda y_{t-1} + \beta x_t + u_t - \lambda u_{t-1}.

The last equation is called the lagged dependent variable representation.
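The equivalence between the geometric lag and the lagged dependent variable recursion can be checked numerically. The sketch below (my own; parameter values are arbitrary) feeds a unit impulse in x through the recursion with no noise and confirms that the responses are the geometric weights \beta \lambda^j.

```python
import numpy as np

# Sketch: verify that the lagged dependent variable (Koyck) recursion
#   y_t = alpha*(1-lam) + lam*y_{t-1} + beta*x_t + u_t - lam*u_{t-1}
# has impulse responses beta_j = beta * lam**j, matching the geometric lag.
alpha, beta, lam = 0.0, 2.0, 0.6
T = 10

# Unit impulse in x at t = 0, no noise (u = 0 throughout)
x = np.zeros(T)
x[0] = 1.0
y = np.zeros(T)
y[0] = alpha * (1 - lam) + beta * x[0]
for t in range(1, T):
    y[t] = alpha * (1 - lam) + lam * y[t - 1] + beta * x[t]

geometric_weights = beta * lam ** np.arange(T)
```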
• More generally [ADL model]:

A(L) y_t = B(L) x_t + u_t,

where A, B are polynomials of order p, q, while

C(L) u_t = D(L) \varepsilon_t, \quad \varepsilon_t i.i.d. (0, \sigma^2).

This is a very general class of models; estimation, forecasting, and testing have all been worked out at this level of generality, and one can find accounts in advanced time series texts.
10.7 Adaptive expectations
• Suppose that

y_t = \alpha + \beta x^*_{t+1} + \varepsilon_t,

where y_t is demand and x^*_{t+1} is the expected price, but the expectation is formed at time t and is unobserved by the econometrician.

• We observe x_t, and expectations are revised according to

x^*_{t+1} - x^*_t = (1 - \lambda)(x_t - x^*_t),

i.e., the revision in expectations equals (1 - \lambda) times the forecast error; equivalently,

x^*_{t+1} = \lambda x^*_t + (1 - \lambda) x_t,

the old forecast plus the news.

• Write

(1 - \lambda L) x^*_{t+1} = (1 - \lambda) x_t,

which implies that

x^*_{t+1} = \frac{1 - \lambda}{1 - \lambda L} x_t = (1 - \lambda) [ x_t + \lambda x_{t-1} + \lambda^2 x_{t-2} + \ldots ].

• Therefore,

y_t = \alpha + \frac{\beta (1 - \lambda)}{1 - \lambda L} x_t + \varepsilon_t,

which implies that

y_t = \lambda y_{t-1} + \alpha (1 - \lambda) + \beta (1 - \lambda) x_t + \varepsilon_t - \lambda \varepsilon_{t-1}.

This is an ADL with an MA(1) error term.
10.8 Partial adjustment
• Suppose that y^*_t = \alpha + \beta x_t, where y^*_t is the desired level.

• However, because of costs of adjustment, the actual change is

y_t - y_{t-1} = (1 - \lambda)(y^*_t - y_{t-1}) + \varepsilon_t.

• Substituting, we get

y_t = (1 - \lambda) y^*_t + \lambda y_{t-1} + \varepsilon_t = \alpha (1 - \lambda) + \lambda y_{t-1} + \beta (1 - \lambda) x_t + \varepsilon_t.

This is an ADL with an i.i.d. error term, assuming that the original error term was i.i.d.
10.9 Error Correction
• Suppose the long-run equilibrium is y = \lambda x.

• Disequilibria are corrected according to

\Delta y_t = \beta (y_{t-1} - \lambda x_{t-1}) + \lambda \Delta x_{t-1} + \varepsilon_t,

where \beta < 0.

• This implies that

y_t = (1 + \beta) y_{t-1} + \lambda (1 - \beta) x_{t-1} - \lambda x_{t-2} + \varepsilon_t.
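The levels equation is just an algebraic rearrangement of the error-correction form, which can be confirmed mechanically. A small deterministic check (my own sketch; the parameter values and the I(1) regressor are illustrative assumptions):

```python
import numpy as np

# Sketch: generate data from the error-correction form
#   dy_t = b*(y_{t-1} - lam*x_{t-1}) + lam*dx_{t-1} + e_t,  b < 0,
# and check it matches the implied levels equation
#   y_t = (1+b)*y_{t-1} + lam*(1-b)*x_{t-1} - lam*x_{t-2} + e_t.
rng = np.random.default_rng(2)
b, lam, T = -0.3, 0.8, 200

x = np.cumsum(rng.normal(size=T))       # an I(1) regressor
e = rng.normal(size=T)

# Error-correction recursion
y = np.zeros(T)
for t in range(2, T):
    dy = b * (y[t-1] - lam * x[t-1]) + lam * (x[t-1] - x[t-2]) + e[t]
    y[t] = y[t-1] + dy

# Implied levels recursion, from the same inputs and initial conditions
y_levels = np.zeros(T)
for t in range(2, T):
    y_levels[t] = ((1 + b) * y_levels[t-1]
                   + lam * (1 - b) * x[t-1] - lam * x[t-2] + e[t])
```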
10.10 Estimation of ADL Models
• Suppose that

y_t = \theta_1 + \theta_2 y_{t-1} + \theta_3 x_t + \varepsilon_t,

where we have two general cases regarding the error term: (1) \varepsilon_t is i.i.d. (0, \sigma^2); (2) \varepsilon_t is autocorrelated.

• In case (1), we can use OLS regression to get consistent estimates of \theta_1, \theta_2 and \theta_3. The original parameters are related to the \theta_j in some way, for example

\theta_1 = \alpha (1 - \lambda), \quad \theta_2 = \lambda, \quad \theta_3 = \beta (1 - \lambda).

In this case, we would estimate the original parameters by indirect least squares:

\hat{\lambda} = \hat{\theta}_2, \quad \hat{\alpha} = \frac{\hat{\theta}_1}{1 - \hat{\theta}_2}, \quad \hat{\beta} = \frac{\hat{\theta}_3}{1 - \hat{\theta}_2}.

• In case (2), we must use instrumental variables or some other procedure because OLS will be inconsistent.

- For example, if \varepsilon_t = \eta_t - \theta \eta_{t-1}, then y_{t-1} is correlated with \varepsilon_t through \eta_{t-1}. In this case there are many instruments: (1) all lagged x_t; (2) y_{t-2}, \ldots.

- However, when \varepsilon_t = \rho \varepsilon_{t-1} + \eta_t, with \eta_t i.i.d., lagged y are no longer valid instruments and we must rely on lagged x.
• There are many instruments; efficiency considerations require that one has a good way of combining them, such as in our GMM discussion.

• IV is not generally as efficient as ML when the error terms are normally distributed.
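The case (1) procedure above is straightforward to implement. A Monte Carlo sketch (my own; true parameter values and sample size are illustrative): simulate the ADL with i.i.d. errors, estimate the \theta_j by OLS, then recover \alpha, \beta, \lambda by indirect least squares.

```python
import numpy as np

# Sketch: y_t = th1 + th2*y_{t-1} + th3*x_t + e_t with i.i.d. errors,
# where th1 = alpha*(1-lam), th2 = lam, th3 = beta*(1-lam).
rng = np.random.default_rng(3)
alpha, beta, lam = 1.0, 2.0, 0.5
th1, th2, th3 = alpha * (1 - lam), lam, beta * (1 - lam)

T = 20000
x = rng.normal(size=T)
e = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = th1 + th2 * y[t-1] + th3 * x[t] + e[t]

# OLS of y_t on (1, y_{t-1}, x_t)
Z = np.column_stack([np.ones(T - 1), y[:-1], x[1:]])
th_hat = np.linalg.lstsq(Z, y[1:], rcond=None)[0]

# Indirect least squares recovery of the structural parameters
lam_hat = th_hat[1]
alpha_hat = th_hat[0] / (1 - th_hat[1])
beta_hat = th_hat[2] / (1 - th_hat[1])
```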
10.11 Nonstationary Time Series Models
• There are many different ways in which a time series y_t can be nonstationary. For example, there may be fixed seasonal effects such that

y_t = \sum_{j=1}^{m} D_{jt} \gamma_j + u_t,

where D_{jt} are seasonal dummy variables, i.e., one if we are in season j and zero otherwise. If u_t is an i.i.d. mean zero error term,

E y_t = \sum_{j=1}^{m} D_{jt} \gamma_j

and so varies with time. In this case there is a sort of periodic movement in the time series but no 'trend'.

• We next discuss two alternative models of trending data: trend stationary and difference stationary.

• Trend stationary. Consider the following process:

y_t = \mu + \beta t + u_t,

where \{u_t\} is a stationary mean zero process, e.g., A(L) u_t = B(L) \varepsilon_t with the polynomials A, B satisfying the usual conditions required for stationarity and invertibility. This is the trend + stationary decomposition.

- We have

E y_t = \mu + \beta t; \quad var(y_t) = \sigma^2 for all t.

The lack of stationarity comes only through the mean.
- The shocks (u_t) are transitory: they last for some period of time and are then forgotten as y_t returns to trend.
- Example: GNP grows at 3% per year (on average) for ever after.

• Difference stationary, I(1):

y_t = \mu + y_{t-1} + u_t,

where \{u_t\} is a stationary process. This is called the random walk plus drift. When \mu = 0, we have the plain vanilla random walk.

- We can't now suppose that the process has been going on for an infinite amount of time, and the starting condition is of some significance.
- We can make two assumptions about the initial condition: either y_0 is fixed, or y_0 is a random variable N(0, v) for some variance v.
- Any shocks have permanent effects:

y_t = y_0 + t \mu + \sum_{s=1}^{t} u_s.

The differenced series is then

\Delta y_t = y_t - y_{t-1} = \mu + u_t.

• Both the mean and the variance of this process are generally explosive:

E y_t = y_0 + t \mu; \quad var(y_t) = \sigma^2 t.

If \mu = 0, the mean does not increase over time but the variance does.

• Note that differencing in the trend stationary case gives

\Delta y_t = \beta + u_t - u_{t-1},

which is a unit root MA. So although differencing apparently eliminates the nonstationarity, it induces non-invertibility. Likewise, detrending the difference stationary case is not perfect.
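The linear variance growth var(y_t) = \sigma^2 t is easy to see by simulation. A sketch (my own; the number of paths and the horizon are arbitrary choices):

```python
import numpy as np

# Sketch: for a driftless random walk y_t = y_{t-1} + u_t with
# u_t ~ iid N(0, sigma^2) and y_0 = 0, check var(y_t) = sigma^2 * t
# across many simulated paths.
rng = np.random.default_rng(4)
n_paths, horizon, sigma = 50000, 100, 1.0

u = rng.normal(scale=sigma, size=(n_paths, horizon))
y_end = u.sum(axis=1)                # y_t = sum_{s=1}^t u_s at t = horizon

sample_var = y_end.var()             # should be close to sigma^2 * horizon
```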
• A model that nests both trend stationary and difference stationary is

y_t = \mu + \beta t + u_t, \quad u_t = \rho u_{t-1} + \eta_t,

where \eta_t is a stationary ARMA process. We have

y_t = \mu + \beta t + \rho ( y_{t-1} - \mu - \beta (t - 1) ) + \eta_t.

When \rho = 1 this reduces to the random walk plus drift \beta (the driftless random walk if also \beta = 0), while |\rho| < 1 gives the trend stationary case.
10.12 Estimation
• Effect of a time trend on estimation: you get superconsistent (rate T^{3/2}) estimates of \beta, but Gaussian t-tests are still valid.

• Effect of a unit root: superconsistent estimates, but with nonstandard distributions: t-tests are not valid!

• Suppose that y_t = \rho y_{t-1} + u_t, where u_t \sim (0, \sigma^2). Then

\hat{\rho}_{OLS} = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} \to_p \rho.

- If |\rho| < 1,

T^{1/2} (\hat{\rho} - \rho) \to_D N(0, 1 - \rho^2).

- If \rho = 1, then 1 - \rho^2 = 0, so the implied variance above is zero. So what happens in this case? If \rho = 1,

T (\hat{\rho} - \rho) \to_D X,

where X is not Gaussian; it is asymmetric and in fact E(\hat{\rho}) < 1 for all T. The rate of convergence is faster, T rather than T^{1/2}, but the asymptotic distribution is nonstandard.

- Dickey and Fuller (1981) derived the distribution of \hat{\rho} and of the corresponding t-statistic, t_{\hat{\rho}}, when \rho = 1, and tabulated them.
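These features of the unit root case — downward bias, E(\hat{\rho}) < 1, and an asymmetric limiting distribution of T(\hat{\rho} - 1) — show up clearly in a small Monte Carlo. A sketch (my own; the sample size and number of replications are illustrative):

```python
import numpy as np

# Sketch: simulate random walks (rho = 1), estimate rho by OLS without
# intercept, and collect T*(rho_hat - 1). The distribution should put
# most of its mass below zero (E(rho_hat) < 1).
rng = np.random.default_rng(5)
T, n_reps = 200, 2000

stats = np.empty(n_reps)
for i in range(n_reps):
    y = np.cumsum(rng.normal(size=T))            # random walk
    rho_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
    stats[i] = T * (rho_hat - 1.0)
```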
10.13 Testing for Unit Roots
• Suppose that y_t = \mu + \rho y_{t-1} + u_t, where the process u_t is i.i.d. By taking differences we obtain

\Delta y_t = \mu + \gamma y_{t-1} + u_t

with \gamma = \rho - 1.

• Testing \rho = 1 is therefore equivalent to testing \gamma = 0 in the model

\Delta y_t = \mu + \gamma y_{t-1} + u_t.

We do a one-sided test, H_0: \gamma = 0 vs. H_A: \gamma < 0, because the explosive alternatives are not interesting.

• Dickey and Fuller (1979) tabulated the distribution of the least squares estimator \hat{\gamma} and its associated t-statistic in the case \rho = 1, i.e., \gamma = 0. This is exactly the null case. Their critical values can be used to do the test. Large negative values of the test statistic are evidence against the null hypothesis.

• For the regression with intercept, the critical values of the t-statistic are -3.43 and -2.86 at the 1% and 5% levels respectively.

• If you do it without the intercept, i.e., run the regression

\Delta y_t = \gamma y_{t-1} + u_t,

the critical values are -2.58 and -1.95 at the 1% and 5% levels respectively. This assumes that the null hypothesis is the driftless random walk.

• One can also do a test based on the raw estimate \hat{\gamma} rather than the t-ratio.
• The DF test is only valid if the error term u_t is i.i.d. One has to adjust for serial correlation in the error terms to get a valid test. The augmented Dickey-Fuller (ADF) test allows the error term to be correlated over time up to a certain order. The test is based on estimating the regression

\Delta y_t = \mu + \gamma y_{t-1} + \sum_{j=1}^{p-1} \delta_j \Delta y_{t-j} + \eta_t

by least squares and using the ADF critical values for \hat{\gamma}, or rather its t-ratio.

• One can also add trend terms to the regression. The Phillips-Perron (PP) test is an alternative way of correcting for serial correlation in u_t.

• Applications
10.14 Cointegration
• Suppose y_t and x_t are I(1) but there is a \beta such that y_t - \beta x_t is I(0); then we say that y_t and x_t are cointegrated.

- For example, aggregate consumption and income appear to be nonstationary processes, but appear to deviate from each other in only a stationary fashion, i.e., there exists a long-run equilibrium relationship about which there are only stationary deviations.
- Note that \beta is not necessarily unique.

• One can estimate the cointegrating parameter \beta by an OLS regression of y_t on x_t; although the estimator is consistent, the distribution theory is again nonstandard, but it has been tabulated.

• More general system. Suppose that y_t = (y_{1t}', y_{2t}')' \in R^{k_1 + k_2} and that

y_{1t} = \beta' y_{2t} + u_t
y_{2t} = y_{2t-1} + \eta_t.
If u_t and \eta_t are mutually uncorrelated, then we call the system triangular. Special results apply in this case. This model assumes knowledge of the number of cointegrating relations, i.e., k_1, and it makes a particular normalization.

• One can use Johansen's test for the presence of cointegration and for the number of cointegrating relations. If we have a k-vector unit root series y_t, there can be no cointegrating relations, one, \ldots, or k - 1 cointegrating relations. Johansen tests these restrictions sequentially to find the right number of cointegrating relations in the data.
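The superconsistency of the OLS estimator of the cointegrating parameter is striking in simulation. A sketch (my own; the bivariate design and parameter values are illustrative assumptions):

```python
import numpy as np

# Sketch: simulate a cointegrated pair
#   x_t = x_{t-1} + eta_t    (I(1)),
#   y_t = beta*x_t + u_t     (u_t stationary, so y - beta*x is I(0)),
# and estimate beta by OLS of y on x.
rng = np.random.default_rng(6)
T, beta = 5000, 1.5

x = np.cumsum(rng.normal(size=T))    # I(1) regressor
u = rng.normal(size=T)               # stationary deviation
y = beta * x + u

beta_hat = np.sum(x * y) / np.sum(x ** 2)
```

Because \hat{\beta} converges at rate T rather than T^{1/2}, the estimate is extremely accurate even in moderate samples, although its limiting distribution is nonstandard.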
10.15 Martingales
• We say that the process y_t is a martingale if

E[y_t | I_{t-1}] = y_{t-1} a.s.,

where I_{t-1} is the information available at time t - 1, for example I_{t-1} = \{y_{t-1}, \ldots\}; i.e.,

y_t = y_{t-1} + u_t,

where u_t is a martingale difference sequence and satisfies E[u_t | I_{t-1}] = 0 a.s. The process u_t may be heterogeneous but is uncorrelated.

- Hall (1978): Consumption is a martingale.
- Fama: Stock prices are martingales, i.e.,

E(P_{t+1} | P_t, \ldots) = P_t.

This is a bit too strong and is unsupported by the data.

• The assumption of unforecastability rules out serial correlation in \varepsilon_t and hence in returns r_t, but it does not by itself say anything more about the distribution of \varepsilon_t. That is, \varepsilon_t could be heterogeneous and nonnormal. It could itself be nonstationary; for example, \varepsilon_t independent over time with \varepsilon_t \sim N(0, f(t)) is consistent with the efficient markets hypothesis. However, it is frequently assumed that the error term is itself a stationary process.
10.16 GARCH Models
• Engle (1982) introduced the following class of models:

r_t = \varepsilon_t \sigma_t,

where \varepsilon_t is i.i.d. (0, 1), while

\sigma_t^2 = var(r_t | F_{t-1})

is the (time-varying) conditional variance.

• For example,

\sigma_t^2 = \alpha + \gamma r_{t-1}^2,

which is the ARCH(1) model. Provided \gamma < 1, the process r_t is weakly stationary and has finite unconditional variance \sigma^2 = E(\sigma_t^2) < \infty, where

\sigma^2 = \alpha + \gamma \sigma^2 = \frac{\alpha}{1 - \gamma}.

• This uses the law of iterated expectations, E(Y) = E(E(Y | I)), to argue that

E(r_{t-1}^2) = E( E(\varepsilon_{t-1}^2 | I_{t-1}) \sigma_{t-1}^2 ) = E(\sigma_{t-1}^2) = \sigma^2.

• The unconditional distribution of r_t is thick-tailed; that is, even if \varepsilon_t is normally distributed, r_t is going to have an unconditional distribution that is a mixture of normals and is more leptokurtic. Suppose \varepsilon_t is standard normal; then E(\varepsilon_t^4) = 3 and

\mu_4 = E(r_t^4) = E(\varepsilon_t^4 \sigma_t^4) = 3 E(\sigma_t^4),

where

E(\sigma_t^4) = E[ (\alpha + \gamma r_{t-1}^2)^2 ] = \alpha^2 + \gamma^2 \mu_4 + 2 \alpha \gamma \sigma^2.
Therefore,

\mu_4 = 3 ( \alpha^2 + \gamma^2 \mu_4 + 2 \alpha \gamma \sigma^2 )
      = \frac{3 (\alpha^2 + 2 \alpha \gamma \sigma^2)}{1 - 3 \gamma^2} \geq 3 \sigma^4 = \frac{3 \alpha^2}{(1 - \gamma)^2},

provided 3 \gamma^2 < 1 so that the fourth moment exists.
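Both implications — unconditional variance \alpha/(1 - \gamma) and excess kurtosis — can be checked by simulating an ARCH(1) process. A sketch (my own; \alpha = 0.2 and \gamma = 0.3 are arbitrary values chosen so that the fourth and higher moments used below exist):

```python
import numpy as np

# Sketch: simulate ARCH(1), r_t = eps_t * sigma_t with
# sig2_t = a + g * r_{t-1}^2, and compare the sample variance with
# a/(1-g) and the sample kurtosis with the Gaussian value 3.
rng = np.random.default_rng(7)
a, g, T = 0.2, 0.3, 200000

eps = rng.normal(size=T)
r = np.zeros(T)
sig2_t = a / (1 - g)                 # start at the unconditional variance
for t in range(T):
    r[t] = eps[t] * np.sqrt(sig2_t)
    sig2_t = a + g * r[t] ** 2

uncond_var = a / (1 - g)
sample_var = r.var()
sample_kurt = np.mean(r ** 4) / sample_var ** 2   # > 3: leptokurtic
```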
• The process r_t is uncorrelated, i.e., cov(r_t, r_{t-s}) = 0 for all s \neq 0. However, the process r_t is dependent, so that

E( g(r_t) h(r_{t-s}) ) \neq E( g(r_t) ) E( h(r_{t-s}) )

for some functions g, h; certainly for g(r) = h(r) = r^2 independence fails.

• We can write the process as an AR(1) in r_t^2, i.e.,

r_t^2 = \alpha + \gamma r_{t-1}^2 + \eta_t,

where \eta_t = r_t^2 - \sigma_t^2 is a mean zero innovation that is uncorrelated with its past.

• Therefore, since \gamma > 0, the volatility process is positively autocorrelated, i.e., cov(\sigma_t^2, \sigma_{t-j}^2) > 0. Hence we get volatility clustering.

• We can rewrite the process as

\sigma_t^2 - \sigma^2 = \gamma ( r_{t-1}^2 - \sigma^2 ).

Suppose that \sigma_{t-1}^2 = \sigma^2. When we get a large shock, i.e., \varepsilon_{t-1}^2 > 1, we get \sigma_t^2 > \sigma^2, but the process decays rapidly back to \sigma^2 unless we get a sequence of large shocks \varepsilon_{t-1+s}^2 > 1, s = 0, 1, 2, \ldots. In fact, for a normal distribution the probability of \varepsilon^2 > 1 is only about 0.32, so we generally see little persistence.
• Although the ARCH model implies volatility clustering, it does not in practice generate enough.

• Generalize to ARCH(p):

\sigma_t^2 = \alpha + \sum_{j=1}^{p} \gamma_j r_{t-j}^2,

where p is some positive integer and \gamma_j are positive coefficients.

• This model is fine, but estimation is difficult. When p is large one finds that the coefficients are imprecisely estimated and can be negative. One has to impose some restrictions on the coefficients.

• Instead, GARCH(1,1):

\sigma_t^2 = \alpha + \beta \sigma_{t-1}^2 + \gamma r_{t-1}^2,

where \alpha, \beta, \gamma are positive.

- We have

\sigma_t^2 = \frac{\alpha}{1 - \beta} + \gamma \sum_{j=1}^{\infty} \beta^{j-1} r_{t-j}^2,

so that it is an infinite order ARCH model with geometric decline in the coefficients.

- If \gamma + \beta < 1, then the process r_t is weakly stationary, i.e., the unconditional variance exists, and \sigma^2 = E(\sigma_t^2) < \infty, where

\sigma^2 = \alpha + \beta \sigma^2 + \gamma \sigma^2 = \frac{\alpha}{1 - (\beta + \gamma)}.

- Surprisingly, even for some values of \beta, \gamma with \gamma + \beta \geq 1, the process \sigma_t^2 is strongly stationary, although the unconditional variance does not exist in this case.
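The ARCH(\infty) representation of GARCH(1,1) can be verified numerically by comparing the recursion with a truncated version of the infinite sum. A sketch (my own; the parameter values and truncation point are illustrative, with \beta < 1 making the truncation error negligible):

```python
import numpy as np

# Sketch: check that the GARCH(1,1) recursion
#   sig2_t = a + b*sig2_{t-1} + g*r_{t-1}^2
# matches its ARCH(infinity) representation
#   sig2_t = a/(1-b) + g * sum_{j>=1} b^(j-1) * r_{t-j}^2
# once the infinite sum is truncated at many lags.
rng = np.random.default_rng(8)
a, b, g, T = 0.1, 0.85, 0.1, 2000

eps = rng.normal(size=T)
r = np.zeros(T)
sig2 = np.zeros(T)
sig2[0] = a / (1 - b - g)                 # unconditional variance
r[0] = eps[0] * np.sqrt(sig2[0])
for t in range(1, T):
    sig2[t] = a + b * sig2[t - 1] + g * r[t - 1] ** 2
    r[t] = eps[t] * np.sqrt(sig2[t])

# ARCH(infinity) form at the last date, truncated at J lags
J = 500
last = T - 1
arch_inf = a / (1 - b) + g * sum(b ** (j - 1) * r[last - j] ** 2
                                 for j in range(1, J + 1))
```

The remaining discrepancy is of order b^J, which for b = 0.85 and J = 500 is far below floating-point noise.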