AMEMIYA Takeshi Qualitative Response Model (Article - Jstor)

American Economic Association

Qualitative Response Models: A Survey
Author(s): Takeshi Amemiya
Source: Journal of Economic Literature, Vol. 19, No. 4 (Dec., 1981), pp. 1483-1536
Published by: American Economic Association
Stable URL: http://www.jstor.org/stable/2724565


Journal of Economic Literature Vol. XIX (December 1981), pp. 1483-1536

Qualitative Response Models: A Survey

By TAKESHI AMEMIYA Stanford University

* This work was supported by National Science Foundation Grant SES 79-12965 at the Institute for Mathematical Studies in the Social Sciences, Stanford University. A. Colin Cameron and Stephen M. Fazzari assisted the author in the preparation of a list of empirical papers. Mr. Cameron also corrected English and made helpful suggestions. The assistance of Bronwyn H. Hall in the writing of the Appendix on a guide to computer programs is gratefully acknowledged. I am grateful to the following people for reading the first draft (Technical Report No. 338, May 1981, IMSSS, Stanford University) and giving me valuable comments: M. J. Boskin, G. Chamberlain, S. R. Cosslett, A. S. Goldberger, J. A. Hausman, J. J. Heckman, L. F. Lee, G. S. Maddala, T. MaCurdy, D. McFadden, F. D. Nelson, F. C. Nold, J. H. Pencavel, J. L. Powell, H. S. Rosen, N. E. Savin, P. Schmidt, and A. Zellner. Their suggestions have greatly improved the paper, and the remaining shortcomings are largely due to my inability to incorporate their suggestions.

I. Introduction

One of the most important developments in econometrics in the past ten years has occurred in the area of qualitative response models, henceforth to be abbreviated as QR models, also known as quantal, categorical, or discrete models. In these statistical models the endogenous random variables take only discrete values. One can find numerous applications of QR models in recent issues of economic journals. Their topics range over a great variety of discrete economic decisions, as can be seen in the following list of references under various classifications.1

Labor force participation: Gronau [1976]. Gunderson [1974]. Heckman and Willis [1977]. Kahn and Morimune [1979]. Long and Jones [1980]. Medoff [1979]. Nickell [1979]. Parsons [1980a and b]. Pencavel [1979]. Schiller and Weiss [1979]. Smith [1979].

1 This list is not meant to be comprehensive, since the search has mainly concentrated on the recent issues of economic journals. See McFadden [1974 and 1976a] for more references to applications, especially papers on modal choice published in transportation journals.


Choice of occupation: Boskin [1974]. Schmidt and Strauss [1975].
Job and firm location: Duncan [1980]. Osten [1979]. Wilensky and Rossiter [1978].
Union membership: Duncan and Stafford [1980]. L. F. Lee [1978]. Schmidt and Strauss [1976]. Warren and Strauss [1979].
Migration: Akin, Guilkey, and Sickles [1979]. Bartel [1979]. Da Vanzo [1978]. Fields [1979]. Hughes [1980].
Transportation choice: Domencich and McFadden [1975]. Hausman and Wise [1978]. Lave [1970]. L. F. Lee [1977]. McGillivray [1972]. Quandt [1968]. Talvitie [1972]. Warner [1962]. Watson and Westin [1975]. Westin [1974].
Purchase of consumer durables: Cragg and Uhler [1970]. Dubin and McFadden [1980]. Hausman [1979]. Parks [1980]. Wu [1965].
Housing: David and Legg [1975]. Li [1977]. Rosen and Rosen [1980]. Uhler [1968].
Birth: Ben-Porath [1976]. Heckman and Willis [1975]. Powers, Marsh, Huckfeldt, and Johnson [1978].
Schooling: Hill [1979]. Kohn, Manski, and Mundel [1976]. Radner and Miller [1970]. Willis and Rosen [1979].
Legislation and voting: Deacon and Shapiro [1975]. Fair [1978]. Heckman [1976]. Kau and Rubin [1978]. Moore, Newman, and Thomas [1974]. Silberman and Durden [1976]. Tollefson and Pichler [1974].
Criminology: Goldberg and Nold [1980]. Witte and Schmidt [1979].
Others: Betancourt and Clague [1978] (work-shifts). Blomquist [1979] (use of seat belts). Dudley and Montmarquette [1976] (foreign aid). Figlewski [1979] (horse racing). Fisher and Modigliani [1977] (sentiment toward inflation). Hutchens [1979] (remarriage). Lazear [1979] (mandatory retirement). M. L. Lee [1962] (installment borrowing). McFadden [1976c] (selection of highways). Nerlove and Press [1973] (farming techniques). Oksanen and Williams [1978] (firm characteristics). Parks [1977] (automobile scrapping rates). Perloff and Wachter [1979] (growth classes of firms). Silberman and Talley [1974] (bank chartering).

Qualitative response models have been extensively used in biometric applications for a much longer time than they have been used in economic applications. Biometricians use the models to study, for example, the effect of an insecticide on the survival or death of an insect, or the effect of a drug on the recovery or nonrecovery of a patient. The kind of QR model used by biometricians is usually the simplest kind: a univariate dichotomous dependent variable (survival or death) and a single independent variable (dosage).

There are two factors which explain the recent upsurge of QR models in economic applications: (1) Economists deal with many variables which are either naturally discrete or recorded discretely, as one can see in the list of classifications above. However, relationships among economic variables are not as simple as those among biological variables. Therefore, economists need to formulate more complex models involving more than one discrete variable and more than two responses, as well as using more independent variables. The estimation of such complex QR models has only recently been made possible by the development of advanced computer technology. (2) An increasingly large number of sample surveys have been recently conducted and their results made readily available on magnetic tapes.

In this paper I will present what I consider to be the basic facts about QR models. I believe that QR models are so important in economics that every applied researcher should acquire at least a cursory knowledge of these facts. I will start with the discussion of the simplest


model, the model for a univariate dichotomous dependent variable (Section 2), and then move on to multi-response (some say multinomial or polychotomous) and multivariate (meaning more than one discrete dependent variable) models (Sections 3 and 4). Most of Section 2, except 2.C, is rudimentary. Sections 2.C, 3, and 4 are novel, not in the sense that I obtain new results but in the sense that I organize results in my own way and give a critical evaluation.

I will pay attention to the following three problems: (1) how to specify a model which is consistent with economic theory and which is at the same time statistically manageable, (2) how to estimate and test hypotheses on the parameters of a model, and (3) what criteria to use for choosing among competing models. Topic (2) is discussed in Sections 2.B and 3.A. The reader who finds these sections too technical may skip them without affecting the understanding of the other sections. The reader should bear in mind that the field is still growing and there are still many unsolved issues in each of the three categories above, especially (1) and (3). The discussion of each topic will be illustrated by examples taken from the aforementioned list of references.

Econometric textbooks devote only a small portion, if any, to the discussion of QR models, and as yet there is no book which contains a general discussion of the kind of QR models which are useful for economists. There are several textbooks and monographs on QR models written specifically for other disciplines such as biology, psychology, sociology, or engineering. For example, see Cox [1970], Finney [1971], Plackett [1974], Haberman [1974, 1978, and 1979], Bishop, Fienberg, and Holland [1975], and Daganzo [1979]. These books (except the last) are mainly concerned with the analysis of joint discrete dependent variables without considering the effect of independent variables in a regression-like framework, and consequently their usefulness to the economist is rather limited. Daganzo comes closest to filling the need of the economist, but his book cannot be regarded as a general discussion of QR models as he is specifically concerned with one area of application (transport modal choice) and one particular model (multinomial probit).

The following is a list of econometric textbooks which discuss QR models, with the relevant page numbers: Goldberger [1964, pp. 248-255], Theil [1971, pp. 628-636], Maddala [1977, pp. 162-171], Dhrymes [1978, pp. 324-352], Judge, Griffiths, Hill, and T. C. Lee [1980, pp. 583-621], and Pindyck and Rubinfeld [1981, pp. 273-318]. There are two excellent mutually complementary survey articles by McFadden [1974 and 1976a] and one short survey by Amemiya [1975]. Chapter 5 of Domencich and McFadden [1975] makes good introductory reading for the estimation problems of QR models. A forthcoming book edited by Manski and McFadden [1981b] contains chapters contributed by various authors discussing the latest developments in many areas of the QR model. Some of the chapters are individually cited in this paper.

I had to omit the discussion of many important topics concerning QR models because of my desire to limit the scope of this survey to the fundamental results. The following is a list of the most notable omissions, together with a couple of representative references for each topic: (1) distribution-free estimation methods: Manski [1975] and Cosslett [1981]; (2) choice-based sampling: Manski and Lerman [1977] and Cosslett [1978]; (3) joint distribution of a continuous and a discrete variable: Heckman [1978] and Lee [1978]; and (4) discrete panel data analysis: Chamberlain [1980] and Heckman [1981].

In the Appendix, I have attached a guide to available computer programs for QR models.




II. Univariate Dichotomous Models

A. Model Specification

(1) Definition of the Model: Suppose we want to consider the occurrence or non-occurrence of an event such as "an insect dies," "a patient recovers," "a consumer buys a car," "a man drives a car to work," etc. It is mathematically convenient to define a dichotomous random variable $y$ which takes the value of 1 if the event occurs and 0 if it does not. (Though any other pair of real numbers could be used, the choice of 1 and 0 is especially convenient.) We assume that the probability of an event depends on a vector of independent variables $x^*$ and a vector of unknown parameters $\theta$. Using the subscript $i$ to denote the $i$-th individual, we can write a univariate dichotomous model generally as

(2.1)  $P_i \equiv P(y_i = 1) = G(x_i^*, \theta), \quad i = 1, 2, \ldots, n.$

We will assume that the random variables $y_i$ are independently distributed throughout the paper. Equation (2.1) merely states, for example, that the probability that the $i$-th insect dies depends on the dosage $x_i^*$ of an insecticide, or that the probability that the $i$-th consumer buys a car depends on the vector $x_i^*$ representing his income and other personal characteristics, as well as the prices.

We will consider the problem of choosing an appropriate function $G$ for a given set of data. However, (2.1) is too general to be useful. The problem of model specification is made somewhat more manageable at the cost of some generality if the researcher chooses a certain function $H(x_i^*, \theta)$, which is known up to the parameter vector $\theta$, and sets out to find the right function $F$ in the model

(2.2)  $P(y_i = 1) = F[H(x_i^*, \theta)].$

Most researchers choose a linear specification

(2.3)  $H(x_i^*, \theta) = x_i'\beta$

so that we can write

(2.4)  $P(y_i = 1) = F(x_i'\beta), \quad i = 1, 2, \ldots, n.$

Throughout the rest of Section 2, I will deal with the model (2.4), rather than the more general (2.1). Though (2.4) is clearly more restrictive than (2.1), I should point out that, as in the linear regression model, (2.4) is more general than it appears, since the vector of independent variables $x_i$ may be transformations of the original variables $x_i^*$.

(2) Three Common Forms of Probability Function: The functional forms of $F$ most frequently used in applications are the following:

Linear Probability (LP) Model: $F(w) = w.$

Probit Model: $F(w) = \Phi(w) \equiv \int_{-\infty}^{w} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt.$

Logit Model: $F(w) = L(w) \equiv \dfrac{e^{w}}{1 + e^{w}}.$
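As a minimal sketch (an illustration, not part of the original survey), the three probability functions can be written in Python using only the standard library. The function names `lp`, `probit`, and `logit` are illustrative choices, and $\Phi$ is computed from the error function `math.erf`.

```python
import math


def lp(w: float) -> float:
    """Linear probability model: F(w) = w (not restricted to [0, 1])."""
    return w


def probit(w: float) -> float:
    """Probit model: F(w) = Phi(w), the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))


def logit(w: float) -> float:
    """Logit model: F(w) = L(w) = e^w / (1 + e^w), written as 1 / (1 + e^-w)."""
    return 1.0 / (1.0 + math.exp(-w))


if __name__ == "__main__":
    # Evaluate the three forms at a few values of the index w = x'beta.
    for w in (-2.0, 0.0, 0.5, 2.0):
        print(f"w={w:+.1f}  LP={lp(w):+.4f}  probit={probit(w):.4f}  logit={logit(w):.4f}")
```

For index values outside the unit interval, `lp` returns numbers that are not valid probabilities, whereas `probit` and `logit` always return values between 0 and 1.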

Though some other models will be briefly discussed in Section C(2), our main concern throughout this paper will be these three models (especially probit and logit) and their generalizations.

The linear probability model has an obvious defect in that $x_i'\beta$ is not constrained to lie between 0 and 1 as a probability should. Though this defect can be corrected by defining $F = 1$ if $F(x_i'\beta) > 1$ and $F = 0$ if $F(x_i'\beta) < 0$, the procedure produces unrealistic kinks at the truncation points. Nevertheless, it has frequently been used in econometric applications, especially in the early years, because of its computational simplicity. Though I do not recommend its use in the final stage of a study, it may be used for the purpose of obtaining quick estimates in a preliminary stage. For this reason I will give rough conversion formulae below, which will enable the researcher to compare LP estimates with probit or logit estimates.

Since $Ey_i = P(y_i = 1)$, we have in the LP model

(2.5)  $y_i = x_i'\beta + u_i,$

where $u_i = y_i - Ey_i$. This is a heteroscedastic regression model since $Vu_i = Vy_i = x_i'\beta(1 - x_i'\beta)$, using the variance formula for a binomial variable, provided $0 < x_i'\beta < 1$. Therefore, the least squares (LS) method yields consistent and unbiased estimates of $\beta$ and the weighted least squares (WLS) method yields consistent and asymptotically more efficient estimates of $\beta$. These points are more fully discussed in Goldberger [1964]. The reader should be warned, however, that neither the LS nor the WLS estimator avoids the inherent weakness of the model mentioned above, and that the WLS procedure fails if the condition $0 < x_i'\beta < 1$ is not met. In view of the fact that the main use of the LP model is in a preliminary study, I think it better to use LS rather than WLS. However, one should be aware of the fact that the standard deviations of the LS estimates given by the standard LS program are biased because of the heteroscedasticity.

TABLE 1
$\Phi(w)$, $L_{1.6}(w)$, and $0.5 + 0.4w$ for Various Values of $w$

w      Φ(w)     L_1.6(w)   0.5 + 0.4w
0.0    .5       .5         .5
0.1    .5398    .5399      .54
0.2    .5793    .5793      .58
0.3    .6179    .6177      .62
0.4    .6554    .6548      .66
0.5    .6915    .6900      .70
0.6    .7257    .7231
0.7    .7580    .7540
0.8    .7881    .7824
0.9    .8159    .8085
1.0    .8413    .8320
2.0    .9772    .9608
3.0    .9987    .9918

The probability functions used for the probit and logit models are the standard normal distribution function and the logistic distribution function respectively. Being distribution functions, they are bounded between 0 and 1. Both the normal and the logistic distributions are symmetric around 0 and have variances equal to 1 and $\pi^2/3$, respectively. Consider a transformed logistic distribution

(2.6)  $L_\lambda(w) = \dfrac{e^{\lambda w}}{1 + e^{\lambda w}}.$

By choosing an appropriate value of $\lambda$ one can make (2.6) closely approximate the standard normal distribution over a wide domain. One might think $\lambda = \pi/\sqrt{3}$ should do best because it gives the logistic distribution with zero mean and unit variance, but by trial and error one finds that $\lambda = 1.6$ does better. Table 1 gives the values of $\Phi(w)$ and $L_{1.6}(w)$ for various values of $w$. The table shows that the distribution functions are very close in the mid-range but the logistic distribution has slightly heavier tails. Because of the close similarity of the two distributions, it is difficult to distinguish between them statistically unless one has an extremely large number of observations (Chambers and Cox, 1967). Thus, in the univariate dichotomous model, it does not matter much whether one uses a probit model or a logit model, except in cases where data are heavily concentrated in the tails due to the characteristics of the problem being studied. In multi-response or multivariate models to be discussed in Sections 3 and 4, however, the probit and logit models differ from each other more substantially.

Suppose that one fits a probit model and obtains estimates $\hat\beta_\Phi$ of the coefficients on $x_i$ and fits a logit model and obtains $\hat\beta_L$. Then, since $L_{1.6}(w) \cong \Phi(w)$ as demonstrated in Table 1, one should have approximately

(2.7)  $1.6\,\hat\beta_\Phi \cong \hat\beta_L.$
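The comparison in Table 1 and the choice of $\lambda$ can be checked numerically. The following sketch is an illustration rather than the author's own computation: it evaluates $\Phi(w)$ and the transformed logistic distribution $L_\lambda(w)$ of (2.6), and compares the largest absolute deviation from $\Phi$ for $\lambda = 1.6$ and $\lambda = \pi/\sqrt{3}$. The grid range $[-3, 3]$, the grid size, and the helper names are assumptions made for the example.

```python
import math


def norm_cdf(w: float) -> float:
    """Standard normal distribution function Phi(w), via the error function."""
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))


def logistic_cdf(w: float, lam: float = 1.0) -> float:
    """Transformed logistic distribution L_lambda(w) = e^(lam*w) / (1 + e^(lam*w))."""
    return 1.0 / (1.0 + math.exp(-lam * w))


def max_abs_dev(lam: float, lo: float = -3.0, hi: float = 3.0, steps: int = 600) -> float:
    """Largest |Phi(w) - L_lambda(w)| on an evenly spaced grid over [lo, hi]."""
    grid = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return max(abs(norm_cdf(w) - logistic_cdf(w, lam)) for w in grid)


if __name__ == "__main__":
    # Recompute a few rows of Table 1.
    for w in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(f"{w:.1f}  {norm_cdf(w):.4f}  {logistic_cdf(w, 1.6):.4f}")
    # Compare the two candidate scale factors.
    print("max |Phi - L_1.6|       =", round(max_abs_dev(1.6), 4))
    print("max |Phi - L_pi/sqrt(3)| =", round(max_abs_dev(math.pi / math.sqrt(3.0)), 4))
```

On this grid the maximum deviation is smaller for $\lambda = 1.6$ than for $\lambda = \pi/\sqrt{3}$; a different criterion or domain could favor a different value.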

The above formula should be useful if one needs a quick way to compare probit and logit estimates. In Table 1, I have also given a linear approximation of $\Phi(w)$ or $L_{1.6}(w)$, which works well for the range of probabilities between 30 and 70 percent. Denoting the LP estimates by $\hat\beta_{LP}$, we have approximately over the above range

(2.8)  $\hat\beta_{LP} \cong 0.4\,\hat\beta_\Phi$ except for the constant term,
       $\hat\beta_{LP} \cong 0.4\,\hat\beta_\Phi + 0.5$ for the constant term,

and

(2.9)  $\hat\beta_{LP} \cong 0.25\,\hat\beta_L$ except for the constant term,
       $\hat\beta_{LP} \cong 0.25\,\hat\beta_L + 0.5$ for the constant term.

The reader should keep in mind that (2.7)-(2.9) constitute only a rough approximation and that a different set of formulae may work better over a different domain. When one wants to compare models with different probability functions, it is generally better to compare probabilities directly rather than comparing the estimates of the coefficients even after an appropriate conversion.

An alternative way of comparing different models is to look at the derivatives of the probabilities with respect to a particular independent variable. Let $x_{ik}$ be the $k$-th element of the vector $x_i$ and let $\beta_k$ be the $k$-th element of the parameter vector $\beta$. Then, the derivatives for the three probability models are given by

(2.10)  $\dfrac{\partial}{\partial x_{ik}}\, x_i'\beta = \beta_k,$

        $\dfrac{\partial}{\partial x_{ik}}\, \Phi(x_i'\beta) = \phi(x_i'\beta)\,\beta_k,$

        $\dfrac{\partial}{\partial x_{ik}}\, L(x_i'\beta) = \dfrac{e^{x_i'\beta}}{(1 + e^{x_i'\beta})^2}\,\beta_k,$

where $\phi$ is the standard normal density function. The right-hand sides of (2.10) may be evaluated at various values of $x_i$, or at the sample mean of $x_i$ as $i$ ranges from 1 to $n$, if a unique estimate of the derivative is desired. Note that the conversion formulae (2.7), (2.8), and (2.9) are approximately obtained from (2.10) if the right-hand sides are evaluated at $x_i'\beta = 0$.

A similarity between the probit ML and the LP-WLS estimates is noted by Hill [1979] in a study of the probability of dropping out of a high school. A similarity between the logit ML and the LP-LS estimates is noted by Wilensky and Rossiter [1978] in their study of the probability of a Michigan trained physician staying in Michigan and by Pencavel [1979] in his study of labor force participation in the Seattle and Denver income maintenance experiments. See Example 2.6 below for the actual values of Pencavel's estimates.

(3) Theoretical Foundation of QR Models: So far we have studied the purely mathematical properties of the LP, probit, and logit models, neglecting both the statistical and the economic-theoretic aspects of QR models. In the present section I will discuss how a QR model arises from certain behavioral assumptions regarding economic decision makers. Statistical issues will be discussed in Section B. Before considering econometric examples of QR models, I will consider some biometric examples to learn how the QR models which arise in biometrics differ from those which arise in econometrics.

Example 2.1. Suppose that a dosage $x_i$ (actually the logarithm of dosage is used as $x_i$ in most studies) of an insecticide is


given to the $i$-th insect and one wants to study how the probability of the $i$-th insect dying is related to the dosage $x_i$. (In practice, individual insects are not identified, and a certain dosage $x_t$ is given to each of $n_t$ insects in group $t$. However, the present analysis is easier to understand if we proceed as if each insect could be identified.) In order to formulate this model, it is useful to assume that each insect possesses its own tolerance against a particular insecticide and dies when the dosage level exceeds the tolerance. Suppose that the tolerance $y_i^*$ of the $i$-th insect is an independent drawing from a distribution identical for all insects. Moreover, if the tolerance is a result of many independent and individually inconsequential additive factors, one may reasonably assume $y_i^* \sim N(\mu, \sigma^2)$ because of the central limit theorem. Defining $y_i = 1$ if the $i$-th insect dies and $y_i = 0$ otherwise, we have

(2.11)  $P(y_i = 1) = P(y_i^* < x_i) = \Phi\!\left(\dfrac{x_i - \mu}{\sigma}\right),$

giving rise to a probit model $\Phi(\beta_1 + \beta_2 x_i)$ where $\beta_1 = -\mu/\sigma$ and $\beta_2 = 1/\sigma$. If, on the other hand, one assumes that $y_i^*$ has a logistic distribution with mean $\mu$ and variance $\sigma^2$, one gets a logit model

(2.12)  $P(y_i = 1) = L\!\left[\dfrac{\pi}{\sqrt{3}}\left(\dfrac{x_i - \mu}{\sigma}\right)\right].$

Similarly, a uniform distribution for $y_i^*$ implies a LP model. The probit model was the first popular model which emerged in the biometric applications of QR models, partly because of the above-mentioned central limit theorem argument and partly because of the widespread popularity of the normal distribution in statistics in general. Later, the logit model gained popularity, partly because of a vigorous promotion by Berkson [1951]. However, as I noted above, a


choice between the two models is unimportant because of their similarity; this is especially true if a researcher experiments with various specifications for $H(x_i^*, \theta)$, instead of setting it summarily as in (2.3). Even in terms of computation costs, the estimation of the probit model is no more costly than the logit model at the present state of computer technology.

Example 2.2 (Ashford and Sowden, 1970.) A coal miner develops breathlessness ($y_i = 1$) when his tolerance $y_i^*$ is less than an unknown constant $\gamma$. If one assumes that $y_i^*$ is distributed as normal or logistic with mean $\alpha_1 + \alpha_2 x_i$ and variance $\sigma^2$, where $x_i$ is the coal miner's age, one has a probit model $P_i = \Phi(\beta_1 + \beta_2 x_i)$ with $\beta_1 = (\gamma - \alpha_1)/\sigma$ and $\beta_2 = -\alpha_2/\sigma$, or a logit model $P_i = L(\beta_1 + \beta_2 x_i)$ with $\beta_1 = \pi(\gamma - \alpha_1)/(\sigma\sqrt{3})$ and $\beta_2 = -\pi\alpha_2/(\sigma\sqrt{3})$. (Note that the four parameters $\gamma$, $\alpha_1$, $\alpha_2$, and $\sigma$ cannot all be identified because only two functions of them can be estimated by either model. However, a researcher is usually not interested in identifying the original parameters. Once he sets up a probit or logit model, he is only concerned with how the probability of the event in question is related to the independent variables, the coal miner's age in this instance. Thus, without loss of generality, one can adopt a normalization: for example, $\gamma = 0$ and $\sigma = 1$.)

In the next few examples we will consider how QR models are specified in economic applications. A majority of the economic examples of QR models which I listed in Section 1 involve events whose outcomes are determined by the decision of an economic unit, be it a consumer, a producer, or a government. (There are exceptions, such as Figlewski's model [1979], where outcomes are controlled by horses.) Economists assume that an economic unit makes rational decisions so as to maximize its utility. For example, in the model of Goldberg and Nold [1980], a burglar chooses the house that gives him a maximum expected return (taking into account the probability of being arrested). Thus, economic QR models usually differ from biological QR models in the basic theoretical foundation, though they may seem alike in mathematical form. An insect does not choose to die in order to maximize its utility! (However, some might argue that there is no essential difference between the two types of models by noting that a consumer's tolerance against the temptation to buy a car diminishes with his income.)

Example 2.3 (Domencich and McFadden, 1975.) Let us consider the decision of a person regarding whether he drives a car or travels by transit to work. We assume that the utility associated with each mode of transport is a function of the mode characteristics $z$ (mainly the time and the cost incurred by the use of the mode) and the individual's socio-economic characteristics $w$, plus an additive error term $\epsilon$. We define $U_{i1}$ and $U_{i0}$ as the $i$-th person's indirect utilities associated with driving a car and travelling by transit respectively. Then, assuming a linear function, we have

(2.13)  $U_{i0} = \alpha_0 + z_{i0}'\beta + w_i'\gamma_0 + \epsilon_{i0}$

and

(2.14)  $U_{i1} = \alpha_1 + z_{i1}'\beta + w_i'\gamma_1 + \epsilon_{i1}.$

The basic assumption is that the $i$-th person drives a car if $U_{i1} > U_{i0}$ and travels by transit if $U_{i1} < U_{i0}$. (There is indecision if $U_{i1} = U_{i0}$, but this happens with zero probability if $\epsilon_{i1}$ and $\epsilon_{i0}$ are continuous random variables.) Thus, defining $y_i = 1$ if the $i$-th person drives a car, we have

(2.15)  $P(y_i = 1) = P(U_{i1} > U_{i0})$
        $= P[\epsilon_{i0} - \epsilon_{i1} < \alpha_1 - \alpha_0 + (z_{i1} - z_{i0})'\beta + w_i'(\gamma_1 - \gamma_0)]$
        $= F[(\alpha_1 - \alpha_0) + (z_{i1} - z_{i0})'\beta + w_i'(\gamma_1 - \gamma_0)],$

where $F$ is the distribution function of $\epsilon_{i0} - \epsilon_{i1}$. Therefore, the kind of QR model one gets is determined by the distribution one assumes for $\epsilon_{i0} - \epsilon_{i1}$. For example, a probit (logit) model arises from assuming the normal (logistic) distribution for $\epsilon_{i0} - \epsilon_{i1}$.
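The link between the error distribution and the resulting QR model can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not part of the original paper: the systematic utility difference is fixed at an arbitrary value `v = 0.8`, and the errors are drawn either as independent type I extreme value (Gumbel) variables, whose difference is logistic, or as independent $N(0, 1/2)$ variables, whose difference is standard normal. The simulated share choosing the car is then compared with $L(v)$ and $\Phi(v)$. All function names are hypothetical.

```python
import math
import random


def gumbel(rng: random.Random) -> float:
    """Draw from the standard type I extreme value (Gumbel) distribution."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def normal_half(rng: random.Random) -> float:
    """Draw from N(0, 1/2), so that the difference of two draws is N(0, 1)."""
    return rng.gauss(0.0, math.sqrt(0.5))


def simulated_share(v: float, draw_error, n: int, rng: random.Random) -> float:
    """Share of draws with U_i1 > U_i0, i.e. with e_i0 - e_i1 < v."""
    hits = 0
    for _ in range(n):
        e0, e1 = draw_error(rng), draw_error(rng)
        if e0 - e1 < v:
            hits += 1
    return hits / n


def logit(w: float) -> float:
    return 1.0 / (1.0 + math.exp(-w))


def probit(w: float) -> float:
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    v, n = 0.8, 200_000
    print("Gumbel errors :", simulated_share(v, gumbel, n, rng), "vs L(v)   =", logit(v))
    print("Normal errors :", simulated_share(v, normal_half, n, rng), "vs Phi(v) =", probit(v))
```

Only the distribution of the difference $\epsilon_{i0} - \epsilon_{i1}$ matters here; the Gumbel assumption is one convenient way of generating a logistic difference, not the only one.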

Several remarks are in order regarding the nature of stochastic utilities we have adopted.

(1) The idea of stochastic utilities seems to go back to Thurstone [1927]. References to some of the early works can be found in Marschak [1960]. One of the earliest applications of stochastic utility theory to modal choice analysis can be found in Quandt [1968]. Quandt assumes a random coefficients utility model $U_{ij} = \alpha H_{ij}^{\beta} C_{ij}^{\gamma}$, $j = 0$ and 1, where $H_{ij}$ and $C_{ij}$ signify the time and the cost of the $j$-th mode for the $i$-th person, and $\alpha$, $\beta$, and $\gamma$ (representing random taste variation among individuals) are independently distributed with exponential densities $a_1 e^{-a_1\alpha}$, $a_2 e^{-a_2\beta}$, and $a_3 e^{-a_3\gamma}$ respectively. The parameters to be estimated from data are $a_1$, $a_2$, and $a_3$. Quandt's distributional assumption, which is dictated by the positivity of $\alpha$, $\beta$, and $\gamma$, leads to a complicated QR model. The normality assumption on $\alpha$, $\beta$, and $\gamma$ would lead to a simpler probit-type model. Such a model was used by Hausman and Wise [1978] and Daganzo [1979] in the multi-response context. Here, however, I follow the exposition of McFadden [1974 and 1976a] as well as Domencich and McFadden [1975]. McFadden interprets the error terms as those mode characteristics and individual socio-economic characteristics which are unobservable to the researcher. This is analogous to the omitted variables interpretation of the error term in a regression model, which suggests that a zero correlation between $\epsilon_{i1}$ and $\epsilon_{i0}$ is not reasonable.

The existence of such correlation causes no complication in the present model since the form of the probability function depends only on the distribution of the difference $\epsilon_{i0} - \epsilon_{i1}$, as we have seen above.

It is assumed that the utilities of different individuals are independently distributed. This assumption may not always be met in reality since the unobservable $\epsilon$'s may be correlated among certain individuals. However, the assumption is made throughout the paper for the sake of convenience, since it would usually be impossible to specify a correct correlation structure.

(2) Another important feature of the utilities (2.13) and (2.14) is that they depend on the mode characteristics (or attributes) $z_{i1}$ and $z_{i0}$. For an exposition of this view of the utility function see, for example, Lancaster [1966] in the general context and Quandt and Baumol [1966] specifically in the context of transport modal choice. This type of utility function has proved very useful in modal choice analysis as well as in other applications of QR models. The significance of this approach is that one does not pay any attention to specific physical entities such as bus or car or train and instead recognizes only a bundle of services which each mode provides to each individual. Baumol and Quandt call this bundle of services an "abstract mode." This concept makes the modal choice model widely applicable, even for some future mode whose physical entity cannot be known, as long as its attributes such as time, cost, and comfort can be predicted. I will come back to this point again when I discuss multi-response models, where the advantage of using the concept of abstract modes characterized only by their attributes becomes clearer. Note that we have put the subscript $i$ on $z$. This is necessary because the time and cost of a mode vary with individuals: a car driven by A is not the same car as driven by B, and the distances of travel to work will vary among individuals. Finally I should point out that having different constant terms in (2.13) and (2.14) is not entirely


consistent with the idea of abstract modes, for it means that we are recognizing the effect of a specific mode beyond the effect of the attributes. For this reason, the purist among abstract mode believers would assume $\alpha_1 = \alpha_0$, and consequently the constant term drops out of (2.15). We have included it for a pragmatic reason, since the fit is usually much better if we include it.

(3) Note that $\gamma_1$ and $\gamma_0$ appear in (2.15) only as the difference $\gamma_1 - \gamma_0$. Therefore, any variable representing a socio-economic characteristic should not appear in the final formulation if its coefficient is the same for different modes.

I will present the actual estimates of the parameters that appear in (2.15) reported by Domencich and McFadden [1975, p. 159]. They chose a logit specification for $F$. We will write (2.15) as $P(y_i = 1) = L(x_i'\beta)$. The model was estimated by the maximum likelihood method using survey data on 115 individual trip-makers in a part of Pittsburgh in 1967. The survey gives information regarding which mode was used by an individual, the time and cost incurred in the actual trip, the time and cost that would have been incurred if the alternative mode had been used, and the individual's socio-economic characteristics. The results are as follows:

(2.16)  $x_i'\hat\beta = \underset{(0.51)}{-3.82} + \underset{(0.05)}{0.158}\,TW - \underset{(0.25)}{0.382}\,(AIV - TSS) - \underset{(0.58)}{2.56}\,(AC - F) + \underset{(1.07)}{4.94}\,A/W - \underset{(1.37)}{2.91}\,R - \underset{(1.17)}{2.36}\,Z,$

where the numbers in the parentheses are the estimated asymptotic standard deviations.2 The variables are defined as follows:

TW = transit walk time (in minutes)
AIV = auto in-vehicle time
TSS = transit station-to-station time
AC = auto parking charges plus vehicle operating costs (in dollars)
F = transit fare
A/W = autos per worker in the household
R = race (0 if white, 1 if non-white)
Z = occupation (0 if blue-collar, 1 if white-collar).

2 See (2.34) for the formula for the asymptotic variance-covariance matrix of ML estimates in QR models.

Note that the constant term and some of the socio-economic variables are included in their specification, but income, which is surely a determinant, must have dropped out because of a common coefficient in the two utility functions.

Though most economic QR models can be interpreted as arising from the utility maximization of an economic unit, a reference to utilities need not always be made explicit. In the modal choice model of the preceding example, one could define an index of the propensity to favor a car over transit for the $i$-th person, denoted $y_i^*$, and postulate $y_i^* = x_i'\beta - \xi_i$, where $x_i$ is the vector of the variables that appear in the right-hand side of (2.16) and $\xi_i$ corresponds to $\epsilon_{i0} - \epsilon_{i1}$. Then, a QR model is uniquely determined by specifying the distribution of $\xi_i$ and defining $y_i = 1$ if and only if $y_i^* > 0$. Note that a numerical value of $y_i^*$ is never observed: the researcher only observes its sign. In economic QR models, the existence of a continuous unobservable index, such as the propensity to favor a car, is often postulated without explicit reference to utility maximization, as the next two examples demonstrate.

Example 2.4 (Wu, 1965.) Following the hypothesis of dynamic stock adjustment, Wu postulates that the $i$-th person's propensity to buy a durable good at year $t$ is given by

(2.17)

y*4= si*- si,t_1+

dit- st,

where $s^*$ is the desired stock, $s$ is the actual stock, and $d$ is depreciation. We have $y_{it} = 1$ if $y_{it}^* > 0$ as before. Wu implicitly assumes a uniform distribution for $\varepsilon_{it}$, so an LP model is specified. Wu performs the least squares regression of $y_{it}$ on various independent variables which serve as proxies for $s^*$, $s$, and $d$.

Example 2.5 (L. F. Lee, 1978.) Lee defines the propensity of the i-th worker to join a union by (2.18) Yi =6 1 +

12 W[

Wio

+ Xi'3 -ei where $W_{i1}$ and $W_{i0}$ are the union and nonunion wage rates, and $x_i$ is a vector of characteristics of the i-th worker as well as the attributes of the industry where the worker is employed. The worker joins a union ($y_i = 1$) if and only if $y_i^* > 0$. Lee assumes a normal distribution for $\varepsilon_i$ and estimates the parameters by the probit ML estimator.

In some instances, a researcher may wish to specify more directly how the independent variables affect the probability of a given event without explicitly referring to utility maximization, or even to a propensity index, as in the following example.

Example 2.6 (Pencavel, 1979.3) The main concern of Pencavel's study is to see if the working decisions of husbands and wives were affected by the Seattle and Denver income maintenance experiments. Here I mention only a small part of his study, where he estimates the probability of a wife working, using observations on 1657 families during the two experimental years. (Observations are also available for the pre-experimental year but they are not used in the particular equation I report here.) The independent variables are defined as follows:

F = 1 if the family is an experimental family and 0 if a control
L = 1 if the husband worked at all during the pre-experimental year and 0 otherwise
Y = 1 if the observation is drawn from the second experimental year and 0 if drawn from the first experimental year
U = 1 if the husband experienced any unemployment during the year.

Pencavel reports both the LS estimates obtained under an LP model, denoted by $\hat\beta_{LP}$, and the logit ML estimates, denoted by $\hat\beta_{L}$. They are given as follows:

(2.19)  $x_i'\hat\beta_{LP} = -0.069\,F + 0.497\,L + 0.055\,Y + 0.043\,U + 0.036\,F \cdot L - 0.08\,F \cdot Y - 0.041\,F \cdot U$

(2.20)  $x_i'\hat\beta_{L} = \underset{(0.156)}{-0.327}\,F + \underset{(0.126)}{2.305}\,L + \underset{(0.121)}{0.309}\,Y + \underset{(0.131)}{0.25}\,U + \underset{(0.169)}{0.097}\,F \cdot L - \underset{(0.167)}{0.452}\,F \cdot Y - \underset{(0.175)}{0.239}\,F \cdot U,$

where the numbers in the parentheses are the estimated asymptotic standard errors. Constant terms were included in the model but their estimates were not reported. Pencavel notes that the estimated probabilities produced by the two models are similar. (An interesting feature of Pencavel's model is that all the independent variables as well as the dependent variable are dichotomous. I will discuss this sort of model again in Example 4.3.)

3 This paper has been revised in May 1981 and given a new title, "Unemployment and the Labor Supply Effects of the Seattle-Denver Income Maintenance Experiments," and will be published in R. G. Ehrenberg, ed., Research in Labor Economics, JAI Press, Greenwich, Connecticut. The LP estimates reported here no longer appear in the revised version.
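The index formulation $y_i^* = x_i'\beta - \varepsilon_i$ used in Examples 2.3 to 2.5 can be illustrated by simulation. The following is a minimal sketch, not taken from any of the cited studies: the index value 0.5, the sample size, and the seed are arbitrary assumptions. It draws $\varepsilon_i$, keeps only the sign of $y_i^*$, and compares the share of ones with the probability implied by the assumed distribution (logistic for logit, standard normal for probit).

```python
import math
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, for reproducibility only

index = 0.5      # hypothetical value of x_i'beta shared by all simulated units
n = 200_000      # hypothetical number of simulated individuals

# Logistic errors: observing only the sign of y* gives P(y = 1) = L(x'beta).
eps_logistic = rng.logistic(loc=0.0, scale=1.0, size=n)
y_logit = (index - eps_logistic > 0).astype(int)
share_logit = y_logit.mean()
prob_logit = 1.0 / (1.0 + math.exp(-index))

# Standard normal errors: the same sign rule gives P(y = 1) = Phi(x'beta).
eps_normal = rng.standard_normal(n)
y_probit = (index - eps_normal > 0).astype(int)
share_probit = y_probit.mean()
prob_probit = 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

print(f"logit : simulated share {share_logit:.4f}, L(x'beta)   {prob_logit:.4f}")
print(f"probit: simulated share {share_probit:.4f}, Phi(x'beta) {prob_probit:.4f}")
```

The simulated shares should lie close to the two theoretical probabilities, which is all that the choice of the distribution of $\varepsilon_i$ determines once the index is given.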


B. Estimation and Hypothesis-Testing

Let $y$ be a dichotomous dependent variable and $x$ be a vector of independent variables. We will associate the event that the vector $(y, x')$ takes on a particular vector value with the word cell. In discussing the estimation and hypothesis-testing of QR models, it is important to distinguish between the following two cases: (1) the case where there are many observations per cell, and (2) the case where there are only a few observations per cell. The use of the word many is ambiguous. It would be slightly less ambiguous if I said "sufficiently many for a large sample theory to work." As a rule of thumb, 30 observations per cell may be regarded as sufficient.4 If we consider a concrete example of a modal choice model (Example 2.3), many observations per cell means that there are many drivers and transit-riders with the same commuting time, the same parking fee, the same transit fare, the same socio-economic characteristics, and so on. When there are so many independent variables, one needs an inordinately large number of observations in order to have many (say, 30) observations per cell. For example, if there are three independent variables each of which takes five distinct values, one needs at least a total of 7,500 observations on the vector $(y, x')$ (that is, $2 \times 5^3 = 250$ cells with 30 observations each). Therefore, the case of many observations per cell does not occur frequently, particularly in economic applications.

4 The more relevant quantity to consider is the product of the cell size ($n$) and the probability that the outcome associated with the cell occurs ($p$). Hoel [1971, p. 82] states that the normal approximation of the binomial distribution is good if $np > 5$.

The estimation of a QR model is simpler if there are many observations per cell. To illustrate this fact, let us assume that there is a single independent variable $x$




[Figure 1. Observations $(y_i, x_i)$ in the case of few observations per cell]

[Figure 2. Observations $(\hat P_{(t)}, x_{(t)})$ in the case of many observations per cell]

and we want to estimate a probability function $F$ in the relationship

(2.21)  $P(y_i = 1) = F(x_i), \quad i = 1, 2, \ldots, n.$

(Usually, $F$ is specified up to a few parameters which characterize the distribution as we have seen before, so the estimation of $F$ means the estimation of those parameters.) In the case of few observations per cell, observations $(y_i, x_i)$ will be scattered like the crosses on the $y$, $x$ plane in Figure 1 above. On the other hand, suppose that $x$ takes on five distinct values $x_{(1)}, x_{(2)}, \ldots, x_{(5)}$, and there are many observations on $y$ for each value of $x$. Let $I_t$ be the set of integers such that $i \in I_t$ if and only if $x_i = x_{(t)}$, $t = 1, 2, \ldots, 5$. Define $\hat P_{(t)} = \sum_{i \in I_t} y_i / n_t$ where $n_t$ is the number of integers contained in $I_t$. In words, $\hat P_{(t)}$ is the relative frequency of the event $y_i = 1$ for all the $i$'s for which $x_i = x_{(t)}$. It can be shown that $\hat P_{(t)}$, $t = 1, 2, \ldots, 5$, constitute the sufficient statistics of the model. (This means that $\hat P_{(t)}$ contain as much information about the model as that contained in individual observations $y_i$.) The crosses in Figure 2 represent the five points $(\hat P_{(t)}, x_{(t)})$ on the plane. A look at the two figures should convince the reader that it is easier to determine $F$ in the case of Figure 2.

If a researcher had control over the values of the independent variables, he should be advised to design an experiment so that many observations per cell will be produced if possible. Thus, a biometrician studying the effect of an insecticide should determine several fixed dosages (this determination is also an important problem in the design of experiments) and administer each dosage to many insects. However, economists normally do not face this situation. Economists can create many observations per cell by grouping data according to artificially created intervals, but this procedure may not be advisable since grouping reduces information.

There are also situations where $(y_i, x_i)$ are not observable for every individual $i$ but only their averages in each group $t$, $\left(n_t^{-1}\sum_{i \in I_t} y_i,\; n_t^{-1}\sum_{i \in I_t} x_i\right)$, are observable. In such a case the researcher should define $x_{(t)} = n_t^{-1}\sum_{i \in I_t} x_i$ and assume $x_i = x_{(t)}$ for all $i \in I_t$: in other words, he has to proceed on the assumption that he has the case of many observations per cell. (See Example 3.4.) However, any additional information on the distribution of $y_i$ or $x_i$ within cells may permit improved estimation. For a further discussion of this point, see McFadden and Reid [1975].

The major aim of this section is to discuss the maximum likelihood (ML) estimator and the minimum chi-square (MIN $\chi^2$) estimator for QR models. The ML estimator can be used either in the case of few observations per cell or many observations


per cell, whereas the MIN $\chi^2$ estimator can be effectively used only in the case of many observations per cell. The rest of this section is divided into three subsections: (1) Few observations per cell, (2) Many observations per cell, and (3) Modified MIN $\chi^2$ estimator.

(1) Few Observations Per Cell: The model we consider throughout Section 2.B is (2.4), whether there are many observations per cell or not. Results will be obtained without specifying $F$, but special features of the three common models (LP, probit, and logit) will be indicated whenever relevant.

First, I will discuss the ML estimator. The likelihood function of the model is given by

(2.22)  $L = \prod_{i=1}^{n} F(x_i'\beta)^{y_i}\,[1 - F(x_i'\beta)]^{1 - y_i}$

and its natural logarithm is given by

(2.23)  $l = \sum_{i=1}^{n} y_i \log F(x_i'\beta) + \sum_{i=1}^{n} (1 - y_i)\log[1 - F(x_i'\beta)].$

The ML estimator $\hat\beta_{ML}$ is defined as the value of $\beta$ that maximizes either (2.22) or (2.23). Differentiating $l$ with respect to the column vector $\beta$ yields a column vector of derivatives

(2.24)  $\dfrac{\partial l}{\partial \beta} = \sum_{i=1}^{n} \dfrac{y_i - F(x_i'\beta)}{F(x_i'\beta)[1 - F(x_i'\beta)]}\, f(x_i'\beta)\, x_i,$

where $f$ denotes the derivative of $F$. In probit or logit models (i.e., if $F = \Phi$ or $L$), it can be shown that $\hat\beta_{ML}$ is a solution of the equation

(2.25)  $\dfrac{\partial l}{\partial \beta} = 0.$

In these models it can be also shown that $l$ is globally concave (which implies that $l$ has a single local maximum) and therefore, a solution to (2.25) is unique if it is bounded.5 A useful implication of this fact is that any iterative procedure which is guaranteed to converge to a stationary point also converges to the global maximum in these models.

Since (2.25) is nonlinear in $\beta$, the ML estimator must be obtained by an iterative method. In order to define an iterative method as well as to derive the asymptotic variance-covariance matrix, we need second-order derivatives of $l$ and their expectation. Differentiating the column vector $\partial l/\partial\beta$ with respect to the row vector $\beta'$ yields a matrix of second-order derivatives

(2.26)  $\dfrac{\partial^2 l}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n} \left[\dfrac{y_i - F(x_i'\beta)}{F(x_i'\beta)[1 - F(x_i'\beta)]}\right]^2 f^2(x_i'\beta)\, x_i x_i' + \sum_{i=1}^{n} \dfrac{y_i - F(x_i'\beta)}{F(x_i'\beta)[1 - F(x_i'\beta)]}\, f'(x_i'\beta)\, x_i x_i',$

where $f'$ is the derivative of $f$. When we take the expectation of (2.26), the second term of the right-hand side drops out and we get

(2.27)  $E\,\dfrac{\partial^2 l}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n} \dfrac{f^2(x_i'\beta)}{F(x_i'\beta)[1 - F(x_i'\beta)]}\, x_i x_i'.$

5 An unbounded solution may occur, for example, if $y_i = 1$ for all $i$ or if $y_i = 0$ for all $i$.

6 The consistency and the asymptotic normality of the ML estimator under general models involving i.i.d. (independent and identically distributed) samples are proved in Rao [1973, pp. 364-366]. We do not have an i.i.d. sample in our model because the distribution of $y_i$ depends on $x_i$, which varies with $i$. However, Rao's proofs can be easily modified to accommodate this situation. A general proof of consistency and asymptotic normality of ML estimators of QR models is in the appendix of Manski and McFadden [1981a].

It is well known6 that under general conditions the ML estimator is consistent



and asymptotically normal with the asymptotic variance-covariance matrix equal to $-\left(E\,\partial^2 l/\partial\beta\,\partial\beta'\right)^{-1}$. Therefore, we have, denoting the asymptotic variance-covariance matrix operator by $V$,

(2.28)  $V\hat\beta_{ML} = \left[\sum_{i=1}^{n} \dfrac{f^2(x_i'\beta)}{F(x_i'\beta)[1 - F(x_i'\beta)]}\, x_i x_i'\right]^{-1}.$

An estimate of $V\hat\beta_{ML}$ is obtained by evaluating (2.28) at $\hat\beta_{ML}$. The relevant formulae for the probit and logit models are obtained by putting $F = \Phi$ and $f = \phi$ (the standard normal density) for the probit case, and $F = L$ and $f = L(1 - L)$ for the logit case, respectively. The logit version is used to calculate the standard deviations reported in (2.16) and (2.20).

The two most commonly used iterative methods for calculating the ML estimator are the Newton-Raphson method and the method of scoring.7 Given an initial estimate $\hat\beta_1$, the second-round estimator $\hat\beta_2$ in each method is defined as follows:

(2.29)  Newton-Raphson:  $\hat\beta_2 = \hat\beta_1 - \left[\dfrac{\partial^2 l}{\partial\beta\,\partial\beta'}\bigg|_{\hat\beta_1}\right]^{-1} \dfrac{\partial l}{\partial\beta}\bigg|_{\hat\beta_1}$

and

(2.30)  Method of Scoring:  $\hat\beta_2 = \hat\beta_1 - \left[E\,\dfrac{\partial^2 l}{\partial\beta\,\partial\beta'}\bigg|_{\hat\beta_1}\right]^{-1} \dfrac{\partial l}{\partial\beta}\bigg|_{\hat\beta_1}.$

The third-round estimator $\hat\beta_3$ is obtained by substituting $\hat\beta_2$ for $\hat\beta_1$ in the right-hand side of (2.29) and (2.30). This procedure is to be repeated until the iteration converges. Note that the two methods differ only in that in the method of scoring the expectation is taken of the matrix of second derivatives.

7 I present these methods in their basic form, but in practice they are used with various modifications designed for speedier convergence. See Goldfeld and Quandt [1972, Ch. 1] for a further discussion.

Using (2.24) and (2.27), I can rewrite (2.30) as

(2.31)  Method of Scoring:  $\hat\beta_2 = \left[\sum_{i=1}^{n} \dfrac{\hat f_i^{\,2}}{\hat F_i(1 - \hat F_i)}\, x_i x_i'\right]^{-1} \sum_{i=1}^{n} \dfrac{\hat f_i}{\hat F_i(1 - \hat F_i)}\, x_i\left[y_i - \hat F_i + \hat f_i\, x_i'\hat\beta_1\right],$

where I have defined $\hat F_i = F(x_i'\hat\beta_1)$ and $\hat f_i = f(x_i'\hat\beta_1)$.

An interesting interpretation of the iteration (2.31) is possible. From (2.4) we obtain

(2.32)  $y_i = F(x_i'\beta) + u_i,$

where $Eu_i = 0$ and $Vu_i = F(x_i'\beta)[1 - F(x_i'\beta)]$. (Note that I wrote the LP model version of this equation in (2.5).) This is a heteroscedastic nonlinear regression model. Expanding $F(x_i'\beta)$ in a Taylor series around $\beta = \hat\beta_1$ and rearranging terms, we obtain from (2.32)

(2.33)  $y_i - \hat F_i + \hat f_i\, x_i'\hat\beta_1 \cong \hat f_i\, x_i'\beta + u_i.$

Then, $\hat\beta_2$ defined in (2.31) can be interpreted as the WLS estimator of $\beta$ applied to (2.33) where $Vu_i$ is estimated by $\hat F_i(1 - \hat F_i)$. For this reason, the method of scoring iteration in the QR model is sometimes referred to as the nonlinear weighted least squares (NLWLS) iteration.

Now, I want to consider the test of a general linear hypothesis of the form

(2.34)  $Q'\beta = c,$

where $Q'$ is a $q \times K$ matrix of known constants ($K$ being the number of elements in $\beta$) and $c$ is a $q$-vector of known constants, each appropriately determined by the researcher. It is assumed that $q < K$ and the $q$ rows of $Q'$ are linearly independent. The tests I will discuss below are not specifically designed for QR models but are general tests widely applicable in many models.
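Before turning to the tests, the method-of-scoring iteration (2.30) and the variance estimate based on (2.28) can be sketched in code. This is a minimal illustration under stated assumptions, not the implementation used in any of the studies cited: the data are simulated, the "true" coefficients and the seed are arbitrary, the function names are hypothetical, and none of the convergence refinements mentioned in footnote 7 are included. The functions take $F$ and $f$ as arguments, so a probit version would only require passing $\Phi$ and $\phi$ instead of the logistic pair.

```python
import numpy as np

def logit_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_pdf(z):
    F = logit_cdf(z)
    return F * (1.0 - F)                      # f = L(1 - L) in the logit model

def score(beta, y, X, F, f):
    """Gradient of the log likelihood, equation (2.24)."""
    z = X @ beta
    Fz, fz = F(z), f(z)
    return X.T @ ((y - Fz) * fz / (Fz * (1.0 - Fz)))

def information(beta, X, F, f):
    """Minus the expected Hessian, i.e. the sum in (2.27) with the sign reversed."""
    z = X @ beta
    Fz, fz = F(z), f(z)
    w = fz ** 2 / (Fz * (1.0 - Fz))
    return (X * w[:, None]).T @ X

def fit_by_scoring(y, X, F, f, tol=1e-10, max_iter=100):
    beta = np.zeros(X.shape[1])               # initial estimate (an arbitrary choice)
    for _ in range(max_iter):
        step = np.linalg.solve(information(beta, X, F, f), score(beta, y, X, F, f))
        beta = beta + step                    # one round of (2.30)
        if np.max(np.abs(step)) < tol:
            break
    cov = np.linalg.inv(information(beta, X, F, f))   # (2.28) evaluated at the estimate
    return beta, cov

# Simulated data; the coefficient values below are hypothetical.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
true_beta = np.array([-0.5, 1.0])
y = (rng.random(n) < logit_cdf(X @ true_beta)).astype(float)

beta_hat, cov_hat = fit_by_scoring(y, X, logit_cdf, logit_pdf)
std_err = np.sqrt(np.diag(cov_hat))
print("estimates      :", beta_hat)
print("standard errors:", std_err)
```

In the logit case the second term of (2.26) vanishes identically, so Newton-Raphson and the method of scoring produce the same sequence of iterates; the two differ for probit and other choices of $F$.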


In considering the test of the null hypothesis (2.34), it is useful to distinguish between the case where $q = 1$ and the case where $q > 1$. First I will consider the case when $q = 1$. Suppose a researcher wants to use an estimator $\hat\beta$ (for example $\hat\beta_{ML}$ in our case) to test the hypothesis. Let $\hat V\hat\beta$ be a consistent estimate of the asymptotic variance-covariance matrix $V\hat\beta$ (for example $V\hat\beta_{ML}$ given in (2.28)). Then, a test of (2.34) can be performed on the basis of the following asymptotic result:

(2.35)  $\dfrac{Q'\hat\beta - c}{\sqrt{Q'\,\hat V\hat\beta\; Q}} \overset{A}{\sim} N(0, 1)$ under (2.34),

where $\overset{A}{\sim}$ reads "is asymptotically distributed as." If the alternative hypothesis specifies $Q'\beta \neq c$, the null hypothesis is rejected when the absolute value of the statistic (2.35) is greater than a certain prescribed value (corresponding to a particular significance level). If the alternative hypothesis says $Q'\beta > c$ ($Q'\beta < c$), (2.34) is rejected when the statistic (2.35) is greater (smaller) than a prescribed value.

Let us apply this procedure to test the null hypothesis that each of the coefficients in (2.16) is equal to zero against the alternative hypothesis that the coefficient is positive or negative depending on the sign of the estimated coefficient in the Domencich-McFadden example. To test the hypothesis that the i-th coefficient is 0, the appropriate choice of $Q$ and $c$ reduces the statistic (2.35) to $\hat\beta_i/\sqrt{\hat V\hat\beta_i}$, which can be calculated by dividing the parameter estimate by its standard deviation given in the parentheses. Therefore, since the critical value corresponding to the 5 percent significance level is $\pm 1.64$, we conclude that each of the coefficients except the one on (AIV - TSS) is individually significantly different from 0 (i.e., each null hypothesis is rejected at a 5 percent significance level). If we apply the same test to the coefficients in (2.20) in the Pencavel example, we find that all the coefficients are significantly different from 0 except the ones on $F \cdot L$ and $F \cdot U$.

Some authors prefer to base the test on the approximation

(2.36)  $\dfrac{Q'\hat\beta - c}{\sqrt{Q'\,\hat V\hat\beta\; Q}} \overset{A}{\sim} t_{n-K},$

where $t_{n-K}$ denotes Student's t distribution with $n - K$ degrees of freedom. For this reason, the statistic (2.36) is sometimes referred to as an asymptotic t value. Since $n - K$ is characteristically large in QR models, it does not make much practical difference whether (2.35) or (2.36) is used to determine the critical value. Theoretically, either (2.35) or (2.36) is correct, since $t_{n-K} \overset{A}{\sim} N(0, 1)$.
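The single-coefficient tests just described amount to dividing each estimate by its standard deviation and comparing the ratio with 1.64. The sketch below redoes that arithmetic for (2.16) and (2.20); the pairing of each standard deviation with its coefficient follows the order in which they are printed under the two equations, and the dictionary keys and function names are labels chosen here for illustration.

```python
CRITICAL_VALUE = 1.64   # 5 percent one-sided normal critical value quoted in the text

# (estimate, asymptotic standard deviation) pairs from (2.16)
eq_2_16 = {
    "const":   (-3.82, 0.51),
    "TW":      (0.158, 0.05),
    "AIV-TSS": (-0.382, 0.25),
    "AC-F":    (-2.56, 0.58),
    "A/W":     (4.94, 1.07),
    "R":       (-2.91, 1.37),
    "Z":       (-2.36, 1.17),
}

# (estimate, asymptotic standard error) pairs from the logit equation (2.20)
eq_2_20 = {
    "F":   (-0.327, 0.156),
    "L":   (2.305, 0.126),
    "Y":   (0.309, 0.121),
    "U":   (0.25, 0.131),
    "F*L": (0.097, 0.169),
    "F*Y": (-0.452, 0.167),
    "F*U": (-0.239, 0.175),
}

def z_ratios(table):
    """Statistic (2.35) when Q picks out a single coefficient and c = 0."""
    return {name: est / sd for name, (est, sd) in table.items()}

def not_significant(table, critical=CRITICAL_VALUE):
    """Coefficients whose ratio does not exceed the critical value in absolute size."""
    return [name for name, z in z_ratios(table).items() if abs(z) <= critical]

for label, table in (("(2.16)", eq_2_16), ("(2.20)", eq_2_20)):
    ratios = ", ".join(f"{k}: {v:.2f}" for k, v in z_ratios(table).items())
    print(label, ratios)
    print("  not significant:", not_significant(table))
```

The lists of non-significant coefficients agree with the conclusions stated above: only (AIV - TSS) in (2.16), and only $F \cdot L$ and $F \cdot U$ in (2.20).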

Now I consider the case $q > 1$. I will discuss two well-known tests: (1) Wald's test and (2) the likelihood ratio test (LRT). Wald's test can be used in connection with any estimator,8 whereas the LRT must be based on either the ML estimator or any estimator with the same asymptotic distribution. In using these tests we must always assume that the alternative hypothesis is $Q'\beta \neq c$. Though these tests are valid even when $q = 1$, the statistic (2.35) is preferred if $q = 1$ since it allows for the alternative hypothesis to be of the form $Q'\beta > c$ or $Q'\beta < c$.

8 Usually, Wald's test is defined only in connection with the ML estimator or an asymptotically equivalent estimator. Here I have broadened its definition to include an analogous test based on an arbitrary consistent and asymptotically normal estimator.

Wald's test statistic, with its asymptotic distribution under (2.34), is

(2.37)  $\text{Wald} = (Q'\hat\beta - c)'\left[Q'\,\hat V\hat\beta\; Q\right]^{-1}(Q'\hat\beta - c) \overset{A}{\sim} \chi_q^2,$




where $\chi_q^2$ denotes the chi-square distribution with $q$ degrees of freedom. The hypothesis (2.34) is to be rejected if the value of the statistic exceeds a prescribed critical value. Note that if $q = 1$, Wald is reduced to the square of the statistic (2.35).

The likelihood ratio test is defined by (with the indicated asymptotic distribution)

(2.38)  $\text{LRT} = 2\left[l(\hat\beta_{ML}) - l(\hat\beta_{CML})\right] \overset{A}{\sim} \chi_q^2,$

where $\hat\beta_{CML}$ denotes the constrained maximum likelihood (CML) estimator obtained by maximizing the log-likelihood function (2.23) with respect to $\beta$ subject to the constraint $Q'\beta = c$. The null hypothesis (2.34) is to be rejected if the value of the statistic (which is always nonnegative) exceeds a prescribed critical value.

(2) Many Observations Per Cell: We suppose that the vector $x_i$ takes $T$ distinct vector values $x_{(1)}, x_{(2)}, \ldots, x_{(T)}$ and we classify integers $(1, 2, \ldots, n)$ into $T$ disjoint sets $I_1, I_2, \ldots, I_T$ by the rule: $i \in I_t$ if $x_i = x_{(t)}$. We define $n_t$ = number of integers contained in $I_t$, $r_t = \sum_{i \in I_t} y_i$, and $\hat P_{(t)} = r_t/n_t$. As I noted at the beginning of the section, $\hat P_{(t)}$ constitute the sufficient statistics of the model. We also define $P_{(t)} = P(y_i = 1)$ for $i \in I_t$. Note that $\hat P_{(t)}$ itself would be the ML estimator of $P_{(t)}$ if $P_{(t)}$ constituted $T$ freely varying parameters. In fact, however, $P_{(t)}$ contain only $K$ freely varying parameters, namely $\beta$.

In the subsequent discussion I will write $x_{(t)}$, $P_{(t)}$, and $\hat P_{(t)}$ as $x_t$, $P_t$, and $\hat P_t$. I ask the reader to try to imagine the unseen ( ) around the subscript $t$. When $t$ takes on specific integer values, I must write them like $x_{(1)}$, $x_{(2)}$, etc. because otherwise they cannot be distinguished from $x_1$, $x_2$, etc., which represent different entities.

Note that the case in which there are many observations per cell is a subset of the cases in which there are a few observations per cell; therefore, all the results of the subsection (1) apply to this case as well. In particular, the ML estimator can be defined for the present case in exactly the same way.

From (2.4) we have

(2.39)  $P_t = F(x_t'\beta), \quad t = 1, 2, \ldots, T.$

Assuming $F$ is one-to-one, we can invert the relationship (2.39) to obtain

(2.40)  $F^{-1}(P_t) = x_t'\beta,$

where $F^{-1}$ denotes the inverse function of $F$. Expanding $F^{-1}(\hat P_t)$ in a Taylor series around $P_t$, we obtain from (2.40)

(2.41)  $F^{-1}(\hat P_t) \cong x_t'\beta + \dfrac{\partial F^{-1}}{\partial P_t}(\hat P_t - P_t) = x_t'\beta + \dfrac{1}{f[F^{-1}(P_t)]}(\hat P_t - P_t) \equiv x_t'\beta + v_t, \quad t = 1, 2, \ldots, T,$

where the last identity merely defines $v_t$.9

9 The second equality in (2.41) is obtained as follows. Differentiating both sides of (2.40) with respect to $x_t'\beta$ yields, by the chain rule, $\partial F^{-1}/\partial P_t \cdot f(x_t'\beta) = 1$. Thus, the desired result follows from writing $f(x_t'\beta) = f[F^{-1}(P_t)]$ using (2.40).

We have $Ev_t = 0$ (since $E\hat P_t = P_t$) and

(2.42)  $Vv_t = \dfrac{P_t(1 - P_t)}{n_t\, f^2[F^{-1}(P_t)]},$

where I have written $f^2[F^{-1}(P_t)]$ for $\{f[F^{-1}(P_t)]\}^2$. (Here $V$ is actually the exact variance.) Thus, (2.41) defines, approximately, a linear regression model with heteroscedastic error terms. The so-called minimum chi-square estimator $\hat\beta_{MIN}$ is defined as the WLS estimator applied to (2.41) using an estimate of $Vv_t$ obtained by substituting $\hat P_t$ for $P_t$ in (2.42): namely,

(2.43)  $\hat\beta_{MIN} = \left[\sum_{t=1}^{T} \dfrac{n_t\, f^2[F^{-1}(\hat P_t)]}{\hat P_t(1 - \hat P_t)}\, x_t x_t'\right]^{-1} \sum_{t=1}^{T} \dfrac{n_t\, f^2[F^{-1}(\hat P_t)]}{\hat P_t(1 - \hat P_t)}\, x_t\, F^{-1}(\hat P_t).$
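The WLS formula (2.43) can be sketched directly in code. The example below is an illustration under stated assumptions rather than a reproduction of any author's program: it uses the four-cell data that appear later in Table 2 (Theil's data, Example 2.8), the function names are hypothetical, and the probit inverse is taken from Python's standard-library `statistics.NormalDist`. The logit case uses $F^{-1}(P) = \log[P/(1 - P)]$ and $f = F(1 - F)$; the probit case uses $\Phi^{-1}$ and the standard normal density.

```python
import math
from statistics import NormalDist

import numpy as np

# Cell-level data as reported in Table 2: columns are constant, x1, x2.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
n_t = np.array([89.0, 83.0, 43.0, 164.0])   # observations per cell
r_t = np.array([68.0, 50.0, 14.0, 19.0])    # number of cases with y = 1 per cell

def min_chi_square(X, n_t, r_t, F_inv, f):
    """WLS estimator (2.43): regress F^{-1}(P_hat) on x with the estimated weights."""
    p_hat = r_t / n_t
    z = F_inv(p_hat)                                   # transformed frequencies
    w = n_t * f(z) ** 2 / (p_hat * (1.0 - p_hat))      # inverse of (2.42) at P_hat
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ z)

def logit_inv(p):
    return np.log(p / (1.0 - p))

def logit_pdf(z):
    F = 1.0 / (1.0 + np.exp(-z))
    return F * (1.0 - F)

_std_normal = NormalDist()

def probit_inv(p):
    return np.array([_std_normal.inv_cdf(float(v)) for v in p])

def probit_pdf(z):
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

beta_logit = min_chi_square(X, n_t, r_t, logit_inv, logit_pdf)
beta_probit = min_chi_square(X, n_t, r_t, probit_inv, probit_pdf)
print("logit  MIN chi-square:", np.round(beta_logit, 3))
print("probit MIN chi-square:", np.round(beta_probit, 3))
```

Up to rounding in the last digit, the two vectors should land close to the logit and probit columns reported for the same data in Table 3.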


Using $\operatorname{plim}_{n_t \to \infty} \hat P_t = P_t$, we can show that the asymptotic (as $n_t$ goes to infinity) distribution of $\hat\beta_{MIN}$ is the same as in the case where the true variance (2.42) is used in defining the WLS estimator. Therefore, it can be shown that $\hat\beta_{MIN}$ is consistent and asymptotically normal with the asymptotic variance-covariance matrix given by

(2.44)  $V\hat\beta_{MIN} = \left[\sum_{t=1}^{T} \dfrac{n_t\, f^2(x_t'\beta)}{P_t(1 - P_t)}\, x_t x_t'\right]^{-1}.$

But this is clearly identical to (2.28), meaning that $\hat\beta_{MIN}$ has the same asymptotic distribution as $\hat\beta_{ML}$. An estimate of $V\hat\beta_{MIN}$ can be obtained either by replacing $P_t$ with $\hat P_t$ or with $F(x_t'\hat\beta_{MIN})$ in (2.44). The latter method seems preferable. Either the asymptotic normal test based on (2.35) or Wald's asymptotic chi-square test based on (2.37) can be used in conjunction with $\hat\beta_{MIN}$.

The first line of (2.41) is an approximate equality since $\partial F^{-1}/\partial P_t$ is evaluated at $P_t$. Since it would be an exact equality if $\partial F^{-1}/\partial P_t$ were evaluated somewhere between $\hat P_t$ and $P_t$, the error of the approximation becomes smaller as $|\hat P_t - P_t|$ becomes smaller. For this reason, the minimum chi-square estimator (2.43) can be interpreted as a WLS estimator only when $n_t$ is large.

If $\hat P_t = 0$ or 1 for some $t$, (2.43) clearly cannot be defined. Even more essentially, $F^{-1}(\hat P_t)$ itself may not be defined for $\hat P_t = 0$ or 1 for certain cases such as $F = \Phi$ or $L$. The probability of $\hat P_t = 0$ or 1 diminishes as $n_t$ gets larger, provided that $P_t \neq 0$ or 1, because $\operatorname{plim}_{n_t \to \infty} \hat P_t = P_t$.

In the probit model, $F^{-1}(\hat P_t) = \Phi^{-1}(\hat P_t)$. Though $\Phi^{-1}$ does not have an explicit form, it can be easily evaluated numerically. $\Phi^{-1}(\cdot)$ is called the probit transformation.10 In the logit model, we can explicitly write

(2.45)  $L^{-1}(\hat P_t) = \log\dfrac{\hat P_t}{1 - \hat P_t},$

which is called the logit transformation or the logarithm of the odds ratio.

10 Originally, $\Phi^{-1}(\hat P_t) + 5$ was called the probit (Finney, 1971, p. 23) and $\Phi^{-1}(\hat P_t)$ was called the normit (Berkson, 1957). Nowadays, the word probit is used to denote a QR model where the probability function is a standard normal distribution function. Similarly, the word logit used to mean the transformation (2.45) but now means a QR model with a logistic distribution.

Cox [1970, p. 33] mentions the following modification of (2.45):

(2.46)  $L_c^{-1}(\hat P_t) = \log\dfrac{\hat P_t + (2n_t)^{-1}}{1 - \hat P_t + (2n_t)^{-1}}.$

This modification has two advantages over (2.45): (1) The transformation (2.46) can always be defined whereas (2.45) cannot be defined if $\hat P_t = 0$ or 1. (Nevertheless, it is not advisable to use Cox's modification when $n_t$ is small.) (2) It can be shown that $E L_c^{-1}(\hat P_t) - L^{-1}(P_t)$ is of the order of $n_t^{-2}$ whereas $E L^{-1}(\hat P_t) - L^{-1}(P_t)$ is of the order of $n_t^{-1}$. If $n_t$ is large, however, (2.45) and (2.46) do not differ by much.

I will give two examples of the use of the logit MIN $\chi^2$ estimator in econometric applications.

Example 2.7 (Li, 1977.) Li estimates a logit model which explains the probability of a family owning a home. He uses 1970 census data on 544,781 husband-wife families from Boston SMSA and 420,264 families from Baltimore SMSA. (Models are estimated separately for each region.) The basic independent variables consist of the following twelve variables: a constant term, four age dummies, three income dummies, three family-size dummies, and a race dummy. Before proceeding further, I will explain the definition of these dummies by considering the income dummies in particular. Let $Y$ be the income of a person measured to the closest dollar. Divide the positive half-line into four income classes and define the following four (0,1) variables: $Y_1 = 1$ if $0 < Y \le 4{,}999$, $Y_2 = 1$ if


TABLE 2
THEIL'S DATA [1971, pp. 633-35]

t    Const.   x1   x2   n_t   r_t   $\hat P_t$ (= r_t/n_t)
1    1        0    0     89    68   0.764
2    1        0    1     83    50   0.602
3    1        1    0     43    14   0.326
4    1        1    1    164    19   0.116

$5{,}000 \le Y \le 9{,}999$, $Y_3 = 1$ if $10{,}000 \le Y \le 14{,}999$, and $Y_4 = 1$ if $15{,}000 \le Y$. Now, we cannot use all the four dummy variables if a constant term is included because $Y_1 + Y_2 + Y_3 + Y_4 = 1$ for every person. Therefore, one of the four dummies must be dropped, and Li drops $Y_1$.

In the above formulation, one estimates three parameters $\alpha_2$, $\alpha_3$, and $\alpha_4$ (the coefficients on $Y_2$, $Y_3$, and $Y_4$) to measure the effect of income on the probability of owning a house. It may sometimes be advisable to reduce the number of parameters by the following procedure: Let $\mu_2$, $\mu_3$, and $\mu_4$ be the mean income in each of the respective income classes. Then, use a single income variable $\mu_2 Y_2 + \mu_3 Y_3 + \mu_4 Y_4$ instead of the three. This practice can be regarded as a standard linear hypothesis on $\alpha_2$, $\alpha_3$, and $\alpha_4$, which can be written as $\mu_2\alpha_3 = \mu_3\alpha_2$ and $\mu_2\alpha_4 = \mu_4\alpha_2$.

Li estimates many models: basic models where none or some of the dummy variables are dropped, models with the aforementioned linear hypothesis imposed on certain dummy variables, and models in which the products of certain pairs of the independent variables are also included. For each model Li calculates the value of WSSR (the weighted sum of squared residuals), defined by and asymptotically distributed as

(2.47)  $\text{WSSR} = \sum_{t=1}^{T} n_t \hat P_t(1 - \hat P_t)\left[\log\dfrac{\hat P_t}{1 - \hat P_t} - x_t'\hat\beta_{MIN}\right]^2 \overset{A}{\sim} \chi_{T-K}^2.$
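A minimal sketch of the computation in (2.47) follows. Li's census data are not reproduced in this survey, so the sketch uses, purely for concreteness, the four cells of Table 2 (Theil's data, discussed in Example 2.8); the variable names are hypothetical, and the weighted least squares fit is obtained with `numpy.linalg.lstsq` on data scaled by the square roots of the weights, which is one of several equivalent ways to compute the logit MIN $\chi^2$ estimator.

```python
import numpy as np

# Cell-level data as reported in Table 2: columns are constant, x1, x2.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
n_t = np.array([89.0, 83.0, 43.0, 164.0])
r_t = np.array([68.0, 50.0, 14.0, 19.0])

p_hat = r_t / n_t
logit = np.log(p_hat / (1.0 - p_hat))       # logit transformation (2.45)
w = n_t * p_hat * (1.0 - p_hat)             # weights appearing in (2.47)

# Weighted least squares via ordinary least squares on sqrt(w)-scaled data.
sqrt_w = np.sqrt(w)
beta_min, *_ = np.linalg.lstsq(X * sqrt_w[:, None], logit * sqrt_w, rcond=None)

resid = logit - X @ beta_min
wssr = float(np.sum(w * resid ** 2))        # statistic (2.47)
dof = X.shape[0] - X.shape[1]               # T - K

CHI2_1DF_5PCT = 3.84                        # standard 5 percent point of chi-square(1)
print(f"WSSR = {wssr:.3f} with {dof} degree(s) of freedom")
print("exceeds 5 percent critical value:", wssr > CHI2_1DF_5PCT)
```

With four cells and three parameters there is a single degree of freedom, so the statistic is compared with the chi-square distribution with one degree of freedom; a value below the critical point gives no evidence against the fitted specification.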

Suppose Model 1 is nested in Model 2 (that is to say, Model 1 is obtained from Model 2 by imposing a linear hypothesis such as the one mentioned in the preceding paragraph). Then, it can be shown that

(2.48)  $\text{WSSR}_1 - \text{WSSR}_2 \overset{A}{\sim} \chi_q^2,$

where $q$ is the number of linear constraints. The linear hypothesis is to be rejected (equivalently, Model 2 is chosen over Model 1) when the value of (2.48) is greater than a prescribed value. The statistic (2.48) is actually identical to the Wald statistic (2.37) if it is defined using the logit MIN $\chi^2$ estimator.

Li clearly has a case of many observations per cell. Since the number of cells is at most 320 in his models, he has on the average 1,702 observations per cell for the Boston data and 1,313 observations per cell for the Baltimore data.

Example 2.8 (Theil, 1971, pp. 633-35.) Theil reports a logit model which explains how a firm revises its production plan when it learns its orders and inventories. The dependent variable and the two independent variables are defined as follows:

y = 1 if the production plan revision is positive, 0 if negative
x1 = 1 if the surprise on orders received is negative, 0 if positive
x2 = 1 if inventories are considered too large, 0 if too small

Theil's data are summarized in Table 2. The regression equation (2.41) can be written in this case as



(2.49)  $\begin{bmatrix} \log(68/21) \\ \log(50/33) \\ \log(14/29) \\ \log(19/145) \end{bmatrix} \cong \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}.$

TABLE 3
A COMPARISON OF FOUR ESTIMATORS IN THEIL'S MODEL

                        $\hat P_t$   Logit    Probit   LP-LS    LP-WLS
Parameters      β1                   1.296    0.779    0.774    0.773
                β2                  -2.232   -1.346   -0.470   -0.470
                β3                  -0.977   -0.576   -0.183   -0.184
Probabilities   P1      0.764        0.785    0.782    0.774    0.773
                P2      0.602        0.579    0.580    0.591    0.589
                P3      0.326        0.282    0.285    0.304    0.303
                P4      0.116        0.129    0.127    0.121    0.119

Theil reports the MIN $\chi^2$ estimates of the $\beta$'s. I have also calculated the probit MIN $\chi^2$ estimator, the LP-LS estimator, and the LP-WLS estimator.11 All the calculations were done by a programmable pocket calculator. The values of each estimator and the corresponding probability estimates are given in Table 3. The estimates of the probabilities obtained by the four estimators are similar to each other and to the observed frequencies $\hat P_t$. In Section C(3) I will come back to the question of measuring the "closeness" between the observed frequencies and the estimated probabilities.

11 The LP-WLS estimator is obtained by the following procedure: (1) Obtain the LS estimator $\hat\beta_{LS}$ of $\beta$ in (2.5). (2) Estimate $Vu_i$ by $x_i'\hat\beta_{LS}(1 - x_i'\hat\beta_{LS})$. (3) Using the estimate of $Vu_i$, apply the WLS estimator to (2.5).

(3) Modified MIN $\chi^2$ Estimator: In deriving the regression equation (2.41), we assumed that (2.39) was the true model. However, this represents an optimistic attitude on the part of a researcher. In practice, a researcher using a MIN $\chi^2$ estimator cannot be sure that a particular transformation he decides to use is precisely equal to the inverse of the true probability function. This pessimistic (or, realistic) attitude calls for the following modification of the MIN $\chi^2$ estimator. Suppose the true model (unknown to a researcher) is given by

(2.50)  $P_t = G(x_t'\beta), \quad t = 1, 2, \ldots, T,$

but the researcher decides to use $F^{-1}(\hat P_t)$ as the dependent variable of his regression. Then, instead of (2.41), we now have

(2.51)  $F^{-1}(\hat P_t) \cong x_t'\beta + v_t + w_t,$

where $v_t$ is as defined before and $w_t = F^{-1}[G(x_t'\beta)] - x_t'\beta$. In other words, $w_t$ is the error resulting from a possible misspecification of the probability function. Amemiya and Nold [1975] suggested that $\{w_t\}$ be treated as i.i.d. random variables with zero mean and variance $\sigma^2$ and proposed the WLS estimator applied to (2.51) using an estimate of $\sigma^2$ and an estimate of $Vv_t$ given in (2.42). Though the i.i.d. assumption for $w_t$ is admittedly an oversimplification, it seems at least better than totally ignoring $w_t$. They proposed this method in the context of a logit model but it can, of course, be used with regard to any other QR model. They applied the method to a simple model of household durable-goods purchase and found that though the standard MIN $\chi^2$ and the modified MIN $\chi^2$ estimators do not differ much in actual estimates, the former tends to underestimate the variance of a regression coefficient considerably. The Amemiya-Nold modified logit MIN $\chi^2$ estimator has been applied by Parks [1977] in a study of automobile scrapping rates and by Medoff [1979] in a study of layoff rates.

Many researchers have linearly fitted the observed frequencies $\hat P_t$ on independent variables by the standard least squares method. This simple procedure can be defended in the light of the above-mentioned idea of Amemiya and Nold as follows: In the case of an LP model, (2.51) becomes

(2.52)  $\hat P_t = x_t'\beta + v_t + w_t.$

Though, ideally, one should apply the WLS estimator to (2.52), the LS estimator will be approximately efficient if $Vw_t \gg Vv_t$ since $w_t$ is assumed to be homoscedastic. In all the three applications of the Amemiya-Nold method mentioned above, $Vw_t$ was indeed found to be much larger than $Vv_t$.

C. Choice of Models

(1) Probit vs. Logit: I have already noted that the probit and logit models usually give similar results and it is difficult to distinguish them statistically. Chambers and Cox [1967] devised a test to distinguish the two models which can be used only when there is a single independent variable which takes on three values, with many observations on $y$ for each value of the independent variable. Even for such a specialized case, it required an exceedingly large number of observations for a test to distinguish the two models effectively. More generally, one can use the so-called Cox test of separate families of hypotheses [1961 and 1962], which is a test generally applicable to the situation where one of the two competing models cannot be nested in the other. However, Cox's test will also require a very large number of observations.

(2) Alternative Models: So far I have considered three types of QR models: LP, probit, and logit. Other kinds of QR models have also been suggested in the statistical literature. I will merely mention a few references without going into the details. These models have been seldom used in economic applications. This may be partly explained by the fact that economists usually experiment with various specifications of $H(x, \theta)$. When this is done, the importance of having the right $F$ is diminished, as one can see from (2.2). In fact, if the researcher is willing to vary $H$ freely, he can use any distribution function (say logit) for $F$. When a logistic function is used for $F$, this kind of model is called a "universal logit" model (or, more colorfully, "mother logit") by McFadden [1977].

Two kinds of general distribution functions, which contain the standard normal and logistic distribution as special cases, have been suggested: a kind which contains skewed distributions and a kind which contains distributions with varying degrees of kurtosis (the heaviness of the tails). The suggestion of Prentice [1976] belongs to the first kind, whereas those of Copenhaver and Mielke [1977] and Van Montfort and Otten [1976] belong to the second. Since these models contain (at least approximately) probit and logit models as special cases, they have the merit of enabling a researcher to test probit versus logit by a classical testing procedure.

(3) Scalar Criteria: By scalar criteria I mean measures such as $R^2$ in the standard regression model, which can be conveniently used as an aid for choosing among competing models. (By a model I mean a particular specification of $F$ as well as of the explanatory variables which appear in its argument.) A blind reliance on any scalar criterion is, of course, ill-advised; many economic-theoretic and statistical factors must be simultaneously taken into consideration when one chooses a model. Nevertheless, a simple index which measures the desirability of


a model can be a useful tool if it is judiciously applied. I will present several scalar criteria proposed by various authors for the purpose of choosing among QR models. I will define and appraise the merits of each. But, first, I want to make a few general remarks.

Remark 1: Criteria are like loss functions and therefore we should not expect to find a single criterion which is optimal for every occasion. We can talk about general guidelines or axioms we should follow when we construct criteria, but, beyond that, we cannot hope to make definitive statements concerning the relative merits of various criteria. Every criterion I define below will satisfy the minimum degree of reasonableness and cannot be discarded a priori. A sensible strategy would be to select two or three criteria and compare the results.

Remark 2: The criteria presented below are not normalized to range between 0 and 1 like $R^2$. Though I will mention certain normalizations proposed by some authors with respect to some of the criteria, normalization is not our main concern here, for a comparison of models can be done just as well without normalization. A normalization would be beneficial if a majority of researchers used the same criterion and the same normalization, like $R^2$ in the standard regression model. But this is an unlikely event for QR models.

Remark 3: $R^2$ in the standard regression model, aside from its value as an aid in choosing among models, conveys an important piece of information: the proportion of the variance of the dependent variable explained by the independent variables. Many of the criteria listed below do not have this property even after an appropriate normalization. But I do not wish to emphasize this aspect of $R^2$ in my evaluation of various criteria, for my major concern is model selection.

Remark 4: $R^2$ in the standard regression


model is closely related to a test of the hypothesis that all the regression coefficients except the constant term are zeros.12 Some of the criteria below are related to this test and some are not. Again, I should point out that this is not my major concern.

12 The appropriate test statistic for testing this hypothesis is $(n - K)(K - 1)^{-1} R^2 (1 - R^2)^{-1}$, where n is the sample size and K is the number of regression coefficients. The statistic is distributed as F(K - 1, n - K).

Remark 5: I have saved the most important remark for the last. Whatever criterion one decides to use, one must make some adjustment for the degrees of freedom if one is to compare models with different degrees of freedom (the sample size minus the number of the parameters to be estimated). Such corrections of $R^2$ in the standard regression model are well known (Amemiya, 1980). Finding the right correction is as difficult and controversial as finding the right criterion itself. But we will see that some criteria are more amenable to correction than others, and this is a desirable attribute for criteria to have.

The following is a list of scalar criteria with brief comments on each. With regard to each of these criteria except the squared correlation coefficient and the log likelihood function, one is to choose the model for which the value of the criterion is smallest. In the following passages, $\hat{F}_i$ denotes $F(x_i'\hat{\beta})$ where $\hat{\beta}$ is whatever estimator is being used.

Number of Wrong Predictions:

(2.53)  $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,

where $\hat{y}_i = 1$ if $\hat{F}_i \geq 1/2$ and $\hat{y}_i = 0$ if $\hat{F}_i < 1/2$. This gives the number of wrong predictions because $(y_i - \hat{y}_i)^2 = 1$ if and only if $y_i \neq \hat{y}_i$. This criterion is frequently used in discriminant analysis, which will be discussed in Section D. This criterion is appropriate when one has an all-or-nothing loss function. Thus, when an event $y_i = 1$ takes place, a person who estimated its probability to be 0.49 and a person who estimated it to be 0 are equally penalized. A major disadvantage is that if we are dealing with an event which happens with a high probability (e.g., a man working) or a low probability (e.g., a person immigrating), most models will do well by this criterion.

Sum of Squared Residuals (SSR):

(2.54)  $\sum_{i=1}^{n} (y_i - \hat{F}_i)^2$.

This criterion does not suffer from the deficiency of the number of wrong predictions. This is a natural criterion, since it corresponds to the sum of squared residuals in the standard regression model, from which $R^2$ is derived. However, its use in QR models cannot be defended as strongly as in the standard regression model because a QR model is essentially a heteroscedastic regression model. Efron [1978] defends (2.54) from a certain axiomatic point of view and suggests an analogue of $R^2$ defined by

(2.55)  Efron's $R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{F}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$,

where $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$.

Lave [1970] uses the above $R^2$ ("corrected for degrees of freedom") in his study of a probit modal choice model estimated by the ML method. Lave presumably used the same correction as Theil's corrected $R^2$ defined by $\bar{R}^2 = 1 - n(n - K)^{-1}(1 - R^2)$, where n is the sample size and K is the number of unknown parameters. However, even if Theil's $\bar{R}^2$ were optimal in the standard regression model (see Amemiya, 1980, for criticism of it), its use in a heteroscedastic regression model would be questionable. Lave's corrected $R^2$ ranged from 0.279 to 0.541. We do not have enough experience with the $R^2$ in QR models to determine whether these values are too low or high enough. Morrison [1972] offers an explanation as to why $R^2$ can be sometimes very low in a QR model, but a counter argument is given by Goldberger [1973].

SSR Weighted by Estimated Probabilities:

(2.56)  $\sum_{i=1}^{n} \dfrac{(y_i - \hat{F}_i)^2}{\hat{F}_i (1 - \hat{F}_i)}$.

There are two reasons for preferring this criterion to the unweighted SSR. First, if the true probabilities were known and used in the denominator instead of the estimated probabilities, the minimization of the above criterion with respect to $\beta$ yields the estimator of $\beta$ which is asymptotically more efficient than that obtained by the minimization of the unweighted SSR. Second, it seems reasonable to attach a higher loss to the error made in predicting a random variable with a smaller variance since such a random variable should be easier to predict than the one with a larger variance. Thus, it seems reasonable to weigh the squared error by a weight which is inversely proportional to the variance.

Squared Correlation Coefficient:

(2.57)  $\dfrac{\left[\sum_{i=1}^{n} (y_i - \bar{y}) \hat{F}_i\right]^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2 \; \sum_{i=1}^{n} (\hat{F}_i - \bar{F})^2}$.

This measure is closely related to the unweighted SSR (2.54). In the standard regression model, this is identical to Efron's $R^2$ (2.55). Though this identity does not hold in QR models, the same criticism applies to the squared correlation coefficient as to SSR.
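The criteria (2.53) through (2.57) depend only on the observed outcomes and the fitted probabilities, so they can be evaluated together. The following is a minimal sketch in Python with NumPy; the function name `scalar_criteria` and the five data points are hypothetical and serve only as an illustration. It assumes every fitted probability lies strictly between 0 and 1, since (2.56) is undefined otherwise, and it takes $\bar{F}$ in (2.57) to be the sample mean of the fitted probabilities.

```python
import numpy as np

def scalar_criteria(y, F_hat):
    """Evaluate criteria (2.53)-(2.57) for binary y and fitted probabilities F_hat."""
    y = np.asarray(y, dtype=float)
    F = np.asarray(F_hat, dtype=float)
    y_pred = (F >= 0.5).astype(float)      # y_hat_i = 1 if F_hat_i >= 1/2, else 0
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares of y
    ssr = np.sum((y - F) ** 2)             # unweighted sum of squared residuals
    return {
        "wrong_predictions": float(np.sum((y - y_pred) ** 2)),          # (2.53)
        "ssr": float(ssr),                                              # (2.54)
        "efron_r2": float(1.0 - ssr / tss),                             # (2.55)
        "weighted_ssr": float(np.sum((y - F) ** 2 / (F * (1.0 - F)))),  # (2.56)
        "squared_corr": float(np.sum((y - y.mean()) * F) ** 2
                              / (tss * np.sum((F - F.mean()) ** 2))),   # (2.57)
    }

# Hypothetical data, not taken from the text.
y_example = [1, 0, 1, 1, 0]
F_example = [0.8, 0.3, 0.6, 0.4, 0.1]
criteria = scalar_criteria(y_example, F_example)
```

In this toy sample the fourth observation is the only one misclassified, so the number of wrong predictions is 1, while the squared-residual criteria also register how far each fitted probability lies from the outcome.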


Log Likelihood Function:

(2.58)  $l = \sum_{i=1}^{n} [y_i \log \hat{F}_i + (1 - y_i) \log (1 - \hat{F}_i)]$,

where $\hat{F}_i$ here specifically denotes $F(x_i'\hat{\beta}_{ML})$. This measure has an obvious intuitive appeal. In addition, it is especially suitable for comparison of models with different numbers of parameters. Suppose we want to choose between the unconstrained model in which the K-component parameter vector $\beta$ is allowed to vary freely and the constrained model in which $\beta$ is subject to q linear constraints $Q'\beta = c$ (cf. Section B(1)). Then, because of (2.38), one should choose the unconstrained model if and only if $2[l(\hat{\beta}_{ML}) - l(\hat{\beta}_{CML})]$ is greater than the $\alpha$% critical value of $\chi_q^2$. This consideration gives us a natural way to adjust for the degrees of freedom. (Though this may seem to be begging the question since one must still determine $\alpha$, the reduction of the problem of the optimal adjustment for the degrees of freedom to the more familiar problem of the optimal determination of $\alpha$ is helpful.) If the two competing models are not nested as in the above example, the procedure described above does not work. But in such a case Cox's test of separate families of hypotheses mentioned earlier can be used as a substitute and offers a natural adjustment for the degrees of freedom. However, I prefer a much simpler formula for the adjustment for degrees of freedom which has been proposed by Akaike [1973] on an entirely different principle. The Akaike Information Criterion (AIC) is defined by

(2.59)  AIC $= -l + K$,

where K is the number of parameters to be estimated. One is to choose the model for which AIC is smallest.13

13 Many other adjustments for degrees of freedom have been proposed. See Amemiya [1980] for a comparison of AIC and other criteria in the standard regression model. Since no single one of them dominates the others, I prefer AIC for its simplicity. Though by no means a panacea, it is certainly preferable to ignoring the degrees of freedom adjustment entirely.

A normalization of $l$ to produce an $R^2$-like quantity was suggested by McFadden [1974]. It is defined by

(2.60)  McFadden's $R^2 = 1 - \dfrac{l(\hat{\beta}_{ML})}{l_0}$,

where $l_0$ is the maximum of $l$ subject to the constraint that all the regression coefficients except the constant term are zeros.

The next two criteria can be used only when there are many observations per cell.

WSSR Associated with the MIN $\chi^2$ Method:

(2.61)  $\sum_{t=1}^{T} \hat{\sigma}_t^{-2} [F^{-1}(\hat{P}_t) - x_t'\hat{\beta}_{MIN}]^2$,

where $\hat{\sigma}_t^2$ is the estimated variance of the error term in (2.41). Recall that the MIN $\chi^2$ estimator $\hat{\beta}_{MIN}$ was defined as the WLS estimator applied to the heteroscedastic regression model (2.41). Therefore, (2.61) is the WSSR associated with (2.41). To derive its asymptotic distribution, it is useful to consider a normal heteroscedastic regression model

(2.62)  $y = X\beta + v$,

where $v \sim N(0, D)$ with D being a known diagonal matrix. Defining $\hat{\beta} = (X'D^{-1}X)^{-1}X'D^{-1}y$, we have

(2.63)  $(y - X\hat{\beta})'D^{-1}(y - X\hat{\beta}) \sim \chi_{T-K}^2$,

where T and K are the dimensions of the vectors y and $\beta$ respectively. Since (2.41) asymptotically possesses the same characteristics as (2.62), (2.61) is asymptotically $\chi_{T-K}^2$. One can use this fact to choose between the unconstrained model $P_t = F(x_t'\beta)$ and the constrained model $P_t = F(x_{1t}'\beta_1)$, where $x_{1t}$ and $\beta_1$ are the first $K - q$ elements of $x_t$ and $\beta$ respectively. Let WSSR$_U$ and WSSR$_C$ be the values of (2.61) derived from the unconstrained and the constrained models respectively. Then, one should choose the unconstrained model if and only if

(2.64)  WSSR$_C$ $-$ WSSR$_U$ $> \chi_{q,\alpha}^2$,

where $\chi_{q,\alpha}^2$ denotes the $\alpha$% critical value of $\chi_q^2$.
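The likelihood-based criteria (2.58) to (2.60) can be sketched in a few lines of Python. The function names are hypothetical. The sketch also assumes that the model contains a constant term and that the constrained ML fit with only a constant sets every fitted probability equal to the sample proportion of ones, which is how $l_0$ is computed here; that shortcut holds for probit and logit ML but is stated as an assumption rather than taken from the text.

```python
import numpy as np

def log_likelihood(y, F_hat):
    """Log likelihood (2.58) of binary y given fitted probabilities F_hat."""
    y = np.asarray(y, dtype=float)
    F = np.asarray(F_hat, dtype=float)
    return float(np.sum(y * np.log(F) + (1.0 - y) * np.log(1.0 - F)))

def aic(loglik, K):
    """Akaike Information Criterion in the form (2.59): smaller is better."""
    return -loglik + K

def mcfadden_r2(y, F_hat):
    """McFadden's R^2 (2.60); l_0 uses the sample mean of y (assumption, see lead-in)."""
    y = np.asarray(y, dtype=float)
    l0 = log_likelihood(y, np.full(y.shape, y.mean()))
    return 1.0 - log_likelihood(y, F_hat) / l0
```

Because (2.59) adds K to $-l$, a model with one more parameter must raise the log likelihood by more than 1 to be preferred under this form of AIC.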

In the model (2.62), the quadratic form (2.63) is equal to $-2l$, aside from a constant term. Therefore, we can write the Akaike Information Criterion (2.59) as

(2.65)  AIC $= \frac{1}{2}$ WSSR $+ K$,

which can be used to choose among nonnested models.

When F = L, the criterion (2.61) becomes (2.47), which Li [1977] used to compare nested models. Li could have used (2.65) to compare nonnested models as well. A normalization of (2.61), which produces an $R^2$-type quantity, is suggested by an analogy between (2.41) and (2.62). Buse [1973] proposed the following $R^2$ to be used for the model (2.62).

(2.66)  Buse's $R^2 = 1 - \dfrac{\text{WSSR}_U}{\text{WSSR}_C}$,

where WSSR$_U$ is what is given in (2.63) and

(2.67)  WSSR$_C$ $= y'D^{-1}y - \dfrac{(y'D^{-1}\mathbf{l})^2}{\mathbf{l}'D^{-1}\mathbf{l}}$,

where $\mathbf{l}$ is the T-vector of ones. The main justification for Buse's definition is that, as in the standard regression model, $(T - K)(K - 1)^{-1}R^2(1 - R^2)^{-1}$ can be used to test the hypothesis that all the regression coefficients except a constant term are zeros. Parks [1977] used Buse's $R^2$ in his logit model. Finally I should point out that all the preceding asymptotic results are valid if we change (2.61) to

(2.68)  $\sum_{t=1}^{T} \dfrac{n_t f^2(x_t'\hat{\beta}) [F^{-1}(\hat{P}_t) - x_t'\hat{\beta}]^2}{F(x_t'\hat{\beta}) [1 - F(x_t'\hat{\beta})]}$,

where I have written $\hat{\beta}_{MIN}$ simply as $\hat{\beta}$.

SSR Weighted by Observed Frequencies:

(2.69)  $\sum_{t=1}^{T} \dfrac{n_t (\hat{P}_t - \hat{F}_t)^2}{\hat{P}_t (1 - \hat{P}_t)}$.

This criterion is asymptotically equivalent to the criterion (2.61); therefore, all the results concerning (2.61) apply to (2.69) as well. One may also change the $\hat{P}_t$ appearing in the denominator of (2.69) to $\hat{F}_t$.

This criterion is analogous to SSR weighted by estimated probabilities defined by (2.56). To see this, observe

(2.70)  $\sum_{t=1}^{T} \sum_{i \in I_t} \dfrac{(y_i - \hat{F}_t)^2}{\hat{P}_t (1 - \hat{P}_t)} = \sum_{t=1}^{T} \dfrac{r_t (1 - \hat{F}_t)^2 + (n_t - r_t) \hat{F}_t^2}{\hat{P}_t (1 - \hat{P}_t)} = \sum_{t=1}^{T} \dfrac{n_t (\hat{P}_t - \hat{F}_t)^2}{\hat{P}_t (1 - \hat{P}_t)} + n$.
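The second equality in (2.70) follows from $r_t = n_t \hat{P}_t$, which gives $r_t (1 - \hat{F}_t)^2 + (n_t - r_t) \hat{F}_t^2 = n_t (\hat{P}_t - \hat{F}_t)^2 + n_t \hat{P}_t (1 - \hat{P}_t)$; dividing by $\hat{P}_t (1 - \hat{P}_t)$ and summing over t leaves the extra term $\sum_t n_t = n$. A small numerical check in Python, with hypothetical cell counts and fitted probabilities chosen only for illustration:

```python
import numpy as np

# Hypothetical grouped data: n_t observations and r_t ones in each of three cells.
n_t = np.array([10.0, 20.0, 15.0])
r_t = np.array([3.0, 12.0, 9.0])
P_hat = r_t / n_t                       # observed frequencies
F_hat = np.array([0.35, 0.55, 0.65])    # arbitrary fitted probabilities

weights = P_hat * (1.0 - P_hat)

# Middle expression of (2.70): squared errors summed within each cell.
middle = np.sum((r_t * (1.0 - F_hat) ** 2 + (n_t - r_t) * F_hat ** 2) / weights)

# Right-hand side of (2.70): criterion (2.69) plus the total sample size n.
wssr_269 = np.sum(n_t * (P_hat - F_hat) ** 2 / weights)
right = wssr_269 + n_t.sum()
```

The two quantities agree up to rounding error, so ranking models by (2.69) is the same as ranking them by the left-hand side of (2.70).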

Thus, (2.69) is equivalent to the left-hand side of (2.70), which in turn is analogous to (2.56).

I have calculated the values of most of the criteria defined in this section for the four estimators of Theil's model (Example 2.8) described at the end of Section B(2). Table 4 below gives these values, as well as the ranking of the four estimators according to each criterion indicated by circled numbers. I have recorded the values of the criteria for the observed frequencies $\hat{P}_t$ as well. The values for $\hat{P}_t$ are, as one would expect, better than those for any other estimator. However, one should note that four parameters are estimated in $\hat{P}_t$ whereas three parameters are estimated in any other model. When one adjusts for this difference in the degrees of freedom using either (2.59) or (2.65), one notes that $\hat{P}_t$ is actually inferior to the other estimators. (These adjustments mean adding 2 to 0 and adding -1 to -190.426 in the column for $\hat{P}_t$.) The rankings are the same except for the fact that according to the unweighted sum of squared residuals and the correlation coefficient, LS is preferred to WLS. The possibility of this phenomenon was explained earlier. The fact that the ranking according to the weighted sum of squared residuals coincided with that according to $l$ gives an added support in favor of this criterion over the unweighted SSR.

TABLE 4
THE VALUES OF SEVERAL CRITERIA FOR THE FOUR ESTIMATORS IN THEIL'S MODEL

                                            Estimators
Criteria              $\hat{P}_t$   Logit          Probit         WLS            LS
SSR (2.54)            62.165        62.359 ④       62.326 ③       62.211 ②       62.209 ①
Corr. Coef. (2.57)    .315656       .313516 ④      .313879 ③      .315172 ②      .315173 ①
$l$ (2.58)            -190.426      -190.879 ④     -190.778 ③     -190.456 ①     -190.460 ②
WSSR (2.69)           0             1.050 ④        .850 ③         .216 ①         .226 ②

D. Discriminant Analysis

(1) Introduction: In this section I will consider the problem of discrimination (or classification) and its relationship to QR models.14 The problem is to measure the characteristics of an individual or an object and on the basis of the measurements classify the individual or the object into one of the two possible groups. For example, accept or reject a college applicant on the basis of examination scores, or determine whether a particular skull belongs to a man or an anthropoid on the basis of its measurements.

14 For a thorough discussion of discriminant analysis, the reader should refer to Anderson [1958, Ch. 6].

We can state the problem statistically as follows. Supposing that a vector of random variables $x^*$ is generated either according to a density $g_1$ or $g_0$, we are to classify a given observation on $x^*$, denoted $x_i^*$, into the group characterized by either $g_1$ or $g_0$. It is useful to define $y_i = 1$ if $x_i^*$ is generated by $g_1$ and $y_i = 0$ if it is generated by $g_0$. We will define the decision variable $\hat{y}_i$ by the rule $\hat{y}_i = 1$ if $x_i^*$ is classified into the group characterized by $g_1$ and $\hat{y}_i = 0$ if it is classified into $g_0$. We will treat our problem as a standard problem of decision under uncertainty in which $y_i$ represents the state of nature and $\hat{y}_i$ the decision variable. I will assume that the loss matrix is given by

                      $y_i = 1$     $y_i = 0$
$\hat{y}_i = 1$       0             $L_{10}$
$\hat{y}_i = 0$       $L_{01}$      0

The decision variable $\hat{y}_i$ should be determined according to some strategy. I will adopt the Bayesian strategy because my major purpose is to point out a relationship between discriminant analysis and QR models, and for this purpose the Bayesian framework is most convenient. A classical statistician would not regard $y_i$ as a random variable, since it is either 1 or 0, rather than being sometimes 1 and sometimes 0 like the outcome of a toss of a coin. A Bayesian treats $y_i$ as a random variable and solves the problem as follows:


(2.71)  $\hat{y}_i = \begin{cases} 1 & \text{if } P(y_i = 0 \mid x_i^*) L_{10} < P(y_i = 1 \mid x_i^*) L_{01} \\ 0 & \text{otherwise,} \end{cases}$

where $P(y_i = 1 \mid x_i^*)$, called the posterior probability of $y_i = 1$, is determined by Bayes' rule

(2.72)  $P(y_i = 1 \mid x_i^*) = \dfrac{g_1(x_i^*) q_1}{g_1(x_i^*) q_1 + g_0(x_i^*) q_0}$.

In the above, $q_1$ and $q_0$ signify the so-called prior probabilities of $y_i = 1$ and $y_i = 0$ respectively. Now, note that (2.72) is clearly a QR model. The literature on discriminant analysis is usually more concerned with the determination of $\hat{y}_i$ (which is the act of discrimination or classification) than the estimation of the parameters which characterize the right-hand side of (2.72), whereas the major interest of the analysis of QR models lies in estimation. At present I will ignore the determination of $\hat{y}_i$ beyond what I stated in (2.71) and concentrate on the kind of QR model implied by the Bayesian discriminant analysis, which I will call the discriminant analysis (DA) model. What distinguishes a DA model from the ordinary QR model I have considered up to now is the fact that a DA model specifies a joint distribution of $y_i$ and $x_i^*$, not just the conditional distribution of $y_i$ given $x_i^*$ as defined by (2.72). In econometric and biometric QR models, the determination of $x^*$ (e.g., income or dosage) clearly precedes that of y (e.g., purchase or death); therefore it is important to specify $P(y = 1 \mid x^*)$ whereas the specification of the distribution of $x^*$ may be ignored. On the contrary, in the DA model, the statement y = 1 (e.g., a skull belongs to a man) logically precedes the determination of $x^*$ (skull measurements); therefore it is more natural to specify the conditional distribution of $x^*$ given y. In subsection (2) I will define a DA estimator and in (3) I will show that in certain cases the DA estimator can be effectively used as a simple alternative to the logit ML estimator in a logit model even when the DA model does not hold.

(2) Normal Discriminant Analysis: By this I mean the DA model derived on the assumption that $g_1$ and $g_0$ are the densities of $N(\mu_1, \Sigma_1)$ and $N(\mu_0, \Sigma_0)$ respectively. This assumption is formally stated as

(2.73)  $x_i^* \mid (y_i = 1) \sim N(\mu_1, \Sigma_1)$,
        $x_i^* \mid (y_i = 0) \sim N(\mu_0, \Sigma_0)$.

Since no other distribution has been used in econometric applications of DA, I will use the word DA synonymously with normal DA from now on. Under (2.73), (2.72) reduces to the following logit model (which I will call a quadratic DA model):

(2.74)  $P(y_i = 1 \mid x_i^*) = L(\beta_{(1)} + \beta_{(2)}'x_i^* + x_i^{*\prime} A x_i^*)$,

where

(2.75)  $\beta_{(1)} = \frac{1}{2}\mu_0'\Sigma_0^{-1}\mu_0 - \frac{1}{2}\mu_1'\Sigma_1^{-1}\mu_1 + \log q_1 - \log q_0 - \frac{1}{2}\log|\Sigma_1| + \frac{1}{2}\log|\Sigma_0|$,

(2.76)  $\beta_{(2)} = \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0$,

and

(2.77)  $A = \frac{1}{2}(\Sigma_0^{-1} - \Sigma_1^{-1})$.

In the special case $\Sigma_1 = \Sigma_0$, which is often assumed in econometric applications, we have A = 0; therefore, (2.74) further reduces to a linear DA model:

(2.78)  $P(y_i = 1 \mid x_i^*) = L(x_i'\beta)$,

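where I have written $\beta_{(1)} + \beta_{(2)}'x_i^* = x_i'\beta$ to conform with the notation of Sections A through C. We will consider the ML estimation of the parameters $\mu_1$, $\mu_0$, $\Sigma_1$, $\Sigma_0$, $q_1$, and $q_0$ based on observations $(y_i, x_i^*)$, i = 1, 2, ..., n. The determination of $q_1$ and $q_0$ varies with authors. The classical statistician often assumes $q_1 = q_0 = 1/2$. The Bayesian usually sets them equal to the values which reflect his a priori belief. Here, I will adopt the approach of Warner [1963], which I may call the Empirical Bayes' method, and treat $q_1$ and $q_0$ as unknown parameters to estimate. The likelihood function can be written as

(2.79)  LF $= \prod_{i=1}^{n} [g_1(x_i^*) q_1]^{y_i} [g_0(x_i^*) q_0]^{1 - y_i}$.

Equating the derivatives of log (LF) to zero yields the following ML estimators:

(2.80)  $\hat{q}_1 = \dfrac{n_1}{n}$,

(2.81)  $\hat{q}_0 = \dfrac{n_0}{n}$,

where $n_1 = \sum_{i=1}^{n} y_i$ and $n_0 = n - n_1$,

(2.82)  $\hat{\mu}_1 = \dfrac{1}{n_1} \sum_{i=1}^{n} y_i x_i^*$,

(2.83)  $\hat{\mu}_0 = \dfrac{1}{n_0} \sum_{i=1}^{n} (1 - y_i) x_i^*$,

(2.84)  $\hat{\Sigma}_1 = \dfrac{1}{n_1} \sum_{i=1}^{n} y_i (x_i^* - \hat{\mu}_1)(x_i^* - \hat{\mu}_1)'$,

(2.85)  $\hat{\Sigma}_0 = \dfrac{1}{n_0} \sum_{i=1}^{n} (1 - y_i)(x_i^* - \hat{\mu}_0)(x_i^* - \hat{\mu}_0)'$.

If $\Sigma_1 = \Sigma_0$ ($\equiv \Sigma$) as is often assumed, (2.84) and (2.85) should be replaced by

(2.86)  $\hat{\Sigma} = \dfrac{1}{n} \left[ \sum_{i=1}^{n} y_i (x_i^* - \hat{\mu}_1)(x_i^* - \hat{\mu}_1)' + \sum_{i=1}^{n} (1 - y_i)(x_i^* - \hat{\mu}_0)(x_i^* - \hat{\mu}_0)' \right]$.

The ML estimators of $\beta_{(1)}$, $\beta_{(2)}$, and A are obtained by inserting these estimates into the right-hand side of (2.75), (2.76), and (2.77).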
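The estimation steps just described (sample proportions for the priors, group means, group or pooled covariance matrices with ML divisors, then substitution into the expressions for $\beta_{(1)}$, $\beta_{(2)}$, and A) can be sketched as follows. This is an illustrative Python sketch, not code from the paper: the function name `da_estimates`, its `common_cov` switch, and the four-observation data set are hypothetical, and the formulas coded are the quadratic DA expressions obtained by taking the log of $g_1 q_1 / (g_0 q_0)$ under normality.

```python
import numpy as np

def da_estimates(X, y, common_cov=True):
    """Normal DA estimates of (beta_1, beta_2, A) from x_i^* (rows of X) and labels y."""
    X = np.asarray(X, dtype=float)          # n x k matrix of x_i^*
    y = np.asarray(y, dtype=float)          # n-vector of 0/1 group labels
    n = y.size
    n1 = y.sum()
    n0 = n - n1
    q1, q0 = n1 / n, n0 / n                                  # (2.80), (2.81)
    mu1 = (y[:, None] * X).sum(axis=0) / n1                  # (2.82)
    mu0 = ((1.0 - y)[:, None] * X).sum(axis=0) / n0          # (2.83)
    d1, d0 = X - mu1, X - mu0
    S1 = (y[:, None] * d1).T @ d1            # sum of y_i (x_i - mu1)(x_i - mu1)'
    S0 = ((1.0 - y)[:, None] * d0).T @ d0    # same for the y_i = 0 group
    if common_cov:
        Sig1 = Sig0 = (S1 + S0) / n                          # (2.86)
    else:
        Sig1, Sig0 = S1 / n1, S0 / n0                        # (2.84), (2.85)
    inv1, inv0 = np.linalg.inv(Sig1), np.linalg.inv(Sig0)
    beta1 = (0.5 * mu0 @ inv0 @ mu0 - 0.5 * mu1 @ inv1 @ mu1
             + np.log(q1) - np.log(q0)
             - 0.5 * np.linalg.slogdet(Sig1)[1]
             + 0.5 * np.linalg.slogdet(Sig0)[1])             # (2.75)
    beta2 = inv1 @ mu1 - inv0 @ mu0                          # (2.76)
    A = 0.5 * (inv0 - inv1)                                  # (2.77)
    return beta1, beta2, A

# Hypothetical one-variable sample: two observations in each group.
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [0, 0, 1, 1]
b1, b2, A_hat = da_estimates(X_demo, y_demo, common_cov=True)
```

For this toy sample the group means are 0.5 and 2.5 and the pooled variance is 0.25, so the linear index $b_1 + b_2 x$ equals $-12 + 8x$ and changes sign at x = 1.5, midway between the two means, as one would expect with equal priors.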


DA is frequently used in transport modal choice analysis. Warner [1962] estimated a model of binary choice between automobile and train by LP-LS and by linear DA (i.e., the case where $\Sigma_1 = \Sigma_0$).

Talvitie [1972] estimated binary choice between auto and rapid transit using data on 159 work trips during morning rush hours from Skokie to the Chicago Loop. He compared 12 models with different independent variables estimated by three methods (probit ML, logit ML, and linear DA), using the number of wrong predictions (2.53) as the criterion. He concluded that the three estimators did equally well. McGillivray [1972], in his analysis of choice between auto and mass transit using data on selected urban trips in the San Francisco Bay area, tried three different types of DA: (1) $q_1 = q_0 = 1/2$ and $\Sigma_1 = \Sigma_0$, (2) $q_1$ and $q_0$ estimated by ML and $\Sigma_1 = \Sigma_0$, and (3) $q_1$ and $q_0$ estimated by ML and $\Sigma_1 \neq \Sigma_0$. Though data limitation prevented McGillivray from obtaining definitive conclusions, he found that the results are sensitive to the specification of $q_1$ and $q_0$.

(3) Comparison Between Logit ML and DA: In this subsection I will assume $\Sigma_1 = \Sigma_0$ ($\equiv \Sigma$) so that (2.78) holds. I will compare the logit ML estimator of $\beta$, denoted $\hat{\beta}_L$, and the DA estimator of $\beta$, denoted $\hat{\beta}_{DA}$. The relative performance of the two estimators will critically depend on the assumed true distribution for $x^*$. If (2.73) with $\Sigma_1 = \Sigma_0$ is assumed in addition to (2.78), the DA estimator is the genuine ML estimator and therefore should be asymptotically more efficient than the logit ML estimator. However, if (2.73) is not assumed, the DA estimator loses its consistency in general, whereas the logit ML estimator retains its consistency. Thus, one would expect that the logit ML estimator is more robust.

Efron [1975] assumed the DA model, (2.73) and (2.78), and studied the loss of efficiency which results if $\beta$ is estimated by the logit ML estimator. He used the asymptotic mean of the so-called error rate as a measure of the inefficiency of an estimator. Conditional on a given estimator $\hat{\beta}$ (be it $\hat{\beta}_{DA}$ or $\hat{\beta}_L$), the error rate is defined by

(2.87)  Error Rate $= P[x'\hat{\beta} \geq 0 \mid x^* \sim N(\mu_0, \Sigma)]\, q_0 + P[x'\hat{\beta} < 0 \mid x^* \sim N(\mu_1, \Sigma)]\, q_1$

        $= q_0 \Phi\left[ \dfrac{\hat{\beta}_{(1)} + \hat{\beta}_{(2)}'\mu_0}{(\hat{\beta}_{(2)}'\Sigma\hat{\beta}_{(2)})^{1/2}} \right] + q_1 \Phi\left[ -\dfrac{\hat{\beta}_{(1)} + \hat{\beta}_{(2)}'\mu_1}{(\hat{\beta}_{(2)}'\Sigma\hat{\beta}_{(2)})^{1/2}} \right]$.
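Under the normality assumption the linear index $x'\hat{\beta} = \hat{\beta}_{(1)} + \hat{\beta}_{(2)}'x^*$ is itself normal within each group, with mean $\hat{\beta}_{(1)} + \hat{\beta}_{(2)}'\mu_j$ and variance $\hat{\beta}_{(2)}'\Sigma\hat{\beta}_{(2)}$, so the error rate can be evaluated with the standard normal distribution function. A minimal Python sketch under that reading follows; the function names and the one-variable parameter values are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal distribution function, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_rate(beta1, beta2, mu0, mu1, Sigma, q0, q1):
    """Error rate of the rule 'classify into group 1 when beta1 + beta2'x >= 0'."""
    beta2 = np.asarray(beta2, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    s = math.sqrt(float(beta2 @ Sigma @ beta2))    # std. dev. of the linear index
    miss0 = std_normal_cdf((beta1 + float(beta2 @ mu0)) / s)    # group 0 sent to 1
    miss1 = std_normal_cdf(-(beta1 + float(beta2 @ mu1)) / s)   # group 1 sent to 0
    return q0 * miss0 + q1 * miss1

# Hypothetical example: means 0 and 2, unit variance, equal priors, and the
# true linear DA coefficients beta2 = 2, beta1 = -2.
rate = error_rate(-2.0, [2.0], [0.0], [2.0], [[1.0]], 0.5, 0.5)
```

With these values each group is misclassified with probability $\Phi(-1)$, about 0.159, which is therefore also the overall error rate.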

Efron derived the asymptotic mean of (2.87) for each of the cases $\hat{\beta} = \hat{\beta}_{DA}$ and $\hat{\beta} = \hat{\beta}_L$, using the asymptotic distributions of the two estimators. Defining the relative efficiency of the logit ML estimator as the ratio of the asymptotic mean of the error rate of the DA estimator to that of the logit ML estimator, Efron found that the efficiency ranges between 40 percent and 90 percent for various experimental parameter values he chose. Press and Wilson [1978] compared the classification derived from the two estimators in a couple of real data examples in which many of the independent variables are binary and therefore clearly violate the DA assumption (2.73). Their results indicate a surprisingly good performance by DA (only slightly worse than the logit ML) in terms of the percentage of correct classification both for the sample observations and the validation set. A study by Amemiya and Powell [1980] was motivated by the above two articles. They considered a simple model with characteristics similar to the two examples of Press and Wilson and analyzed it using asymptotic techniques analogous to Efron's. They compared the two estimators in a logit model with two binary independent variables. The criteria they used were the asymptotic mean of the probability of correct classification (PCC) (i.e., one minus the error rate) and the asymptotic mean squared error. They found that in

terms of the PCC criterion, the DA estimator does very well-only slightly worse than the logit MLE, thus confirming the results of Press and Wilson. In all the experimental parameter values they considered, the lowest efficiency of the DA estimator in terms of the PCC criterion was 97 percent. The DA estimator performed quite well in terms of the mean squared error criterion as well, although it did not do as well as it did in terms of the PCC criterion and it did poorly for some parameter values. Though the DA estimator is inconsistent in the model they considered, the degree of inconsistency (the difference between the probability limit and the true value) was surprisingly small in a majority of the cases. Thus, normal discriminant analysis seems more robust against nonnormality than one would intuitively expect. I should point out, however, that their study was confined to the case of binary independent variables; the DA estimator may not be robust against a different type of nonnormality. McFadden [1976b] illustrates a rather significant asymptotic bias of a DA estimator in a model where the marginal distribution of the independent variable is normal. (Note that when I spoke of normality above I referred to each of the two conditional distributions given in (2.73). The marginal distribution of x* is not normal in the DA model but, rather, is a mixture of normals.) Lachenbruch, Sneeringer, and Revo [1973] also report a poor performance of the DA estimator in certain nonnormal models. III. Multi-response Models A. Definition and Statistical Inference In this section I will define a multi-response QR model in a general way and consider statistical inference to be performed for it. This section will be brief:


I will merely indicate how various results of statistical inference obtained in Section 2 will apply to multi-response models after fairly straightforward modifications. Sections B and C which follow will address more interesting questions regarding how various types of multi-response QR models arise in econometric and biometric applications. Assuming that the dependent variable $y_i$ takes $m_i + 1$ values 0, 1, 2, ..., $m_i$, I will write a general multi-response QR model as

(3.1)  $P(y_i = j) = F_{ij}(x^*, \theta)$,  i = 1, 2, ..., n and j = 1, 2, ..., $m_i$,

where $x^*$ and $\theta$ are vectors of independent variables and parameters respectively. I will write (3.1) also as $P_{ij} = F_{ij}$. Though I have written it in this general way, all the independent variables and the parameters need not be included in the argument of every $F_{ij}$. Note that $P(y_i = 0)$ ($\equiv F_{i0}$) need not be specified since it must be equal to one minus the sum of the $m_i$ probabilities defined in (3.1). It is important to let $m_i$ depend on i because in many applications individuals face different choice sets. For example, in transport modal choice analysis, travelling by train is not included in the choice set of those who live outside of its service area. I will follow the discussion of Sections 2.B and 2.C and indicate how to generalize the results contained therein to the multi-response situation.

Defining $\sum_{i=1}^{n} (m_i + 1)$ binary variables

(3.2)  $y_{ij} = 1$ if $y_i = j$
             $= 0$ if $y_i \neq j$,  i = 1, 2, ..., n and j = 0, 1, ..., $m_i$,

I can write the likelihood function of model (3.1) as

LF $= \prod_{i=1}^{n} \prod_{j=0}^{m_i} F_{ij}^{y_{ij}}$.

General results about the asymptotic distribution and the iterative methods concerning ML estimation, which I have discussed in Section 2.B, apply to the present model. An interpretation of the method of scoring as the NLWLS iteration also holds. I will not prove these results explicitly here. The interested reader is referred to Amemiya [1976]. The three tests of a linear hypothesis defined in (2.36), (2.37), and (2.38) can also be straightforwardly applied to the present model. The minimum chi-square method of Section 2.B(2) can also be used to estimate the model. I will show how the method is used in a special case where model (3.1) can be written in the following way:

(3.3)  $P(y_i = j) = F_j(x_{i1}'\beta_1, x_{i2}'\beta_2, \ldots, x_{im}'\beta_m)$.

We are dealing with the case of many observations per cell, in which the values of the independent variables remain the same within each of T groups. Assume that $x_{ij} = x_{tj}$, j = 1, 2, ..., m, whenever $i \in I_t$. Here I am ruling out the case where m depends on i. Then, specializing to the case of m = 2, we have

(3.4)  $P_{t1} = F_1(x_{t1}'\beta_1, x_{t2}'\beta_2)$
       $P_{t2} = F_2(x_{t1}'\beta_1, x_{t2}'\beta_2)$,  t = 1, 2, ..., T,

where I have written $P_{tj} = P(y_i = j)$ for $i \in I_t$. A concrete example of the above can be found in (3.26) and (3.27) below. Thus, (3.4) is a generalization of (2.39). Assuming that (3.4) defines a one-to-one mapping from $x_{t1}'\beta_1$ and $x_{t2}'\beta_2$ to $P_{t1}$ and $P_{t2}$, we can invert the mapping and write

(3.5)  $x_{t1}'\beta_1 = G_1(P_{t1}, P_{t2})$
       $x_{t2}'\beta_2 = G_2(P_{t1}, P_{t2})$

for some functions $G_1$ and $G_2$. Thus, (3.5) is a generalization of (2.40). Expanding $G_1$


and $G_2$ in Taylor series around $\hat{P}_{tj} = P_{tj}$ (where $\hat{P}_{tj} \equiv n_t^{-1} \sum_{i \in I_t} y_{ij}$), we obtain

(3.6)  $G_1(\hat{P}_{t1}, \hat{P}_{t2}) \cong x_{t1}'\beta_1 + \dfrac{\partial G_1}{\partial P_{t1}} (\hat{P}_{t1} - P_{t1}) + \dfrac{\partial G_1}{\partial P_{t2}} (\hat{P}_{t2} - P_{t2})$
       $G_2(\hat{P}_{t1}, \hat{P}_{t2}) \cong x_{t2}'\beta_2 + \dfrac{\partial G_2}{\partial P_{t1}} (\hat{P}_{t1} - P_{t1}) + \dfrac{\partial G_2}{\partial P_{t2}} (\hat{P}_{t2} - P_{t2})$,

which is a generalization of (2.41). A specific example of (3.6) is given in (3.24) and (3.25) below. The MIN $\chi^2$ estimators of $\beta_1$ and $\beta_2$ are obtained by applying the WLS method to the bivariate heteroscedastic regression model (3.6). The modified MIN $\chi^2$ method of Amemiya and Nold [1975], which was discussed in Section 2.B(3), can be easily generalized to the present model: it merely amounts to adding the independent errors $w_{1t}$ and $w_{2t}$ to the two equations in (3.6).

Such a model with a logit specification (see (3.29) and (3.30) below) was estimated by Parks [1980], who analyzed the consumer's choice of three alternatives (owning no car, one car, or more than one car) using data on 2,576 families interviewed for the 1970 Survey of Consumer Finances. His sole independent variable was income, which was grouped into ten income classes. As in the applications I mentioned in Section 2.A(3), Parks found that the w's dominated the error terms involving $\hat{P}_{t1}$ and $\hat{P}_{t2}$.

Now, I will consider generalizations of the scalar criteria presented in Section 2.C(3). The best way to understand the connection between a dichotomous QR model and a multi-response QR model is provided by the binary variables $y_{ij}$ defined in (3.2). A multi-response QR model is characterized by the $\sum_{i=1}^{n} m_i$ binary variables $y_{ij}$, i = 1, 2, ..., n and j = 1, 2, ..., $m_i$, whereas a dichotomous QR model studied in Section 2 is characterized by n binary variables $y_i$, i = 1, 2, ..., n. (I have omitted $y_{i0}$ since it is determined by $y_{i0} = 1 - \sum_{j=1}^{m_i} y_{ij}$.) Aside from the inconsequential difference that a multi-response QR model involves a greater number of these binary variables, the essential difference between the two models lies in the fact that though $\{y_i\}$ are independent, $\{y_{ij}\}$ are not since Cov$(y_{ij}, y_{ik}) = -P_{ij} P_{ik}$ for $j \neq k$. This means that the criteria which do not involve weights can be straightforwardly generalized, whereas those which involve weights must be modified to take into account the aforementioned covariance. Thus, (2.53) and (2.54) become simply

$\sum_{i=1}^{n} \sum_{j=1}^{m_i} (y_{ij} - \hat{y}_{ij})^2$  and  $\sum_{i=1}^{n} \sum_{j=1}^{m_i} (y_{ij} - \hat{F}_{ij})^2$.

For the purpose of generalizing (2.56) it is convenient to define $m_i$-vectors $y_i = (y_{i1}, y_{i2}, \ldots, y_{im_i})'$ and $P_i = (P_{i1}, P_{i2}, \ldots, P_{im_i})'$. Also define $D(P_i)$ as an $m_i \times m_i$ diagonal matrix whose j-th diagonal element is equal to $P_{ij}$. Then we can write

(3.7)  $E(y_i - P_i)(y_i - P_i)' = D(P_i) - P_i P_i'$,

whose inverse can be shown to be

(3.8)  $[D(P_i) - P_i P_i']^{-1} = D(P_i)^{-1} + P_{i0}^{-1} \mathbf{l}\mathbf{l}'$,

where $\mathbf{l}$ is an $m_i$-vector of ones. Therefore, the appropriately weighted sum of squared residuals should be defined as


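(3.9)  WSSR$_1$ $= \sum_{i=1}^{n} (y_i - P_i)' [D(P_i)^{-1} + P_{i0}^{-1} \mathbf{l}\mathbf{l}'] (y_i - P_i) = \sum_{i=1}^{n} \sum_{j=0}^{m_i} \dfrac{(y_{ij} - P_{ij})^2}{P_{ij}}$.

Thus, a generalization of (2.56) is obtained by substituting $\hat{F}_{ij}$ for $P_{ij}$ in the last expression of (3.9) as

(3.10)  $\sum_{i=1}^{n} \sum_{j=0}^{m_i} \dfrac{(y_{ij} - \hat{F}_{ij})^2}{\hat{F}_{ij}}$.

It is easy to verify that (3.10) reduces to (2.56) when m = 1. Next I will generalize (2.69). Defining $\hat{P}_t = (\hat{P}_{t1}, \hat{P}_{t2}, \ldots, \hat{P}_{tm_t})'$, we have

(3.11)  $E(\hat{P}_t - P_t)(\hat{P}_t - P_t)' = n_t^{-1} [D(P_t) - P_t P_t']$.

Thus, an appropriately weighted SSR is defined by

(3.12)  WSSR$_2$ $= \sum_{t=1}^{T} n_t (\hat{P}_t - P_t)' [D(P_t)^{-1} + P_{t0}^{-1} \mathbf{l}\mathbf{l}'] (\hat{P}_t - P_t) = \sum_{t=1}^{T} n_t \sum_{j=0}^{m_t} \dfrac{(\hat{P}_{tj} - P_{tj})^2}{P_{tj}}$.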
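The inverse in (3.8) and the second equality in (3.9) can be confirmed numerically for a single observation. The sketch below uses a hypothetical case with $m_i = 3$ (four response categories); the probability values and variable names are illustrative only.

```python
import numpy as np

# Hypothetical probabilities for categories j = 1, 2, 3; category 0 gets the rest.
P = np.array([0.2, 0.3, 0.4])
P0 = 1.0 - P.sum()

# Covariance matrix of (y_i1, y_i2, y_i3) from (3.7) and its claimed inverse (3.8).
V = np.diag(P) - np.outer(P, P)
V_inv = np.diag(1.0 / P) + np.ones((3, 3)) / P0

# One observation with y_i = 2, i.e. y_i2 = 1 and all other indicators 0.
y = np.array([0.0, 1.0, 0.0])

# Quadratic form in the middle expression of (3.9) ...
quad = (y - P) @ V_inv @ (y - P)

# ... and the sum over all categories j = 0, 1, ..., m_i on the right of (3.9).
y_full = np.concatenate(([1.0 - y.sum()], y))
P_full = np.concatenate(([P0], P))
pearson = np.sum((y_full - P_full) ** 2 / P_full)
```

The agreement of `quad` and `pearson` shows why the weighted criterion can be written as a sum that includes the omitted category j = 0.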

The generalization of (2.69) is obtained by substituting $\hat F_{tj}$ for $P_{tj}$ appearing in the numerator of the last expression above and substituting $\hat P_{tj}$ (here $\hat F_{tj}$ may also be used) for $P_{tj}$ appearing in the denominator:

(3.13)  $\sum_{t=1}^{T} \sum_{j=0}^{m_t} \frac{n_t (\hat P_{tj} - \hat F_{tj})^2}{\hat P_{tj}}$.

If the parameters in $F_{tj}$ are estimated by the MIN $\chi^2$ method to yield $\hat F_{tj}$ in the above, (3.13) is asymptotically distributed as $\chi^2_{M-K}$, where $M = \sum_{t=1}^{T} m_t$ and K is the total number of parameters in the model.

B. Ordered Models

Multi-response QR models can be classified into ordered and unordered models. Their distinction will become clear after having seen examples of both kinds. This section will give examples of ordered models, and unordered models will be presented in Section C below. Economists use unordered models much more frequently than ordered models. I will give one biometric example and three econometric examples of ordered multi-response models.

Example 3.1 (Gurland, Lee, and Dahm, 1960.) This example is a simple extension of Example 2.1 given in Section 2.A(3). As in Example 2.1, let $x_i$ be the logarithm of dosage of an insecticide given to the i-th insect and let $y_i^*$ be the tolerance of the i-th insect, which I assume to be distributed as $N(\mu, \sigma^2)$. The present example differs from Example 2.1 only in that an additional response of an insect called "moribund" is added to the previous two responses of "alive" and "dead." The dependent variable $y_i$ is defined to take on values 0, 1, or 2 depending on whether the i-th insect is alive, moribund, or dead. We assume that $y_i = 2$ if $y_i^* < x_i$ and $y_i = 0$ if $y_i^* > x_i + \gamma$, where $\gamma$ is an additional unknown parameter. The model is illustrated by Figure 3 below. Mathematically, this model is specified by the following two equations:

(3.14)  $P_{i2} = P(y_i^* < x_i) = \Phi\left[\frac{x_i - \mu}{\sigma}\right]$

and

(3.15)  $P_{i1} + P_{i2} = P(y_i^* < x_i + \gamma) = \Phi\left[\frac{x_i + \gamma - \mu}{\sigma}\right]$.
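The three response probabilities implied by (3.14) and (3.15) can be computed directly. The following is a minimal sketch under the assumptions of Example 3.1 (normally distributed tolerance); the function names and the parameter values in the usage line are hypothetical and serve only as an illustration.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def insect_response_probs(x, mu, sigma, gamma):
    """Probabilities of alive (y = 0), moribund (y = 1), and dead (y = 2).

    x is the log dosage; mu and sigma describe the tolerance distribution;
    gamma is the width of the moribund interval.
    """
    p_dead = norm_cdf((x - mu) / sigma)                      # (3.14)
    p_dead_or_moribund = norm_cdf((x + gamma - mu) / sigma)  # (3.15)
    return {
        "alive": 1.0 - p_dead_or_moribund,
        "moribund": p_dead_or_moribund - p_dead,
        "dead": p_dead,
    }


# Hypothetical values: dosage equal to mean tolerance, so P(dead) = 0.5.
print(insect_response_probs(x=1.0, mu=1.0, sigma=0.5, gamma=0.3))
```

Only the ratios $(x_i - \mu)/\sigma$ and $\gamma/\sigma$ enter the probabilities, which is the basis of the identification argument that follows.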

Since $1/\sigma$ and $\mu/\sigma$ are identified from (3.14) and $1/\sigma$, $\gamma/\sigma$, and $\mu/\sigma$ are identified from (3.15), all the three parameters of the model ($\mu$, $\sigma$, and $\gamma$) are identifiable.

[Figure 3. Three responses of insects. The density of $y^*$ is divided at $x$ and $x + \gamma$ into three regions: dead (y = 2), moribund (y = 1), and alive (y = 0).]

As in the above example, ordered models are used whenever the values taken by the discrete dependent variable y correspond to the intervals within which an unobservable continuous random variable $y^*$ falls. We will see in Section C that in unordered models, more than one unobservable continuous variable is needed to characterize the responses of y.

Example 3.2 (David and Legg, 1975.) The authors attempt to explain the price of a home bought by a household by the size of the household, the age of the head of the household, the income of the head, and the number of years of education of the head. The price of a home is observed only to the extent that it belongs to one of three price ranges. The dependent variable is defined by

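(3.16)  $y_i = \begin{cases} 0 & \text{if the price of a home bought is } < \$28{,}999 \\ 1 & \text{if it is } \$29{,}000 - \$54{,}999 \\ 2 & \text{if it is } > \$55{,}000 \end{cases}$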

It is assumed that an unobservable continuous variable $y_i^*$ is distributed as $N(x_i'\beta, \sigma^2)$, where $x_i$ is the vector of independent variables listed above, and $y_i$ = 0, 1, or 2 corresponding to $y_i^* \in (-\infty, \gamma_1)$, $(\gamma_1, \gamma_2)$, or $(\gamma_2, \infty)$. Therefore, their model can be specified by

(3.17)  $P_{i0} = \Phi\left[\frac{\gamma_1 - x_i'\beta}{\sigma}\right]$

and

(3.18)  $P_{i0} + P_{i1} = \Phi\left[\frac{\gamma_2 - x_i'\beta}{\sigma}\right]$.

The above two equations show that not all the parameters of the model are identifiable: if we write $x_i'\beta$ as $\beta_0 + x_i^{*\prime}\beta_1$ by singling out the constant term $\beta_0$, one can only identify $(\gamma_1 - \beta_0)/\sigma$, $(\gamma_2 - \beta_0)/\sigma$, and $\beta_1/\sigma$.15 However, this should be of no concern to the researcher: he should be interested only in knowing how the independent variables $x_i$ affect the probability of various responses of $y_i$.

How should we interpret $y_i^*$? There is no need to give it a name except to say that it is a continuous variable which affects the outcome of $y_i$. It need not be interpreted as the actual numerical price of a home. Suppose it is. Then, $a_0 + a_1 y_i^* + \xi_i$, for arbitrary constants $a_0$ and $a_1$ and an arbitrary zero-mean normal random variable $\xi_i$, will serve the purpose equally well. However, if the actual price of a home is observable, then it is better to regress it directly on the independent variables. Usually in such a case, there is no need to estimate the probabilities (3.17) and (3.18). If one wished, one could merely substitute the least squares estimates $\hat\beta$ and $\hat\sigma^2$ for $\beta$ and $\sigma^2$ and put $\gamma_1$ = 29,000 and $\gamma_2$ = 55,000.

Example 3.3 (Silberman and Talley, 1974.) The authors used data on 76 SMSA's in 1968-70 to study how the number of bank offices chartered in an SMSA is affected by total personal income, average manufacturing wages, the ratio of nonagricultural employment to population, and the dummy variables representing different banking regulations among SMSA's. The dependent variable $y_i$ is defined to take on five possible values depending on whether the number of bank offices chartered falls into the interval [0, 5], [6, 10], [11, 15], [16, 20], or [21, $\infty$). The five values

15 This means that one can adopt a certain normalization without losing any generality. One possible normalization would be to set $\gamma_1$ and $\gamma_2$ at arbitrary values (subject to the constraint $\gamma_2 > \gamma_1$). Another would be to set $\sigma$ and $\beta_0$ at arbitrary values (subject to $\sigma > 0$).


of $y_i$ are made to correspond to five intervals of an unobservable continuous random variable $y_i^*$ distributed normally with mean $x_i'\beta$. The authors call $y_i^*$ "excess demand for banking." However, as I pointed out in the previous example, a name given to $y_i^*$ is conceptually useful only for enabling the researcher to identify the relevant independent variables.

Example 3.4 (Deacon and Shapiro, 1975.) In this paper, the authors analyzed the voting behavior of Californians in two recent referenda: Rapid Transit Initiative (Nov. 1970) and Coastal Zone Conservation Act (Nov. 1972). Here I will take up only the former. Let $\Delta U_i$ be the difference between the utilities resulting from rapid transit and no rapid transit for the i-th individual. The authors assume that $\Delta U_i$ is distributed logistically with mean $\mu_i$, that is, $P(\Delta U_i < x) = L(x - \mu_i)$, and that the individual vote is determined by the rule:

(3.19)  Vote yes if $\Delta U_i > \delta_i$;
        vote no if $\Delta U_i < -\delta_i$;
        abstain otherwise.

Writing $P_i(Y)$ and $P_i(N)$ for the probabilities that the i-th individual votes yes and no respectively, we have

(3.20)  $P_i(Y) = L(\mu_i - \delta_i)$

and

(3.21)  $P_i(N) = L(-\mu_i - \delta_i)$.

The authors assume $\mu_i = x_i'\beta_1$ and $\delta_i = x_i'\beta_2$, where $x_i$ is a vector of independent variables and some elements of $\beta_1$ and $\beta_2$ are a priori specified to be zeros. (Note that if $\delta_i = 0$, the model becomes a univariate dichotomous logit model discussed in Section 2.) The above model could be estimated by ML if the individual votes were recorded and $x_i$ were observable. But, obviously, they are not: one only knows the proportion of yes votes and no votes in districts and observes average values of $x_i$ or their proxies in each district. Thus, one is forced to use a method suitable for the case of many observations per cell. The authors used data on 334 California cities. Let $I_t$ be the set of individuals living in the t-th city. Then, hypothetically assuming $x_i = x_t$ for all $i \in I_t$, we obtain from (3.20) and (3.21)

(3.22)  $\log \frac{P_t(Y)}{1 - P_t(Y)} = x_t'(\beta_1 - \beta_2)$

and

(3.23)  $\log \frac{P_t(N)}{1 - P_t(N)} = -x_t'(\beta_1 + \beta_2)$.

Let $\hat P_t(Y)$ and $\hat P_t(N)$ be the proportion of yes and no votes in the t-th city. Then, expanding the left-hand side of (3.22) and (3.23) by Taylor series around $P_t(Y)$ and $P_t(N)$ respectively, we obtain the approximate regression equations

(3.24)  $\log \frac{\hat P_t(Y)}{1 - \hat P_t(Y)} = x_t'(\beta_1 - \beta_2) + \frac{1}{P_t(Y)[1 - P_t(Y)]}\,[\hat P_t(Y) - P_t(Y)]$

and

(3.25)  $\log \frac{\hat P_t(N)}{1 - \hat P_t(N)} = -x_t'(\beta_1 + \beta_2) + \frac{1}{P_t(N)[1 - P_t(N)]}\,[\hat P_t(N) - P_t(N)]$.

Note that (3.24) and (3.25) constitute a special case of (3.6). The error terms of these two equations are heteroscedastic and, moreover, correlated with each other. The covariance between the error terms can be obtained from the result $\mathrm{Cov}[\hat P_t(Y), \hat P_t(N)] = -n_t^{-1} P_t(Y) P_t(N)$. The MIN $\chi^2$ estimates of $(\beta_1 - \beta_2)$ and $-(\beta_1 + \beta_2)$ are obtained by applying generalized least squares to (3.24) and (3.25) taking into account both heteroscedasticity and the correlation.16
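The mapping between the pair ($\mu$, $\delta$) and the yes and no probabilities in (3.20) - (3.23) can be illustrated numerically. The following is a minimal sketch with hypothetical parameter values; the function names are assumptions, and the sketch does not implement the generalized least squares step of the MIN $\chi^2$ estimator.

```python
import math


def logistic(z):
    """Logistic distribution function L(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds, the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


def vote_probs(mu, delta):
    """Probabilities of yes, no, and abstain under the rule (3.19)."""
    p_yes = logistic(mu - delta)   # (3.20)
    p_no = logistic(-mu - delta)   # (3.21)
    return p_yes, p_no, 1.0 - p_yes - p_no


def recover_mu_delta(p_yes, p_no):
    """Invert (3.22) and (3.23): logit P(Y) = mu - delta, logit P(N) = -mu - delta."""
    mu = 0.5 * (logit(p_yes) - logit(p_no))
    delta = -0.5 * (logit(p_yes) + logit(p_no))
    return mu, delta


# Hypothetical values of the mean utility difference and the abstention threshold.
p_yes, p_no, p_abstain = vote_probs(mu=0.3, delta=0.5)
print(p_yes, p_no, p_abstain)
print(recover_mu_delta(p_yes, p_no))
```

With observed city proportions in place of the true probabilities, the same two log odds form the dependent variables of the regression equations (3.24) and (3.25).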

C. Unordered Models

(1) Independent Logit: I will indicate how to define the unordered independent logit model by specifying the probability function (3.1) for a certain i for which $m_i = 2$. The meaning of the term independent will become clear in the next paragraph.17 The case of a larger $m_i$ can be easily inferred from the following. Writing $P_{ij} = P(y_i = j)$, j = 0, 1, and 2, the three probabilities are specified by

(3.26)  $P_{i2} = \frac{e^{x_{i2}'\beta}}{1 + e^{x_{i1}'\beta} + e^{x_{i2}'\beta}}$,

(3.27)  $P_{i1} = \frac{e^{x_{i1}'\beta}}{1 + e^{x_{i1}'\beta} + e^{x_{i2}'\beta}}$,

and

(3.28)  $P_{i0} = \frac{1}{1 + e^{x_{i1}'\beta} + e^{x_{i2}'\beta}}$.

I will now discuss a very important result of McFadden [1974], which shows how the unordered independent logit model is derivable from utility maximization. Consider again a particular individual i whose utilities associated with three alternatives are given by

(3.29)  $U_{ij} = \mu_{ij} + \epsilon_{ij}$,  j = 0, 1, and 2,

where $\mu_{ij}$ is a nonstochastic function of explanatory variables and unknown parameters and $\epsilon_{ij}$ is an unobservable random variable. (In the subsequent discussion, I will write $\epsilon_j$ for $\epsilon_{ij}$ to simplify the notation.) Thus, (3.29) is analogous to (2.13) and (2.14). As in Example 2.3, it is assumed that the individual chooses the alternative for which the associated utility is highest. McFadden proved that the model of the form (3.26) - (3.28) is derived from utility maximization if and only if $\{\epsilon_j\}$ are independent and the distribution function of $\epsilon_j$ is given by $\exp(-e^{-\epsilon_j})$. (This is why I call the model independent logit.) I will give only an abridged proof of the if part for the interest of the advanced reader.18

But, first, I will state the essential facts about this distribution. It is called the Type I extreme value distribution, or log Weibull distribution, by Johnson and Kotz [1970a, p. 272], who give many more results about the distribution than I give here. Its density is given by $e^{-\epsilon_j} \exp(-e^{-\epsilon_j})$, which has a unique mode at zero and a mean of approximately 0.577. If one wished the stochastic part of the utility to have zero mean, one should subtract 0.577 from it and add the same to $\mu_{ij}$.

Denoting the above density by $f(\cdot)$, we have (suppressing the subscript i from $\mu_{ij}$ as well as from $\epsilon_{ij}$)

(3.30)  $P(y_i = 2) = P(U_{i2} > U_{i1},\ U_{i2} > U_{i0})$
        $= P(\epsilon_2 + \mu_2 - \mu_1 > \epsilon_1,\ \epsilon_2 + \mu_2 - \mu_0 > \epsilon_0)$
        $= \int_{-\infty}^{\infty} f(\epsilon_2) \left[\int_{-\infty}^{\epsilon_2 + \mu_2 - \mu_1} f(\epsilon_1)\, d\epsilon_1\right] \left[\int_{-\infty}^{\epsilon_2 + \mu_2 - \mu_0} f(\epsilon_0)\, d\epsilon_0\right] d\epsilon_2$
        $= \int_{-\infty}^{\infty} e^{-\epsilon_2} \exp(-e^{-\epsilon_2}) \cdot \exp(-e^{-\epsilon_2 - \mu_2 + \mu_1}) \cdot \exp(-e^{-\epsilon_2 - \mu_2 + \mu_0})\, d\epsilon_2$
        $= \frac{e^{\mu_{i2}}}{e^{\mu_{i0}} + e^{\mu_{i1}} + e^{\mu_{i2}}}$.

16 Deacon and Shapiro actually used (3.24) and the equation obtained by summing (3.24) and (3.25). The resulting estimates of $\beta_1$ and $\beta_2$ are the same as those obtained by the method described in the text.

17 A more popular name for this type of model is "conditional logit model." I do not use this term because the use of the term conditional is unclear.

18 The first proof of the if part is due to A. Marley, as reported in Luce and Suppes [1963]. McFadden rediscovered this and proved the only if part.
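The result in (3.30) can also be checked by simulation: one draws independent Type I extreme value errors, picks the alternative with the highest utility, and compares the observed choice frequencies with the logit formula. The following is a minimal sketch, not a proof; the function names, the values of $\mu_j$, the number of draws, and the seed are all illustrative assumptions.

```python
import math
import random


def extreme_value_draw(rng):
    """One draw from the distribution function exp(-exp(-e)) by inversion."""
    u = rng.random()
    while u == 0.0:  # guard against log(0); random() can return exactly 0.0
        u = rng.random()
    return -math.log(-math.log(u))


def logit_probs(mu):
    """Closed-form choice probabilities e^{mu_j} / sum_k e^{mu_k}."""
    weights = [math.exp(m) for m in mu]
    total = sum(weights)
    return [w / total for w in weights]


def simulated_choice_freqs(mu, n_draws, seed=0):
    """Frequencies with which each alternative attains the highest utility."""
    rng = random.Random(seed)
    counts = [0] * len(mu)
    for _ in range(n_draws):
        utilities = [m + extreme_value_draw(rng) for m in mu]
        counts[utilities.index(max(utilities))] += 1
    return [c / n_draws for c in counts]


mu = [0.0, 0.5, 1.0]  # hypothetical nonstochastic utilities for j = 0, 1, 2
print(logit_probs(mu))
print(simulated_choice_freqs(mu, n_draws=100000))
```

With 100,000 draws the simulated frequencies should differ from the closed-form probabilities only by sampling error of the order of a few thousandths.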


Thus, (3.26) follows from putting $\mu_{i2} - \mu_{i0} = x_{i2}'\beta$ and $\mu_{i1} - \mu_{i0} = x_{i1}'\beta$. Formulae (3.27) and (3.28) can be similarly derived.

Example 3.5. As an application of the unordered independent logit model, consider the following hypothetical model of transport modal choice. I will assume that the utilities associated with three alternatives, car, bus, and train (corresponding to the subscript 0, 1, and 2), are given by (3.29). As in (2.13) and (2.14), I assume

(3.31)  $\mu_{ij} = \alpha + z_{ij}'\beta + w_i'\gamma$,

where $z_{ij}$ is a vector of the mode characteristics and $w_i$ is a vector of the i-th person's socio-economic characteristics. It is assumed that $\alpha$, $\beta$, and $\gamma$ are constant for all i and j. Then, we obtain the model defined by (3.26) - (3.28) by putting $x_{i2} = z_{i2} - z_{i0}$ and $x_{i1} = z_{i1} - z_{i0}$.

As I explained in Example 2.3, Section 2.A(3), the fact that $\beta$ is constant for all the modes makes this model useful in predicting the demand for a certain new mode which comes into existence. Suppose that an estimate $\hat\beta$ of $\beta$ has been obtained in the above model with three modes and that the characteristics $z_{i3}$ of a new mode (designated by subscript 3) have been ascertained from engineering calculations and a sample survey. Then, the probability that the typical i-th person will use the new mode (assuming that the new mode is accessible to the person) can be estimated by

(3.32)  $\hat P_{i3} = \frac{e^{x_{i3}'\hat\beta}}{1 + e^{x_{i1}'\hat\beta} + e^{x_{i2}'\hat\beta} + e^{x_{i3}'\hat\beta}}$,

where $x_{i3} = z_{i3} - z_{i0}$.

There is a certain inherent weakness in the independent logit model: though it works well when the alternatives are dissimilar, the assumption of independence of $\{\epsilon_j\}$ makes it impossible to take into account similarities among alternatives.19

19 This weakness of the independent logit model was first pointed out by Debreu [1960].
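The forecasting formula (3.32) and the property that underlies the weakness just mentioned can be illustrated together. The following is a minimal sketch; the function names, the attribute vectors, and the coefficient values are hypothetical assumptions and are not taken from any study cited here. It shows that adding a fourth mode leaves the ratio of the probabilities of any two existing modes unchanged.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def independent_logit_probs(z_modes, beta):
    """Choice probabilities with x_ij = z_ij - z_i0; mode 0 is the base.

    z_modes[j] is the attribute vector of mode j. With three modes this
    reproduces (3.26) - (3.28); with a fourth mode appended, the last
    element is the forecast (3.32).
    """
    z0 = z_modes[0]
    scores = []
    for z in z_modes:
        x = [zj - z0j for zj, z0j in zip(z, z0)]
        scores.append(math.exp(dot(x, beta)))  # equals 1 for the base mode
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical attributes (time in hours, cost in dollars) and coefficients.
beta_hat = [-1.0, -0.2]
car, bus, train = [0.5, 4.0], [1.0, 1.5], [0.8, 2.5]
new_mode = [0.6, 3.0]

before = independent_logit_probs([car, bus, train], beta_hat)
after = independent_logit_probs([car, bus, train, new_mode], beta_hat)
print(after[3])                                   # forecast share of the new mode
print(before[1] / before[0], after[1] / after[0])  # bus/car ratio is unchanged
```

The unchanged ratio is the property McFadden calls independence from irrelevant alternatives, which the following paragraphs discuss.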

Using McFadden's famous example, suppose that the three alternatives in Example 3.5 consist of car, red bus, and blue bus, instead of car, bus, and train. In such a case, the independence between $\epsilon_1$ and $\epsilon_2$ is a clearly unreasonable assumption because a high (low) utility for red bus should certainly imply a high (low) utility for blue bus. The probability $P_0 = P(U_0 > U_1,\ U_0 > U_2)$ calculated under the independence assumption would surely underestimate the true probability in this case since the assumption ignores the fact that the event $U_0 > U_1$ makes the event $U_0 > U_2$ more likely.

Another way to look at this weakness is to note that in the independent logit model the relative probabilities between a pair of alternatives are specified without consideration of the nature of the third alternative. For example, the relative probabilities between car and red bus are specified the same way regardless of whether the third alternative is blue bus or train, a clearly untenable proposition. McFadden has called this characteristic of the model "independence from irrelevant alternatives."

In Sections C(2) and C(3) I will discuss models which are free of the weakness mentioned in the preceding two paragraphs. However, compared to these models, the independent logit model has an advantage of being simple both conceptually and computationally. Therefore, so far as the alternatives are dissimilar, it is the most useful multi-response model, as well as being the most frequently used one. I will present a few more examples of the model.

Example 3.6 (McFadden, 1976c.) In this article, whose first draft appeared in 1968, McFadden used an independent logit model to analyze the selection of highway routes by the California Division of Highways in the San Francisco and Los Angeles Districts during the years 1958-66. The i-th project among n = 65 projects can be chosen from $m_i$ routes and the selection probability is hypothesized as

(3.33)  $P(y_i = j) = \frac{e^{z_{ij}'\beta}}{\sum_{k=1}^{m_i} e^{z_{ik}'\beta}}$,  $i = 1, 2, \ldots, n$,  $j = 1, 2, \ldots, m_i$,

where $z_{ij}$ is a vector of the attributes of route j in project i. There is a subtle conceptual difference between this model and the model of Example 3.5. In the latter, j signifies a certain common name of transport mode for all the individuals i. For example, j = 0 means car for all i. This is so even for those individuals for whom the j = 0 option is not available. In the former, the j-th route of the first project and the j-th route of the second project have nothing substantial in common except that both are number j routes. However, this difference is not essential, though it might at first seem so. As I explained in Section 2.A(3), in this type of model, each alternative is completely characterized by its characteristics vector z, and a common name such as car is just as meaningless as a number j in the operation of the model.

McFadden considers two sets of models, each set consisting of several models. The first set uses only the cost and benefit variables as independent variables, whereas the second set uses additional independent variables which express people's sentiments and the degree to which people are affected by each alternative. The choice of models is done according to the log likelihood criterion (2.58) and the number of wrong predictions (2.53), each of which is generalized to the multi-response model in an obvious way. It is found that the ranking varies across the criteria. McFadden reports the degrees of freedom for each estimated model, but he does not compare nonnested models by a procedure such as Akaike's given in (2.59).

McFadden tested the hypothesis of "independence from irrelevant alternatives" by reestimating one of his models using the choice set which consists of the chosen route and one additional route randomly selected from the $m_i$ routes. The idea is that if this hypothesis is true, estimates obtained from a full set of alternatives should be close to estimates obtained by randomly eliminating some nonchosen alternatives. For each coefficient separately, the difference between the two estimates was found to be less than its standard deviation, indicating that the hypothesis is likely to be accepted. However, to be exact, one must test the equality of all the coefficients simultaneously. Such a test, using the idea of Hausman [1978], is developed with examples in Hausman and McFadden [1981]. In most of the examples, the hypothesis of the independence from irrelevant alternatives is rejected.

Example 3.7 (Perloff and Wachter, 1979.) The 1977-78 wage subsidy program gave tax credits for new employment, especially of unskilled and part-time labor. The authors tried to evaluate the effectiveness of the program by regressing the percentage increase in employment in the i-th firm, $y_i^*$, on independent variables including the dummy variables representing whether or not the firm knew about the program and whether or not the firm made a conscious effort to increase employment as well as the percentage sales increase of the firm. Upon finding the results somewhat unsatisfactory, the authors then divided the range of $y_i^*$ into five intervals, $S_0 = (-\infty, -1)$, $S_1 = (-1, 2)$, $S_2 = (2, 30)$, $S_3 = (30, 45)$, and $S_4 = (45, \infty)$, and estimated an unordered independent logit model explaining $P_{ij} = P(y_i^* \in S_j)$, using the same independent variables as in the above regression. (In their model, $x_{ij} = x_i$ for all j, and $\beta_j$ varies with j, unlike the model of Examples 3.5 and 3.6.)

I took up this example because their procedure brings up an interesting point. In Example 3.2, I argued that if the basic continuous variable $y_i^*$ is distributed as $N(x_i'\beta, \sigma^2)$ and observable, it is better to regress $y_i^*$ directly on $x_i$ by least squares than to analyze the probability of $y_i^*$ falling into intervals by an ordered QR model. However, if $y_i^*$ is distributed according to an unknown nonnormal distribution which depends on $x_i$ in more complex ways than just a location shift, the procedure of Perloff and Wachter could conceivably give better results than the least squares regression of $y_i^*$ on $x_i$. This is because the probabilities $P(y_i^* \in S_j)$ for various intervals $S_j$ could be better measures of a certain nonnormal distribution than the moments $Ey_i^*$ and $Vy_i^*$.

A disadvantage of this kind of model, in which a continuous variable is partitioned into intervals and the probability of the variable falling into each interval is specified by an unordered model, is that the estimated regression coefficients $\hat\beta_j$ can sometimes be difficult to interpret, especially if $\hat\beta_j$ varies with j in an unsystematic way. The authors recognize this problem and present $P(y_i^* \in S_j)$ evaluated at appropriately chosen values of $x_i$ as the summary statistics; for example, if one is particularly interested in the effect of the knowledge dummy on the probabilities, one should compare the probabilities evaluated at the knowledge dummy equal to 1 versus 0, while evaluating the other independent variables at their sample means. Another summary statistic one could consider is $E_d y_i^*$ defined in the following way: Let $h_j$ be a representative value within $S_j$. Then, define

(3.34)  $E_d y_i^* = \sum_{j=0}^{4} h_j P(y_i^* \in S_j)$.

I put the subscript d (standing for discrete) to distinguish it from the true mean $Ey_i^*$. This may be a useful summary statistic in the situation where $y_i^*$ is not observable. However, in the present model, it is not useful because its consideration is contrary to the very idea of trying an unordered logit model rather than a standard regression model.

(2) Multi-response Discriminant Analysis: The normal DA model of Section 2.D can be generalized to yield a multi-response normal DA model. It is defined by

(3.35)  $x_i^* \mid (y_i = j) \sim N(\mu_j, \Sigma_j)$

and

(3.36)  $P(y_i = j) = q_j$

for $i = 1, 2, \ldots, n$ and $j = 0, 1, \ldots, m$.20 By Bayes' rule, we obtain

(3.37)  $P(y_i = j \mid x_i^*) = \frac{g_j(x_i^*)\, q_j}{\sum_{k=0}^{m} g_k(x_i^*)\, q_k}$,

where g) is the density function of N(pj, 1j). Just as we obtained (2.74) from (2.72), we can obtain from (3.37) P(y =jjx*) (3.38) P( 1Ox' =

L(3j(,) + A8j(2,X + xi 'Axr)

where I3j(1),RIj(2, and A are similar to (2.75), (2.76), and (2.77) except that the subscripts 1 and 0 should be changed to j and 0 respectively. As before, the term x*'Axr drops out if all the I's are identical. If we write 13j(1)+,8

x* =

3;j'xi,the three-response

(mr=2) DA model with identical variances can be written exactly in the form of (3.26) " In QR models I sometimes write P(yi = j) to denote P(yi = jlxi). This is an acceptable practice since the xi are regarded as given constants in QR models. In DA models we must distinguishbetween the priorprobabilityP(yi= j) and the posteriorprobability P(yi = jIxi).



- (3.28) except for the modification $x_{i1} = x_{i2} = x_i$.

I will give two examples of the use of the multi-response DA model.

Example 3.8 (Powers, Marsh, Huckfeldt, and Johnson, 1978.) In the authors' model, the dependent variable is the number of children, taking six integer values 0 through 5, and the independent variables consist of the value of the house, the income of the husband and the wife, the age of the husband and the wife, the size of the city in which the individuals live, and the religion. The authors estimated the model both by the DA-ML and the unordered logit-ML estimators. They also estimated an ordered probit model using the same data. The proportion of the sample correctly classified by each of the three methods was as follows: logit 42.22%, probit 36.41%, and DA 33%. A good performance by probit is surprising when we consider the fact that an ordered model estimates much fewer parameters than an unordered model. The poor performance by DA is also surprising in view of the results presented in Section 2.D.

Example 3.9 (Uhler, 1968.) The housing price is divided into six classes and the probability that the price falls into one of the classes is analyzed by a multi-response quadratic DA model assuming different $\Sigma$'s. Log income and log age are used as the independent variables. The author calculates the approximate mean $E_d y_i^*$ defined in (3.34) and compares it with the true mean $Ey_i^*$ (the actual housing price $y_i^*$ is observable). My comments at the end of Example 3.7 are valid here. The author also calculates $\partial E_d y_i^* / \partial x_i$, evaluated at appropriately chosen values of $x_i$, to assess the impact of $x_i$ on $y_i^*$.

(3) Nonindependent Logit: In Section C(1) I defined the independent logit model and pointed out its weakness which manifests itself when some of the alternatives are similar. In this section I will discuss the nonindependent logit model which alleviates the above weakness. This model is due to McFadden [1977] and is developed in greater detail in McFadden [1981]. I will first present a simple model with three responses and then successively move toward more general models.

Let us consider the red bus, blue bus model once more for the purpose of illustration. Let $U_j = \mu_j + \epsilon_j$, j = 0, 1, and 2, be the utilities associated with car, red bus, and blue bus. (To avoid unnecessary complication in notation, I will suppress the subscript i.) I pointed out earlier that it is unreasonable to assume independence between $\epsilon_1$ and $\epsilon_2$, though $\epsilon_0$ may be assumed to be independent of the other two. McFadden suggests the following bivariate distribution as a convenient way to take account of a correlation between $\epsilon_1$ and $\epsilon_2$:

(3.39)  $F(\epsilon_1, \epsilon_2) = \exp\left\{-\left[e^{-\epsilon_1/(1-\sigma)} + e^{-\epsilon_2/(1-\sigma)}\right]^{1-\sigma}\right\}$,  $0 \le \sigma < 1$.

Johnson and Kotz (1972, p. 256) call this distribution Gumbel's type B bivariate extreme-value distribution. The correlation coefficient can be shown to be $1 - (1 - \sigma)^2$. If $\sigma = 0$ (the case of independence), $F(\epsilon_1, \epsilon_2)$ becomes the product of two Type I extreme value distributions, in other words, the case of the independent logit model. As for $\epsilon_0$, I assume $F(\epsilon_0) = \exp(-e^{-\epsilon_0})$ as in the independent logit model. Then, after a certain amount of straightforward integration, we obtain

(3.40)  $P(y = 0) = \frac{e^{\mu_0}}{e^{\mu_0} + \left[e^{\mu_1/(1-\sigma)} + e^{\mu_2/(1-\sigma)}\right]^{1-\sigma}}$

and

(3.41)  $P(y = 1 \mid y \ne 0) = \frac{e^{\mu_1/(1-\sigma)}}{e^{\mu_1/(1-\sigma)} + e^{\mu_2/(1-\sigma)}}$.
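The behavior of (3.40) and (3.41) in the red bus, blue bus case can be seen numerically. The following is a minimal sketch with hypothetical utilities (all three $\mu_j$ set to zero); the function name is an assumption. At $\sigma = 0$ the model gives each alternative probability 1/3, as in the independent logit model, and as $\sigma$ approaches 1 the two buses together are treated almost as a single alternative, so that the probability of car approaches 1/2.

```python
import math


def nonindependent_logit_probs(mu0, mu1, mu2, sigma):
    """Probabilities of car (0), red bus (1), and blue bus (2).

    Uses (3.40) for P(y = 0) and (3.41) for P(y = 1 | y != 0);
    sigma measures the similarity of alternatives 1 and 2.
    """
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must satisfy 0 <= sigma < 1")
    s = 1.0 - sigma
    e1 = math.exp(mu1 / s)
    e2 = math.exp(mu2 / s)
    bus_term = (e1 + e2) ** s                         # bracketed term in (3.40)
    p0 = math.exp(mu0) / (math.exp(mu0) + bus_term)   # (3.40)
    p1_given_bus = e1 / (e1 + e2)                     # (3.41)
    p1 = (1.0 - p0) * p1_given_bus
    p2 = (1.0 - p0) * (1.0 - p1_given_bus)
    return p0, p1, p2


for sigma in (0.0, 0.5, 0.9, 0.99):
    print(sigma, nonindependent_logit_probs(0.0, 0.0, 0.0, sigma))
```

The sketch evaluates the probabilities only; it does not estimate $\sigma$ or the parameters in $\mu_j$.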

The form of these two probabilities is intuitively attractive. (3.41) shows that the choice between the two similar alternatives is made according to a dichotomous logit model, while (3.40) suggests that the choice between car or noncar is also like a logit model except that a certain kind of a weighted average of $e^{\mu_1}$ and $e^{\mu_2}$ is used.

Next, I will generalize the three-response model above to the case of m + 1 responses. I will call this model McFadden's standard GEV (Generalized Extreme Value) model.21 Suppose that the m + 1 integers 0, 1, ..., m can be naturally partitioned into S groups so that each group consists of similar alternatives. Write the partition as

Note that (3.44) and (3.45) are generalizations of (3.40) and (3.41) respectively. As before, we can interpret z

ei-oj-1-0"r]

as a kind of weighted average of e'i for JEB,. The following example is an interesting application of Standard GEV. Example 3.10 (McFadden, 1978.) A person chooses a community to live in and a type of dwelling to live in. There are S communities; integers in B. signify the types of dwellings available in community s. Set a. = const., a-, = const., and ptcd = ,IXCd + a'zc. Then,

= B, UB2U. .* UBs, where Udenotes the union. Then, McFadden suggests the joint distribution

=--~~BXcd d,zA t,c C'

dOEBC,

and

C, m)

as [~ei~s

=exp

6 t Xcd+a t ZC

[

(3.43) E1,

j

[

(3.46) P(Community c is chosen)

(3.42) (O, 1,2,* *,m)

F(Eo,

1521

~J

(3.47) P(Dwelling d is chosenjcis chosen) e

Then, it is shown that

/'Xcd,

e

-

d'eBc

(3.44)

LeBs

Pj= -r=l

jEBr

and (3.45)

P(y=jjjEE B) =

e1-s We

ILk 1-s

kEEBS

21 I call this model "standard GEV" because it is a special case of what McFadden calls the GEV model, which is based on a general multivariate distribution function $F(\epsilon_0, \epsilon_1, \ldots, \epsilon_m) = \exp[-G(e^{-\epsilon_0}, e^{-\epsilon_1}, \ldots, e^{-\epsilon_m})]$, where G satisfies certain conditions. No form of GEV other than the standard GEV has been used in applications. The standard GEV is sometimes called the nested logit model.

As in this example, a nonindependent logit is useful when a set of utilities can be naturally classified into independent classes while nonzero correlation is allowed among utilities within each class. This model gives rise to a natural two-step estimation procedure. The portion of the likelihood function which is the product of the conditional probabilities of the form (3.47) can be maximized to yield an estimate of $\beta$. Then, this estimate of $\beta$ can be inserted into the right-hand side of (3.46), and the product of (3.46) can be maximized to yield an estimate of $\alpha$. These estimates, though not as asymptotically efficient as ML estimates, can be shown to be consistent. The method entails much less computation than ML.

Though McFadden's standard GEV model takes into account the correlations among the utilities associated with similar alternatives, the reader should bear in mind that it does so in a very simple way: the correlations are characterized by a single parameter $\sigma_s$ in the group $B_s$. This is both an advantage and a disadvantage: an advantage because fewer parameters can be more efficiently estimated and a disadvantage because the true correlation structure may be more complex. In the subsections (4) and (5) below I will present two types of unordered models which can handle a more complex correlation structure.

(4) Universal Logit: In Section 2.C(2) I mentioned universal logit in the univariate dichotomous situation, meaning that a given probability function $G(x_i^*, \theta)$ can be approximated by $L[H(x_i^*, \theta)]$ by choosing an appropriate $H(x_i^*, \theta)$. A similar fact holds in the multi-response case as well. Consider the three-response case of Example 3.5. A universal logit model is defined by

(3.48)  $P_{i2} = \frac{e^{g_{i2}}}{1 + e^{g_{i1}} + e^{g_{i2}}}$,

(3.49)  $P_{i1} = \frac{e^{g_{i1}}}{1 + e^{g_{i1}} + e^{g_{i2}}}$,

and

(3.50)  $P_{i0} = \frac{1}{1 + e^{g_{i1}} + e^{g_{i2}}}$,

where both $g_{i1}$ and $g_{i2}$ are functions of all the explanatory variables of the model: $z_{i0}$, $z_{i1}$, $z_{i2}$, and $w_i$. Any arbitrary three-response model can be approximated by the above by choosing the functions $g_{i1}$ and $g_{i2}$ appropriately. As long as $g_{i1}$ and $g_{i2}$ depend on all the mode characteristics, the universal logit model does not satisfy the assumption of the independence from irrelevant alternatives. When the g's are linear in the explanatory variables with coefficients which generally vary with the alternatives, the model is reduced to a multi-response logit model sometimes used in applications (Cox 1966, p. 65).

(5) Probit: Let $U_j$, $j = 0, 1, 2, \ldots, m$, be the stochastic utility associated with the j-th alternative for a particular individual. By the unordered probit model I mean the model in which the $U_j$ are jointly normally distributed. Such a model was first proposed by Aitchison and Bennett [1970]. This has rarely been used in practice until recently because of its computational difficulty (except, of course, the case of m = 1, in which case the model is reduced to a dichotomous probit model discussed in Section 2). To illustrate the complexity of the problem, consider the case of m = 2. Then, in order to evaluate P(y = 2), for example, one must calculate the multiple integral

(3.51)  $P(y = 2) = P(U_2 > U_1,\ U_2 > U_0) = \int_{-\infty}^{\infty} \int_{-\infty}^{U_2} \int_{-\infty}^{U_2} f(U_0, U_1, U_2)\, dU_0\, dU_1\, dU_2$,

where fis a trivariate normal density. The direct computation of such an integral is involved even for the case of m = 2.22 Moreover, (3.51) must be evaluated at each step of an iterative method to maximize the likelihood function. A correspondence between the responses of y and the inequalities involving the U'sin the above three-response unordered probit model may be illustrated by Figure 4. A comparison of this figure with Figure 3 shows that in the ordered model the responses of y correspond to a partition of the real line, whereas in the unor22The triple integral in (3.51) can be reduced to a double integral by a certain transformation.In general, one must evaluate m-tuple integrals for m + 1 responses.


Amemiya: Qualitative Response Models U2-U1

y=2 y=O

U2-UO

y=1

Figure 4. A partition of a plane in unordered model

dered model they correspond to a partition of a higher dimensional Euclidian space (a plane in the particular case of Figure 4). Hausman and Wise [1978] estimate a three-response unordered probit model to explain the modal choice between driving own car, sharing rides, and riding a bus, for 557 workers in Washington D.C. They specify the utilities by (3.52) Uij= ,Ji1 log xij + 3i2 log9X2 +180

dent Logit (derived from (3.52) by assuming that 13'sare non-stochastic and Eijare independently distributed as Type I extreme value distribution) and Independent Probit (derived from the model defined in the preceding paragraph by 2 = putting- o2 = cr2=O). I will call the original model General Probit. The conclusions of Hausman and Wise are as follows: (1) Logit and Independent Probit give similar results both in estimation and in the forecast of the probability of using a new mode. (2) General Probit differssignificantly from the other two models both in estimation and the forecast about the new mode. (3) General Probit fits best. The likelihood ratio test rejects Independent Probit in favor of General Probit at the 7 percent significance level. These results appear promising for further development of the model. Albright, Lerman, and Manski [1977] developed a computer program for calculating the ML estimator (by a gradient iterative method) of an unordered Probit model similar to, but slightly more general than, the model of Hausman and Wise. Their specification is (3.53)

XV+3ij X4ij

where x1, X2, X3, and x4 represent in-vehicle time, out-of-vehicle time, income, and cost respectively. They assume that are independently (with (1, 1,i2,13i3,Ej) each other) normally distributed with means 01,/2,13,0) and variances (o1 ,2, o3,1). It is assumed that Eij are independent both through i and j and ,3's are independent through i; therefore, correlation between Uijand Uik occurs because of the same /3'sappearing for all the alternatives. The authors evaluated integrals of the form (3.51) using series expansion and noted that the method is feasible for a model with up to five alternatives. Hausman and Wise also used two other models to analyze the same data:Indepen-


Uij= x!f3i + Eij,

where $\beta_i \sim N(\beta, \Sigma_\beta)$ and $\epsilon_i = (\epsilon_{i0}, \epsilon_{i1}, \ldots, \epsilon_{im}) \sim N(0, \Sigma_\epsilon)$. As in Hausman and Wise, $\beta_i$ and $\epsilon_i$ are assumed to be independent of each other and independent through $i$. A certain normalization is employed on the parameters to make them identifiable. An interesting feature of their computer program is that it gives the user the option of calculating the probability of the form (3.51) at each step of the iteration either by (1) simulation or (2) Clark's approximation (Clark, 1961), rather than by series expansion. The authors claim that their program can handle as many as ten alternatives. Simulation works as follows: consider evaluating (3.51) for example. One artificially generates many observations on $U_0$, $U_1$, and $U_2$ according to $f$ evaluated at particular parameter values and simply estimates the probability by the observed frequency. Clark's method is based on a normal approximation of the distribution of $\max(X, Y)$ when both $X$ and $Y$ are normally distributed. The exact mean and variance of $\max(X, Y)$, which can be easily evaluated, are used in the approximation. The authors performed a Monte Carlo study, which showed Clark's method to be quite accurate.²³

²³ Horowitz [1979 and 1981], however, shows that the method is reliable only for nonnegatively correlated variables of comparable variance and tends to overestimate small probabilities.

Albright, et al. applied their probit model to the same data that Hausman and Wise used and estimated it by Clark's method. Their model is more general than the model of Hausman and Wise, and their independent variables contained additional variables such as mode-specific dummies and the number of automobiles in the household. They also estimated an independent logit model. Their conclusions were as follows: (1) Their probit and logit estimates did not differ by much. (They compared the raw estimates rather than comparing $\partial P/\partial x$ for each independent variable.) (2) They could not obtain accurate estimates of $\Sigma_\beta$ in their probit model. (3) An increase in $\log L$ in their probit model as compared to their logit model did not seem to be large enough to compensate for a loss in degrees of freedom. (4) The logit model took 4.5 seconds per iteration, whereas the probit model took 60-75 CPU seconds per iteration on IBM 370 Model 158. Thus, all in all, though they demonstrated the feasibility of their probit model, the gain of using the probit model over the independent logit did not seem to justify the added cost for this particular data set. The discrepancy between the conclusions of Hausman and Wise and those of Albright, et al. is probably due to the fact

that Hausman and Wise imposed certain zero specifications on the covariance matrix, which suggests that covariance specification plays a crucial role in this type of model.

Hausman [1979] used a probit model analogous to the Hausman-Wise model to analyze the consumer's choice of room coolers. (Each consumer was assumed to face a choice of three brands, a choice set varying among individuals.)

(6) Sequential Probit and Logit: If the choice decision is made sequentially, the estimation of multi-response models can be reduced to the successive estimation of models with a smaller number of responses, which results in an obvious computational economy. I will illustrate this with a three-response sequential probit model. Suppose an individual determines whether $y = 2$ or $y \neq 2$ and then, given $y \neq 2$, determines whether $y = 1$ or $0$. Assuming that each choice is made according to a dichotomous probit model, a sequential probit model is specified by

(3.54)  $P_2 = \Phi(x_2'\beta_2)$

and

(3.55)  $P_1 = [1 - \Phi(x_2'\beta_2)]\,\Phi(x_1'\beta_1)$.
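The structure of (3.54) and (3.55) can be made explicit by writing out the log likelihood. The following is a sketch under notation that is not in the original: $d_{i2} = 1$ if $y_i = 2$ and $0$ otherwise, $d_{i1} = 1$ if $y_i = 1$ and $0$ otherwise, and $x_{i1}$, $x_{i2}$ denote the regressors of the $i$-th individual. The residual probability is taken to be $P_0 = [1 - \Phi(x_2'\beta_2)][1 - \Phi(x_1'\beta_1)]$.

```latex
\log L
  = \sum_{i} \Bigl\{ d_{i2}\,\log \Phi(x_{i2}'\beta_2)
      + (1 - d_{i2})\,\log\bigl[1 - \Phi(x_{i2}'\beta_2)\bigr] \Bigr\}
  + \sum_{i:\, y_i \neq 2} \Bigl\{ d_{i1}\,\log \Phi(x_{i1}'\beta_1)
      + (1 - d_{i1})\,\log\bigl[1 - \Phi(x_{i1}'\beta_1)\bigr] \Bigr\}
```

The first sum involves only $\beta_2$ and uses all observations; the second involves only $\beta_1$ and uses the observations with $y_i \neq 2$.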

The likelihood function of this model can be maximized by maximizing the likelihood function of a dichotomous probit model twice. A sequential logit model can be analogously defined. Kahn and Morimune [1979] used such a model to explain the number of employment spells a worker experienced in 1966 by such independent variables as the number of grades completed, a health dummy, a marriage dummy, the number of children, a part-time employment dummy, experience, etc. The dependent variable $y_i$ is assumed to take one of the four values (0, 1, 2, and 3) corresponding to the number of spells experienced by the $i$-th worker, except that $y_i = 3$ means "greater than or equal to 3 spells." The authors specify probabilities sequentially as

(3.56)  $P(y_i = 0) = L(x_i'\beta_0)$,

(3.57)  $P(y_i = 1 \mid y_i \neq 0) = L(x_i'\beta_1)$,

and

(3.58)  $P(y_i = 2 \mid y_i \neq 0,\ y_i \neq 1) = L(x_i'\beta_2)$.
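As an illustration of how (3.56) through (3.58) determine the four unconditional probabilities, here is a minimal Python sketch. The function names, regressor values, and coefficient values are hypothetical and chosen only for the example; they are not taken from Kahn and Morimune.

```python
import math

def logistic(v):
    """Logistic distribution function L(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def sequential_logit_probs(x, betas):
    """Unconditional P(y = 0), ..., P(y = 3) implied by (3.56)-(3.58).

    x     : list of regressors for one worker
    betas : [beta_0, beta_1, beta_2], one coefficient list per stage
    """
    probs = []
    survive = 1.0  # probability that every earlier category was passed over
    for beta in betas:
        index = sum(b * xi for b, xi in zip(beta, x))
        stop = logistic(index)        # conditional probability of stopping here
        probs.append(survive * stop)
        survive *= 1.0 - stop
    probs.append(survive)             # residual category: "3 or more spells"
    return probs

# Hypothetical worker (constant, years of schooling) and coefficients.
x = [1.0, 12.0]
betas = [[0.5, -0.05], [0.2, -0.02], [0.1, 0.0]]
probs = sequential_logit_probs(x, betas)
```

Each stage is an ordinary dichotomous logit on the observations that reach it, which is why the stages can be estimated one at a time.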

This model has the same disadvantage as that pointed out in Example 3.7: namely, a difficulty in interpreting the estimated coefficients $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$. To mitigate this disadvantage, the authors calculate $E y_i = \sum_{j=0}^{3} j\,P(y_i = j)$, ignoring the fact that $y_i = 3$ actually means "greater than or equal to 3," and $\partial E y_i/\partial x_i$ to assess the effect of $x_i$ on $y_i$. It seems to me that the authors could have used an ordered logit model with their data, since one could theorize that there exists a certain continuous unobservable variable $y_i^*$ (interpreted as a measure of the tendency for unemployment) which affects the discrete outcome. Specifying $y_i^* = x_i'\beta + \epsilon_i$ would lead to an ordered model of the type discussed in Section B.

IV. Multivariate Models

A. Introduction

A multivariate QR model specifies the joint probability distribution of two or more discrete dependent variables. For example, suppose there are two dichotomous dependent variables $y_1$ and $y_2$, each of which takes values 1 or 0. Their joint distribution can be described by the table

(4.1)
            | $y_2 = 1$ | $y_2 = 0$
  $y_1 = 1$ | $P_{11}$  | $P_{10}$
  $y_1 = 0$ | $P_{01}$  | $P_{00}$

where $P_{jk} = P(y_1 = j,\ y_2 = k)$. The model is completed by specifying $P_{11}$, $P_{10}$, and $P_{01}$ as functions of independent variables and unknown parameters. ($P_{00}$ is determined as one minus the sum of the other three probabilities.) A multivariate QR model is a special case of a multi-response QR model. For example, the model represented by (4.1) is equivalent to a multi-response model for a single discrete dependent variable which takes four values with probabilities $P_{11}$, $P_{10}$, $P_{01}$, and $P_{00}$. Therefore, the theory of statistical inference I discussed in regard to a multi-response model in Section 3.A is valid for a multivariate model without modification. The only new topic I should specifically discuss concerning multivariate models is how to specify the probabilities by taking account of special features of a multivariate model.

Before going into the discussion of various ways to specify multivariate QR models, I will comment on two empirical papers, which are concerned with a basically multivariate model and yet deliberately ignore its specific multivariate features (perhaps justifiably to a certain extent). I do so in the hope of shedding some light on the distinction between a multivariate model and the other multi-response models.

Example 4.1 (Silberman and Durden, 1976): They analyze how representatives voted on two bills (House bill and Substitute bill) concerning the minimum wage. The independent variables are the socioeconomic characteristics of a legislator's congressional district: namely, the campaign contribution of labor unions, the campaign contribution of small business, the percentage of people earning less than $4,000, the percentage of people in age 16-21, and a dummy for the South. Denoting House bill and Substitute bill by H and S, the actual count of votes on the bills were as described in the table:


(4.2)
          | H: Yes | H: No
  S: Yes  | 75     | 119
  S: No   | 205    | 0

The zero count in the last cell explains why the authors (justifiably) did not set up a multivariate QR model to analyze the data. Instead, they used an ordered probit model by ordering the three nonzero responses in the order of a representative's feeling in favor of the minimum wage as follows:

(4.3)
  $y$ | S   | H   | $y^*$ = feeling in favor of the minimum wage
  0   | Yes | No  | Weakest
  1   | Yes | Yes | Medium
  2   | No  | Yes | Strongest

Assuming that $y^*$ is normally distributed with the mean linearly dependent on the independent variables, the authors specified the probabilities as

(4.4)  $P_{i0} = \Phi(x_i'\beta)$

and

(4.5)  $P_{i0} + P_{i1} = \Phi(x_i'\beta + \alpha)$,

just like Example 3.2. An alternative specification which takes into consideration the multivariate nature of the problem, and at the same time recognizes the zero count of the last cell, may be developed in the form of a sequential probit model as follows:

(4.6)  $P(H_i = \text{Yes}) = \Phi(x_i'\beta_1)$

and

(4.7)  $P(S_i = \text{No} \mid H_i = \text{Yes}) = \Phi(x_i'\beta_2)$.

A choice between these two models must be determined, first, from a theoretical

standpoint based on an analysis of the legislator's behavior, and, second, if the first consideration is not conclusive, from a statistical standpoint based on some measure of goodness of fit, such as one of the scalar criteria discussed in Section 2.C(3). Since the two models involve different numbers of parameters, a certain adjustment for the degrees of freedom applicable for comparing nonnested models, such as (2.59), must be employed. The problem boils down to the question: Is the model defined by (4.6) and (4.7) sufficiently better than the model defined by (4.4) and (4.5) to compensate for a reduction in the degrees of freedom?

Example 4.2 (Bartel, 1979): Bartel analyzed a joint determination of a work decision and migration. Each individual faces six alternatives as described in the following table:

(4.8)
  Job \ Mig. | Not Migrate | Migrate
  Quit       | $P_1$       | $P_4$
  Laid off   | $P_2$       | $P_5$
  Keep Job   | $P_3$       | $P_6$

Bartel ignored the multivariate nature (as well as the multi-response nature) of the data and estimated each of the five probabilities $P_1$ through $P_5$ separately by the univariate, dichotomous Logit ML estimator. Though her procedure has the merit of computational simplicity, it is beset with the following two problems: (1) The sum of the five estimated probabilities could exceed unity. (2) A correlation among the five dependent variables is ignored. (Cf. Section 3.A.) Any of the multi-response models analyzed in Section 3 would provide a framework by which both problems could be alleviated. Moreover, if one believed that the work decision preceded the migration decision, a sequential model like (4.6) and (4.7), where the marginal probabilities of a job decision and the conditional probabilities of a migration decision given a job decision are specified, would be more appropriate.

B. Logit

In this section I will discuss a multivariate situation which is treated merely as a multi-response model. Though any of the multi-response models discussed in Section 3 can be used in a multivariate situation, I will only consider McFadden's independent logit and nonindependent logit and illustrate their use in a 2 x 2 case. Though the form of the model discussed below is the same as multi-response logit discussed earlier, I will indicate a certain parameterization which is likely to occur in the multivariate case and which naturally leads to a simple two-step estimation method. An example of a multivariate independent logit model may be specified as in the following table:

(4.9)
            | $y_2 = 1$                                   | $y_2 = 0$
  $y_1 = 1$ | $d^{-1}\exp(\beta' x_{11} + \gamma' z_1)$   | $d^{-1}\exp(\beta' x_{10} + \gamma' z_1)$
  $y_1 = 0$ | $d^{-1}\exp(\beta' x_{01} + \gamma' z_0)$   | $d^{-1}\exp(\beta' x_{00} + \gamma' z_0)$

where $d$ is the sum of the four exponential functions. The only novelty of the above specification beyond the usual multi-response independent logit model lies in its particular parameterization. This parameterization is useful whenever there are certain independent variables which affect only one of the two dichotomous variables. For example, this model was used by Domencich and McFadden [1975] in a study of the joint determination of travelling by car ($y_1 = 1$) or bus ($y_1 = 0$) and travelling during rush hours ($y_2 = 1$) or off-rush hours ($y_2 = 0$). Then, $z_1$ and $z_0$ denote the characteristics of car and bus respectively, which are constant regardless of the time of travel.

I will show that the above parameterization leads to a two-step estimation method. Suppose that the $i$-th person chooses $y_1 = 1$ and $y_2 = 1$. Then, his contribution to the likelihood function, denoted by $LF_i$, is given by (suppressing the subscript $i$ from $y$'s, $x$'s, and $z$'s)

(4.10)  $LF_i = P(y_2 = 1 \mid y_1 = 1)\cdot P(y_1 = 1) = \dfrac{e^{\beta' x_{11}}}{e^{\beta' x_{11}} + e^{\beta' x_{10}}}\cdot\dfrac{e^{\beta' x_{11} + \gamma' z_1} + e^{\beta' x_{10} + \gamma' z_1}}{d}$,

which shows that the first part of $LF_i$, due to the conditional probability, depends only on $\beta$, whereas the second part, due to the marginal probability, depends both on $\beta$ and $\gamma$. This is true regardless of the values taken by $y_1$ and $y_2$. Therefore, the likelihood function can be written in the form

(4.11)  $LF = g(\beta)\,h(\beta, \gamma)$.
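The factorization behind (4.10) and (4.11) can be checked numerically. The sketch below builds the four cell probabilities of table (4.9) and confirms that the conditional probability of $y_2$ given $y_1$ does not involve $\gamma$. All names and numerical values are hypothetical, introduced only for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def joint_probs(beta, gamma, x, z):
    """Cell probabilities P[(j, k)] = P(y1 = j, y2 = k) from table (4.9).

    x[j][k] holds the regressors that vary with both outcomes,
    z[j] the regressors that depend on y1 only.
    """
    weights = {
        (j, k): math.exp(dot(beta, x[j][k]) + dot(gamma, z[j]))
        for j in (0, 1) for k in (0, 1)
    }
    d = sum(weights.values())  # normalizing constant d
    return {cell: w / d for cell, w in weights.items()}

def cond_y2_given_y1(beta, x, j):
    """P(y2 = 1 | y1 = j); gamma cancels, so only beta enters."""
    num = math.exp(dot(beta, x[j][1]))
    return num / (num + math.exp(dot(beta, x[j][0])))

# Hypothetical values, chosen only for illustration.
beta, gamma = [0.4, -0.2], [0.7]
x = [[[1.0, 0.5], [0.0, 1.5]],   # x[0][0], x[0][1]
     [[2.0, 1.0], [1.0, 0.0]]]   # x[1][0], x[1][1]
z = [[0.3], [1.2]]               # z[0], z[1]

p = joint_probs(beta, gamma, x, z)
marginal_y1 = p[(1, 1)] + p[(1, 0)]   # P(y1 = 1): depends on beta and gamma
factored = cond_y2_given_y1(beta, x, 1) * marginal_y1
```

Here `factored` reproduces `p[(1, 1)]`, the likelihood contribution in (4.10); the first factor is the piece collected in $g(\beta)$ and the second the piece collected in $h(\beta, \gamma)$.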

Domencich and McFadden propose the following estimation procedure: First, estimate $\beta$ by maximizing $g(\beta)$ and call this estimator $\hat\beta$. Second, estimate $\gamma$ by maximizing $h(\hat\beta, \gamma)$. In Amemiya [1978], I showed that this estimator is not asymptotically as efficient as the true ML estimator and obtained its asymptotic variance-covariance matrix. I also indicated how to generalize the estimation method to the case where there are more than two binary dependent variables. In such a case, the computational advantage of this procedure could be considerable compared to ML. Next I will show how to use McFadden's standard GEV in a 2 x 2 multivariate model. In Section 3.C(3), I noted that


McFadden's GEV is useful whenever a set of alternatives can be classified into classes each of which contains similar alternatives. The GEV model is useful in a multivariate situation because the alternatives can be naturally classified according to the outcome of one of the variables. For example, in a 2 x 2 case as in (4.1), the four alternatives can be classified according to whether $y_1 = 1$ or $y_1 = 0$. Using a parameterization similar to (4.9), the standard GEV model defined by (3.44) and (3.45) may be specialized to a 2 x 2 multivariate model as follows:

(4.12)  $P(y_1 = 1) = \dfrac{a_1 e^{\gamma' z_1}\left[\sum_{j=0}^{1}\exp\!\left(\dfrac{\beta' x_{1j}}{1-\sigma_1}\right)\right]^{1-\sigma_1}}{\sum_{i=0}^{1} a_i e^{\gamma' z_i}\left[\sum_{j=0}^{1}\exp\!\left(\dfrac{\beta' x_{ij}}{1-\sigma_i}\right)\right]^{1-\sigma_i}}$,

(4.13)  $P(y_2 = 1 \mid y_1 = 1) = \dfrac{\exp\!\left(\dfrac{\beta' x_{11}}{1-\sigma_1}\right)}{\exp\!\left(\dfrac{\beta' x_{11}}{1-\sigma_1}\right) + \exp\!\left(\dfrac{\beta' x_{10}}{1-\sigma_1}\right)}$,

and

(4.14)  $P(y_2 = 1 \mid y_1 = 0) = \dfrac{\exp\!\left(\dfrac{\beta' x_{01}}{1-\sigma_0}\right)}{\exp\!\left(\dfrac{\beta' x_{01}}{1-\sigma_0}\right) + \exp\!\left(\dfrac{\beta' x_{00}}{1-\sigma_0}\right)}$.

A two-step estimation method analogous to the one defined above can be applied to this model as well.

C. Log-Linear

A log-linear model refers to a particular parameterization of a multivariate model. I will discuss it in the context of the 2 x 2 model (4.1). For the moment, I assume that there are no independent variables and that there is no constraint among the probabilities; therefore, the model is completely characterized by specifying any three of the four probabilities appearing in the table (4.1). I will call (4.1) the basic parameterization and consider two alternative parameterizations.

The first alternative parameterization is called a logit model and is given by

(4.15)
            | $y_2 = 1$             | $y_2 = 0$
  $y_1 = 1$ | $d^{-1}e^{\mu_{11}}$  | $d^{-1}e^{\mu_{10}}$
  $y_1 = 0$ | $d^{-1}e^{\mu_{01}}$  | $d^{-1}e^{\mu_{00}}$

where $d$ is the normalization chosen to make the sum of probabilities equal to unity. (An independent logit model (4.9) is derived from (4.15) by specifying the way the $\mu$'s depend on the independent variables.) The second alternative parameterization is called a log-linear model and is given by

(4.16)
            | $y_2 = 1$                                   | $y_2 = 0$
  $y_1 = 1$ | $d^{-1}e^{\alpha_1 + \alpha_2 + \alpha_{12}}$ | $d^{-1}e^{\alpha_1}$
  $y_1 = 0$ | $d^{-1}e^{\alpha_2}$                         | $d^{-1}$

where $d$, again, is a proper normalization, not necessarily equal to the $d$ in (4.15). The three models, (4.1), (4.15), and (4.16), are equivalent; they differ only in parameterization. Parameterizations (4.15) and (4.16) are especially similar and both have the attractive feature that the conditional probabilities have a simple logistic form. For example, in (4.15), we have

(4.17)  $P(y_1 = 1 \mid y_2 = 1) = \dfrac{e^{\mu_{11}}}{e^{\mu_{11}} + e^{\mu_{01}}} = L(\mu_{11} - \mu_{01})$.

The parameterization (4.16) has an additional attractive feature: $\alpha_{12} = 0$ if and only if $y_1$ and $y_2$ are independent, as can be easily proved. The role of $\alpha_{12}$ can be also seen by the following equation which can be derived from (4.16):

(4.18)  $P(y_1 = 1 \mid y_2) = L(\alpha_1 + \alpha_{12}\,y_2)$.
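The two properties of the log-linear parameterization just stated can be verified numerically. The following Python sketch builds the 2 x 2 table from $(\alpha_1, \alpha_2, \alpha_{12})$, checks the logistic form of the conditional probability in (4.18), and measures the departure from independence. The function names and parameter values are hypothetical.

```python
import math

def logistic(v):
    """Logistic distribution function L(v)."""
    return 1.0 / (1.0 + math.exp(-v))

def loglinear_2x2(a1, a2, a12):
    """Joint probabilities P[(y1, y2)] under the log-linear table (4.16)."""
    weights = {
        (y1, y2): math.exp(a1 * y1 + a2 * y2 + a12 * y1 * y2)
        for y1 in (0, 1) for y2 in (0, 1)
    }
    d = sum(weights.values())  # normalization
    return {cell: w / d for cell, w in weights.items()}

def cond_y1_given_y2(p, y2):
    """P(y1 = 1 | y2) computed directly from the joint table."""
    return p[(1, y2)] / (p[(1, y2)] + p[(0, y2)])

def independence_gap(p):
    """P(y1=1, y2=1) - P(y1=1) P(y2=1); zero exactly under independence."""
    p_y1 = p[(1, 1)] + p[(1, 0)]
    p_y2 = p[(1, 1)] + p[(0, 1)]
    return p[(1, 1)] - p_y1 * p_y2

a1, a2, a12 = 0.3, -0.5, 0.8   # hypothetical parameter values
p = loglinear_2x2(a1, a2, a12)

# (4.18): the conditional probability is logistic in a1 + a12 * y2.
matches_418 = [
    math.isclose(cond_y1_given_y2(p, y2), logistic(a1 + a12 * y2))
    for y2 in (0, 1)
]

gap_dependent = independence_gap(p)
gap_independent = independence_gap(loglinear_2x2(a1, a2, 0.0))
```

With $\alpha_{12} = 0$ the gap vanishes up to rounding error, while a nonzero $\alpha_{12}$ produces a nonzero gap.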


(Note that one obtains two equations from (4.18) by putting $y_2 = 1$ and $y_2 = 0$.) The log-linear parameterization (4.16) may be also defined by

(4.19)  $P(y_1, y_2) \propto \exp(\alpha_1 y_1 + \alpha_2 y_2 + \alpha_{12}\,y_1 y_2)$,

where $\propto$ reads "is proportional to." This formulation can be generalized to a log-linear model of more than two binary random variables as follows (I will only write the case of three variables):

(4.20)  $P(y_1, y_2, y_3) \propto \exp(\alpha_1 y_1 + \alpha_2 y_2 + \alpha_3 y_3 + \alpha_{12}\,y_1 y_2 + \alpha_{13}\,y_1 y_3 + \alpha_{23}\,y_2 y_3 + \alpha_{123}\,y_1 y_2 y_3)$.

The first three terms in the exponential function are called the main effects. Terms involving the product of two variables are called second-order interaction terms, the product of three variables third-order interaction terms, and so on. Note that (4.20) involves seven parameters, which can be put into a one-to-one correspondence with the seven probabilities which completely determine the distribution of $y_1$, $y_2$, and $y_3$. Such a model, without any constraint among the parameters, is called a saturated model. A saturated model for $J$ binary variables involves $2^J - 1$ parameters. Researchers often use a constrained log-linear model, called an unsaturated model, which is obtained by setting some of the higher-order interaction terms to zero. I will illustrate a multivariate log-linear model by the following example:

Example 4.3 (Goodman, 1972): In this paper Goodman wants to explain whether a soldier prefers a Northern camp to a Southern camp ($y_0$) by the race of the soldier ($y_1$), the region of his origin ($y_2$), and the present location of his camp (North or South) ($y_3$). Because of the property that each conditional probability has a logistic form, a log-linear model is especially suitable for analyzing a model of this sort. Because of the nature of the problem, it is useful to write a conditional probability which is a generalization of (4.18). We have

(4.21)  $P(y_0 = 1 \mid y_1, y_2, y_3) = L(\alpha_0 + \alpha_{01}\,y_1 + \alpha_{02}\,y_2 + \alpha_{03}\,y_3 + \alpha_{012}\,y_1 y_2 + \alpha_{013}\,y_1 y_3 + \alpha_{023}\,y_2 y_3 + \alpha_{0123}\,y_1 y_2 y_3)$.

Goodman looks at the asymptotic $t$ value of the ML estimate of each $\alpha$ (the ML estimate divided by its asymptotic standard deviation) and tentatively concludes $\alpha_{012} = \alpha_{013} = \alpha_{0123} = 0$, called the null hypothesis. Then he proceeds to accept the null hypothesis as a result of the following testing procedure: Define $\hat P_t$, $t = 1, 2, \ldots, 16$, as the observed frequencies in the 16 cells created by all the possible joint outcomes of the four binary variables. (They can be interpreted as the unconstrained ML estimators of the probabilities $P_t$.) Define $\hat F_t$ as the constrained ML estimator of $P_t$ under the null hypothesis. Then, one is to reject the null hypothesis if and only if

(4.22)  $n \sum_{t=1}^{16} \dfrac{(\hat P_t - \hat F_t)^2}{\hat P_t} > \chi^2_{3,\alpha}$,

where $n$ is the total number of soldiers in the sample and $\chi^2_{3,\alpha}$ is the $\alpha$% critical value of $\chi^2_3$. (Note that the left-hand side of (4.22) is a special case of (3.13). As mentioned there, the $\hat P_t$ in the denominator of (4.22) may be replaced by $\hat F_t$.) Or, alternatively, one can use

(4.23)  $2n \sum_{t=1}^{16} \hat P_t \log \dfrac{\hat P_t}{\hat F_t} > \chi^2_{3,\alpha}$.
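The two test statistics in (4.22) and (4.23) are simple to compute once the observed and constrained cell frequencies are in hand. The sketch below is illustrative only: the function names are hypothetical, the data are a made-up two-cell example rather than Goodman's 16 cells, and the 5 percent critical value of $\chi^2_3$ (about 7.815) is hard-coded rather than computed.

```python
import math

CHI2_3_CRITICAL_5PCT = 7.815  # upper 5 percent point of chi-square with 3 d.f.

def pearson_stat(n, p_hat, f_hat):
    """Left-hand side of (4.22): n * sum (P_t - F_t)^2 / P_t.

    As written in (4.22) the denominator is the observed frequency, so a
    cell with zero observed frequency would need F_t in the denominator.
    """
    return n * sum((p - f) ** 2 / p for p, f in zip(p_hat, f_hat))

def likelihood_ratio_stat(n, p_hat, f_hat):
    """Left-hand side of (4.23): 2n * sum P_t * log(P_t / F_t).

    Cells with zero observed frequency contribute zero and are skipped.
    """
    return 2.0 * n * sum(
        p * math.log(p / f) for p, f in zip(p_hat, f_hat) if p > 0
    )

# Made-up frequencies for illustration (not Goodman's data).
n = 100
p_hat = [0.5, 0.5]   # observed cell frequencies
f_hat = [0.4, 0.6]   # fitted frequencies under a hypothetical constraint

pearson = pearson_stat(n, p_hat, f_hat)
lr = likelihood_ratio_stat(n, p_hat, f_hat)
reject_by_pearson = pearson > CHI2_3_CRITICAL_5PCT
```

The two statistics are usually close when the constrained model fits reasonably well, which is why either may be used.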

Note that the above model is similar to the model of Example 2.6 (Pencavel, 1979), a dichotomous logit model with binary independent variables.

I will indicate how to generalize a log-linear model to the case of discrete variables which take more than two values. This is done simply by using the binary variables defined in (3.2). There I pointed out that a consideration of these variables can effectively reduce a multi-response model to a dichotomous model. I will illustrate this idea by a simple example. Suppose there are two variables $z$ and $y_3$ such that $z$ takes three values, 0, 1, and 2, and $y_3$ takes two values, 0 and 1. I define two binary (0, 1) variables $y_1$ and $y_2$ by the rule: $y_1 = 1$ if $z = 1$ and $y_2 = 1$ if $z = 2$.

Then I can specify $P(z, y_3)$ by specifying $P(y_1, y_2, y_3)$, which I can specify by a log-linear model as in (4.20). However, one should remember one small detail: since $y_1 y_2 = 0$ by definition in the present case, the two terms involving $y_1 y_2$ in the right-hand side of (4.20) drop out.

In the above discussion I have touched upon only a small aspect of the log-linear model. There is a vast amount of work on this topic in the statistical literature. The interested reader should consult Haberman [1978 and 1979] or Bishop, Fienberg, and Holland [1975] mentioned in Section I, or the many references to Leo Goodman's papers cited therein.

Nerlove and Press [1973] proposed the idea of making the parameters of a log-linear model dependent on independent variables. Specifically, they propose that the main-effect parameters ($\alpha_1$, $\alpha_2$, and $\alpha_3$ in (4.20)) be linear combinations of independent variables. (However, there is no logical necessity to restrict this formulation to the main effects.) Nerlove and Press applied their log-linear model to data on agricultural practices of Filipino farmers. Their study involves four binary dependent variables corresponding to the use or nonuse of the following modern agricultural practices: (1) fertilizers and insecticides, (2) a mechanized method of land preparation, (3) the use of a high-yield variety, and (4) modern methods of planting and weeding. Their independent variables include the following characteristics of a farmer or his farm: age, schooling, land ownership, area, presence of irrigation, etc. They estimated an almost saturated model (only the fourth-order interaction term is set to zero) by ML. Though of considerable theoretical interest, their results were not satisfactory from an empirical standpoint as the estimated coefficients on the first three independent variables were not significant at the 5 percent level.

The aforementioned property of a log-linear model, that each conditional probability has a logit form as in (4.21) (which is shared by multivariate logit models of Section 4.B), leads to the following estimation procedure which is simpler than ML: Maximize the product of the conditional probabilities with respect to the parameters which appear therein. The remaining parameters must be estimated by maximizing the remaining part of the likelihood function. In Amemiya [1975], I gave a sketch of a proof of the consistency of this estimator. One would expect that the estimator is, in general, not asymptotically as efficient as the ML estimator. However, Monte Carlo evidence, as reported by Guilkey and Schmidt [1979], suggests that the loss of efficiency may be minor.

D. Probit

A multivariate probit model was first proposed by Ashford and Sowden [1970] and applied to a bivariate set of data. I will describe their model below. As in Example 2.2, I assume that a coalminer develops breathlessness ($y_1 = 1$) if his tolerance $y_1^*$ is less than 0. Assuming that $y_1^* \sim N(-\beta_1' x, 1)$ where $x = (1, \text{Age})'$, we have

(4.24)  $P(y_1 = 1) = \Phi(\beta_1' x)$.

(Note that I have applied a certain normalization to the model of Example 2.2.) Now, suppose that a coalminer develops wheeze ($y_2 = 1$) if his tolerance $y_2^*$ against wheeze is less than 0 and that $y_2^* \sim N(-\beta_2' x, 1)$. Then we have

(4.25)  $P(y_2 = 1) = \Phi(\beta_2' x)$.

Now that I have specified the marginal probabilities of $y_1$ and $y_2$, the multivariate model is completed by specifying the joint probability $P(y_1 = 1, y_2 = 1)$, which in turn is determined if the joint distribution of $y_1^*$ and $y_2^*$ is specified. Ashford and Sowden assume that $y_1^*$ and $y_2^*$ are jointly normal with a correlation coefficient $\rho$. Thus,

(4.26)  $P(y_1 = 1, y_2 = 1) = F_\rho(\beta_1' x, \beta_2' x)$,

where $F_\rho$ denotes the bivariate normal distribution function with zero means, unit variances, and correlation $\rho$.

It is instructive to note a fundamental difference between the Ashford-Sowden model and the multivariate logit or log-linear models discussed in Sections 4.B and C: in the former, marginal probabilities are first specified and then a joint probability consistent with the given marginal probabilities is found, whereas in the latter, joint probabilities or conditional probabilities are specified at the outset. The consequence of these different methods of specification is that marginal probabilities have a simple form (probit) in the former, and conditional probabilities have a simple form (logit) in the latter.

The reader might wonder whether one could specify an Ashford-Sowden type bivariate logit model by assuming the logistic distribution for $y_1^*$ and $y_2^*$ in the above example. The answer to this interesting question is unfortunately in the negative since, as Gumbel [1961] has shown, there is no bivariate distribution with an unconstrained correlation coefficient for which the marginal distributions are logistic.²⁴ Thus, the similarity between the normal distribution and the logistic distribution in the univariate case does not carry over to the multivariate case.

²⁴ Gumbel found two types of bivariate logistic distributions; but for one of them $\rho = 1/2$ and for the other $|\rho| \le 0.304$.

Because of the fundamental difference between a multivariate probit model and a multivariate logit model mentioned above, it is an important practical problem for a researcher to compare the two types of models using some criterion of goodness of fit. Morimune [1979] compared the Ashford-Sowden bivariate probit model with the Nerlove-Press log-linear model empirically in a model where the two binary dependent variables represent home ownership ($y_1$) and whether or not the house has more than five rooms ($y_2$). As criteria for comparison, Morimune used Cox's test mentioned earlier and his own modification of it. He concluded that probit was preferred to logit by either test.

V. Conclusions

My major aim in writing this survey has been to present a critical evaluation of a large number of multi-response and multivariate QR models, which have been proposed in the biometric and econometric literature, in the hope that this survey will be a useful guide to the reader who must choose an appropriate QR model for explaining his data. The same set of discrete data can be analyzed by many different QR models, as I have indicated in some of the empirical examples discussed in the paper. Thus, the choice of model is critically important. The choice must be made both from an economic-theoretic and a statistical point of view. I have tried to address both issues to a certain extent. A rather long discussion of scalar criteria in Section 2.D(3) offers a statistical point of view.

I have devoted approximately half of this paper to the discussion of univariate dichotomous models. I have done so not only because these models are important on their own but also because this discussion provides a necessary introduction to


the discussion of the multi-response and multivariate models which follow.

Economists deal with even more complex models involving discrete variables than the ones I have discussed in this paper. An example is the model with both continuous and discrete random variables, which is now quite extensively used in empirical studies. For example, see Heckman [1974] (the joint determination of whether or not a person works and hours of work) and Duncan [1980] (the joint determination of the firm location and the input-output vectors). These models are related not only to QR models but also to the so-called Tobit model (Tobin, 1958), but, then, the Tobit model is a special case of the disequilibrium model (Fair and Jaffee, 1972). A discussion of all these will require another long survey.

Appendix

Computer Programs for Qualitative Response Models

Software available from:
Bronwyn H. Hall
204 Junipero Serra Boulevard
Stanford, California 94305

1. PROBIT: Dichotomous probit model with one or more independent variables. This program is entirely in Fortran IV and has space for up to 59 variables and an unlimited number of observations, since the data are kept on an external file.

2. MLOGIT: This program estimates the multi-response logit model. Both parameters and variables are allowed to vary with alternatives. The maximum number of parameters in the current version is 100, and the number of observations is not limited. The method of estimation is maximum likelihood. This program is also entirely in Fortran IV.

3. MAXLIK: A general-purpose program for estimation of statistical models by maximum likelihood using the algorithm of Berndt, Hall, Hall, and Hausman [1975]. This program requires the user to program the likelihood function and its first derivatives, but can be used for many nonstandard models for which canned packages do not exist. MAXLIK is also in machine-independent Fortran IV.

Other Software 4. QUAIL: QUAIL (QUAlitative, Intermittant, and Limited Dependent Variable Statistical Program) is a large package for the analysis of non-continuousdependent variables.It contains statistical procedures for multi-response independent logit. The Standard GEV (or, nested logit) models can be estimated by the two-step procedure explained in Section 3.C(3). Its capacity to retrieve and transform previous results, and to use macros to define complex procedures, makes two-step estimation easier to execute. The program is written in Fortran IV and runs on IBM or CDC computers. QUAIL has a data-paging procedure which allows an unlimited number of observations. This is available from Professor David Brownstone Department of Economics Princeton University Princeton, New Jersey 08540 5. SAS:This large statistical package (for IBM computers only) contains several procedures for the analysisof multi-way contingency table data and some programs in the supplemental library specifically for qualitative dependent variables. The SAS package may be obtained from SAS Institute, Inc. P.O. Box 10066 Raleigh, N.C. 27605 REFERENCES

"Polychotomous Quantal Response by Maximum Indicant," Biometrika,Aug. 1970, pp. 253-62. AKAIKE, H. "InformationTheory and an Extension of the MaximumLikelihood Principle,"in Second international symposium on information theory. Edited by B. N. PETROV AND F. CSASKI. Budapest: AkademiaiKiado, 1973. AKIN, J. S.; GUILKEY, D. K. AND SICKLES, R. "A Random CoefficientProbit Model with an Application to a Study of Migration,"J. Econometrics,Oct.Dec. 1979, pp. 233-46. ALBRIGHT, R. L.; LERMAN, S. R. AND MANSKI, C. F. "Report on the Development of an Estimation Program for the MultinomialProbit Model." Mimeographed. Prepared for the Federal Highway Administration,October 1977. To be republished as Lerman, S. R. and Manski, C. F. "On the Use of Simulated Frequencies to Approximate Choice Probabilities,"in Structuralanalysis of discrete data. Edited by C. F. MANSKI AND D. McFADDEN. Cambridge, Mass.:MIT Press, forthcoming 1981. AITCHISON, J. AND BENNETT, J.


1533

Amemiya: Qualitative Response Models T. "QualitativeResponse Models," Ann. Econ. Soc. Measure.,Summer 1975, pp. 363-72.

AMEMIYA,

. ""The Maximum Likelihood, the Minimum

Chi-Square and the Nonlinear Weighted LeastSquares Estimator in the General Qualitative Response Model,"J Amer. Statist. Assoc.,June 1976, pp. 347-51. . "On a Two-Step Estimation of a Multivariate

Logit Model," J. Econometrics, Aug. 1978, pp. 13-21.
_____. "Selection of Regressors," Int. Econ. Rev., June 1980, pp. 331-54.
_____ AND NOLD, F. C. "A Modified Logit Model," Rev. Econ. Statist., May 1975, pp. 255-57.
_____ AND POWELL, J. L. "A Comparison of the Logit Model and Normal Discriminant Analysis when the Independent Variables are Binary." Technical Report No. 320, Institute for Mathematical Studies in the Social Sciences, Stanford University, Aug. 1980.
ANDERSON, T. W. An introduction to multivariate statistical analysis. New York: Wiley, 1958.
ASHFORD, J. R. AND SOWDEN, R. R. "Multivariate Probit Analysis," Biometrics, Sept. 1970, pp. 535-46.
BARTEL, A. "The Migration Decision: What Role Does Job Mobility Play?" Amer. Econ. Rev., Dec. 1979, pp. 775-86.
BEN-PORATH, Y. "Fertility Response to Child Mortality: Micro Data from Israel," J. Polit. Econ., Aug. 1976, pp. S163-78.
BERKSON, J. "Why I Prefer Logits to Probits," Biometrics, Dec. 1951, pp. 327-59.
_____. "Tables for Use in Estimating the Normal Distribution Function by Normit Analysis," Biometrika, 1957, 44(2), pp. 411-35.
BERNDT, E. K.; HALL, B. H.; HALL, R. E. et al. "Estimation and Inference in Nonlinear Structural Models," Ann. Econ. Soc. Measure., Oct. 1974, pp. 653-66.
BETANCOURT, R. R. AND CLAGUE, C. K. "An Econometric Analysis of Capital Utilization," Int. Econ. Rev., Feb. 1978, pp. 211-28.
BISHOP, Y. M. M., FIENBERG, S. E. AND HOLLAND, P. W. Discrete multivariate analysis, theory and practice. Cambridge, Mass.: MIT Press, 1975.
BLOMQUIST, G. "Value of Life Saving: Implications of Consumption Activity," J. Polit. Econ., June 1979, pp. 540-58.
BOSKIN, M. J. "A Conditional Logit Model of Occupational Choice," J. Polit. Econ., Mar.-Apr. 1974, pp. 389-98.
BUSE, A. "Goodness of Fit in Generalized Least Squares Estimation," Amer. Statist., June 1973, pp. 106-08.
CHAMBERLAIN, G. "Analysis of Covariance with Qualitative Data," Rev. Econ. Stud., Jan. 1980, pp. 225-38.
CHAMBERS, E. A. AND COX, D. R. "Discrimination Between Alternative Binary Response Models," Biometrika, 1967, 54(3-4), pp. 573-78.
CLARK, C. "The Greatest of a Finite Set of Random Variables," Operations Res., Mar.-Apr. 1961, pp. 145-62.
COPENHAVER, T. W. AND MIELKE, P. W. "Quantit Analysis: A Quantal Assay Refinement," Biometrics, Mar. 1977, pp. 175-86.

COSSLETT, S. R. "Efficient Estimation of Discrete Choice Models from Choice-Based Samples." Mimeographed. Workshop in Transportation Economics, University of California, Berkeley, Aug. 1978.
_____. "Distribution-Free Maximum-Likelihood Estimator of the Binary Choice Model." Mimeographed. Department of Economics, Northwestern University, March 1981 (revised).
COX, D. R. "Tests of Separate Families of Hypotheses," in Proceedings of the fourth Berkeley symposium on mathematical statistics and probability. Vol. 1. Edited by J. NEYMAN. Berkeley, Calif.: University of California Press, 1961, pp. 105-23.
_____. "Further Results on Tests of Separate Families of Hypotheses," J. Roy. Statist. Soc., Series B, 1962, 24(2), pp. 406-24.
_____. "Some Procedures Connected with the Logistic Qualitative Response Curve," in Research papers in statistics. Edited by F. N. DAVID. New York: Wiley, 1966, pp. 55-71.
_____. The analysis of binary data. London: Methuen, 1970.
CRAGG, J. G. AND UHLER, R. S. "The Demand for Automobiles," Can. J. Econ., Aug. 1970, pp. 386-406.
DA VANZO, J. "Does Unemployment Affect Migration?-Evidence from Micro Data," Rev. Econ. Statist., Nov. 1978, pp. 504-14.
DAGANZO, C. Multinomial Probit. New York: Academic Press, 1979.
DAVID, J. M. AND LEGG, W. E. "An Application of Multivariate Probit Analysis to the Demand for Housing: A Contribution to the Improvement of the Predictive Performance of Demand Theory, Preliminary Results," Amer. Statist. Assoc. Proceedings of the Bus. and Econ. Statist. Section, Aug. 1975, pp. 295-300.
DEACON, R. AND SHAPIRO, P. "Private Preference for Collective Goods Revealed Through Voting and Referenda," Amer. Econ. Rev., Dec. 1975, pp. 943-55.
DEBREU, G. "Review of 'Individual Choice Behavior' by R. Luce," Amer. Econ. Rev., March 1960, pp. 186-88.
DHRYMES, P. Introductory econometrics. New York: Springer-Verlag, 1978.
DOMENCICH, T. A. AND McFADDEN, D. Urban travel demand. Amsterdam: North-Holland, 1975.
DUBIN, J. A. AND McFADDEN, D. "An Econometric Analysis of Residential Electric Appliance Holdings and Consumption," Mimeographed. Feb. 1980.
DUDLEY, L. AND MONTMARQUETTE, C. "A Model of the Supply of Bilateral Foreign Aid," Amer. Econ. Rev., Mar. 1976, pp. 132-42.
DUNCAN, G. M. "Formulation and Statistical Analysis of the Mixed, Continuous/Discrete Dependent Variable Model in Classical Production Theory," Econometrica, May 1980, pp. 839-52.
_____ AND STAFFORD, F. "Do Union Members Receive Compensating Wage Differentials?" Amer. Econ. Rev., June 1980, pp. 355-71.
EFRON, B. "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," J. Amer. Statist. Assoc., Dec. 1975, pp. 892-98.
_____. "Regression and ANOVA with Zero-One Data: Measures of Residual Variation," J. Amer. Statist. Assoc., Mar. 1978, pp. 113-21.
FAIR, R. C. "The Effect of Economic Events on Votes for President," Rev. Econ. Statist., May 1978, pp. 159-73.
_____ AND JAFFEE, D. M. "Methods of Estimation for Markets in Disequilibrium," Econometrica, May 1972, pp. 497-514.
FIELDS, G. S. "Place to Place Migration: Some New Evidence," Rev. Econ. Statist., Feb. 1979, pp. 21-32.
FIGLEWSKI, S. "Subjective Information and Market Efficiency in a Betting Market," J. Polit. Econ., Feb. 1979, pp. 75-88.
FINNEY, D. J. Probit analysis. Third edition. Cambridge: University Press, 1971.
FISHER, S. AND MODIGLIANI, F. "Aspects of the Costs of Inflation." Mimeographed. Nov. 1977.
GOLDBERG, I., AND NOLD, F. C. "Does Reporting Deter Burglars?-An Empirical Analysis of Risk and Return in Crime," Rev. Econ. Statist., Aug. 1980, pp. 424-31.
GOLDBERGER, A. S. Econometric theory. New York: Wiley, 1964.
_____. "Correlations Between Binary Outcomes and Probabilistic Predictions," J. Amer. Statist. Assoc., Mar. 1973, p. 84.
GOLDFELD, S. M. AND QUANDT, R. E. Nonlinear
methods in econometrics. Amsterdam: North-Holland, 1972.
GOODMAN, L. A. "A Modified Multiple Regression Approach to the Analysis of Dichotomous Variables," Amer. Sociological Rev., Feb. 1972, pp. 28-46.
GRONAU, R. "The Allocation of Time of Israeli Women," J. Polit. Econ., Aug. 1976, pp. S201-20.
GUILKEY, D. K. AND SCHMIDT, P. "Some Small Sample Properties of Estimators and Test Statistics in the Multivariate Logit Model," J. Econometrics, April 1979, pp. 33-42.
GUMBEL, E. J. "Bivariate Logistic Distributions," J. Amer. Statist. Assoc., June 1961, pp. 335-49.
GUNDERSON, M. "Retention of Trainees: A Study with Dichotomous Dependent Variables," J. Econometrics, May 1974, pp. 79-93.
GURLAND, J.; LEE, I. AND DAHM, P. A. "Polychotomous Quantal Response in Biological Assay," Biometrics, Sept. 1960, pp. 382-98.
HABERMAN, S. J. The analysis of frequency data. Chicago: University of Chicago Press, 1974.
_____. Analysis of qualitative data. Vol. I. Introductory topics. New York: Academic Press, 1978.
_____. Analysis of qualitative data. Vol. II. New developments. New York: Academic Press, 1979.
HAUSMAN, J. A. "Specification Tests in Econometrics," Econometrica, Nov. 1978, pp. 1251-72.
_____. "Individual Discount Rates and the Purchase and Utilization of Energy-Using Durables," Bell J. Econ. Manage. Sci., Spring 1979, pp. 33-54.
_____ AND McFADDEN, D. "A Specification Test for the Independence of Irrelevant Alternatives in Logit Models." Unpublished paper given at the European Workshop on Discrete Choice Models, 1981.
_____ AND WISE, D. A. "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences," Econometrica, Mar. 1978, pp. 403-26.
HECKMAN, J. J. "Shadow Prices, Market Wages, and Labor Supply," Econometrica, July 1974, pp. 679-94.
_____. "Simultaneous Equations Models with Continuous and Discrete Endogenous Variables and Structural Shifts," in Studies in nonlinear estimation. Edited by S. M. GOLDFELD AND R. E. QUANDT. Cambridge, Mass.: Ballinger, 1976, pp. 235-72.
_____. "Dummy Endogenous Variables in a Simultaneous Equation System," Econometrica, July 1978, pp. 931-60.
_____. "Statistical Models for Discrete Panel Data," in Structural analysis of discrete data. Edited by C. F. MANSKI AND D. McFADDEN. Cambridge, Mass.: MIT Press, forthcoming 1981.
_____ AND WILLIS, R. J. "Estimation of a Stochastic Model of Reproduction: An Econometric Approach," in Household production and consumption. Edited by N. E. TERLECKYJ. New York: NBER, 1975, pp. 99-138.
_____ AND _____. "A Beta-Logistic Model for the Analysis of Sequential Labor Force Participation by Married Women," J. Polit. Econ., Feb. 1977, pp. 27-58.
HILL, C. R. "Capacities, Opportunities and Educational Investments: The Case of the High School Dropout," Rev. Econ. Statist., Feb. 1979, pp. 9-20.
HOEL, P. G. Introduction to mathematical statistics. Fourth edition. New York: Wiley, 1971.
HOROWITZ, J. "A Note on the Accuracy of the Clark Approximation for the Multinomial Probit Model." Mimeographed. MIT, 1979.
_____. "The Accuracy of the Multinomial Logit Model as an Approximation to the Multinomial Probit Model of Travel Demand," Transp. Res., Series B, forthcoming 1981.
HUGHES, G. A. "On the Estimation of Migration Equations." Mimeographed. Faculty of Economics, Cambridge University, 1980.
HUTCHENS, R. M. "Welfare, Remarriage, and Marital Search," Amer. Econ. Rev., June 1979, pp. 369-79.
JOHNSON, N. L. AND KOTZ, S. Continuous univariate distributions. Vol. 1. Boston: Houghton Mifflin, 1970a.
_____. Continuous univariate distributions. Vol. 2. Boston: Houghton Mifflin, 1970b.
_____. Distributions in statistics: Continuous multivariate distributions. New York: Wiley, 1972.
JUDGE, G. G.; GRIFFITHS, W. E.; HILL, R. C. et al. The theory and practice of econometrics. New York: Wiley, 1980.
KAHN, L. M. AND MORIMUNE, K. "Unions and Employment Stability: A Sequential Logit Approach," Int. Econ. Rev., Feb. 1979, pp. 217-36.
KAU, J. B. AND RUBIN, P. H. "Voting on Minimum Wages: A Time Series Approach," J. Polit. Econ., Apr. 1978, pp. 337-42.
KOHN, M. G., MANSKI, C. F. AND MUNDEL, D. S. "An Empirical Investigation of Factors which Influence College-Going Behavior," Ann. Econ. Soc. Measure., Fall 1976, pp. 391-420.
LACHENBRUCH, P. A., SNEERINGER, C. AND REVO, L. T. "Robustness of the Linear and Quadratic Discriminant Function to Certain Types of Non-normality," Comm. in Statist., 1973, 1(1), pp. 39-56.
LANCASTER, K. J. "A New Approach to Consumer Theory," J. Polit. Econ., April 1966, pp. 132-57.
LAVE, C. A. "The Demand for Urban Mass Transportation," Rev. Econ. Statist., Aug. 1970, pp. 320-23.
LAZEAR, E. P. "Why Is There Mandatory Retirement?," J. Polit. Econ., Dec. 1979, pp. 1261-84.
LEE, L. F. "Estimation of a Modal Choice Model for the Work Journey with Incomplete Observations." Mimeographed. Dept. of Economics, University of Minnesota, 1977.
_____. "Unionism and Wage Rates: A Simultaneous
Equations Model with Qualitative and Limited Dependent Variables," Int. Econ. Rev., June 1978, pp. 415-34.
LEE, M. L. "An Analysis of Installment Borrowing by Durable Goods Buyers," Econometrica, Oct. 1962, pp. 770-87.
LI, M. M. "A Logit Model of Homeownership," Econometrica, July 1977, pp. 1081-98.
LONG, J. E. AND JONES, E. B. "Labor Force Entry and Exit by Married Women: A Longitudinal Analysis," Rev. Econ. Statist., Feb. 1980, pp. 1-6.
LUCE, R. D. AND SUPPES, P. "Preference, Utility, and Subjective Probability," in Handbook of mathematical psychology. Vol. III. Edited by R. D. LUCE, R. R. BUSH, AND E. GALANTER. New York: Wiley, 1963, pp. 249-410.
MADDALA, G. S. Econometrics. New York: McGraw-Hill, 1977.
MANSKI, C. F. "Maximum Score Estimation of the Stochastic Utility Model of Choice," J. Econometrics, Aug. 1975, pp. 205-28.
_____ AND LERMAN, S. R. "The Estimation of Choice Probabilities from Choice-Based Samples," Econometrica, Nov. 1977, pp. 1977-88.
_____ AND McFADDEN, D. "Alternative Estimators and Sample Designs for Discrete Choice Analysis," in Structural analysis of discrete data. Edited by C. F. MANSKI AND D. McFADDEN. Cambridge, Mass.: MIT Press, forthcoming 1981a.
_____ AND McFADDEN, D., eds. Structural analysis of discrete data. Cambridge, Mass.: MIT Press, forthcoming 1981b.
MARSCHAK, J. "Binary Choice Constraints on Random Utility Indicators," in Stanford symposium on mathematical methods in the social sciences. Edited by K. ARROW. Stanford, Calif.: Stanford University Press, 1960, pp. 312-39.
McFADDEN, D. "Conditional Logit Analysis of Qualitative Choice Behavior," in Frontiers in econometrics. Edited by P. ZAREMBKA. New York: Academic Press, 1974, pp. 105-42.
_____. "Quantal Choice Analysis: A Survey," Ann. Econ. Soc. Measure., Fall 1976a, pp. 363-90.
_____. "A Comment on Discriminant Analysis 'versus' Logit Analysis," Ann. Econ. Soc. Measure., Fall 1976b, pp. 511-24.
_____. "The Revealed Preferences of a Government Bureaucracy: Empirical Evidence," Bell J. Econ. Manage. Sci., Spring 1976c, pp. 55-72.
_____. "Quantitative Methods for Analyzing Travel Behavior of Individuals: Some Recent Developments." Cowles Foundation Discussion Paper No. 474, Nov. 1977. Published in Behavioral travel modeling. Edited by D. HENSHER AND P. STOPHER. London: Croom-Helm, 1979.
_____. "Modeling the Choice of Residential Location," in Spatial interaction theory and residential location. Edited by A. KARLQVIST, et al. Amsterdam: North-Holland, 1978, pp. 75-96.
_____. "Econometric Models of Probabilistic Choice," in Structural analysis of discrete data. Edited by C. F. MANSKI AND D. McFADDEN. Cambridge, Mass.: MIT Press, forthcoming 1981.
_____ AND REID, F. "Aggregate Travel Demand Forecasting from Disaggregated Behavioral Models," National Academy of Science, National Research Council, Transportation Research Board, Record, No. 534, Washington, D.C., 1975.
McGILLIVRAY, R. G. "Binary Choice of Urban Transport Mode in the San Francisco Bay Region," Econometrica, Sept. 1972, pp. 827-48.
MEDOFF, J. L. "Layoffs and Alternatives Under Trade Unions in U.S. Manufacturing," Amer. Econ. Rev., June 1979, pp. 380-95.
MOORE, W. J.; NEWMAN, R. J. AND THOMAS, R. W. "Determinants of the Passage of Right-to-Work Laws: An Alternative Interpretation," J. Law Econ., Apr. 1974, pp. 197-211.
MORIMUNE, K. "Comparisons of Normal and Logistic Models in the Bivariate Dichotomous Analysis," Econometrica, July 1979, pp. 957-76.
MORRISON, D. G. "Upper Bounds for Correlations Between Binary Outcomes and Probabilistic Predictions," J. Amer. Statist. Assoc., Mar. 1972, pp. 68-70.
NERLOVE, M. AND PRESS, S. J. "Univariate and Multivariate Log-Linear and Logistic Models." Mimeographed. No. R-1306-EDA/NIH, Rand Corp., Santa Monica, 1973.
NICKELL, S. "Education and Lifetime Patterns of Unemployment," J. Polit. Econ., Oct. 1979, pp. S117-31.
OKSANEN, E. H. AND WILLIAMS, J. R. "International Cost Differences-A Comparison of Canadian and United States Manufacturing Industries," Rev. Econ. Statist., Feb. 1978, pp. 96-101.
OSTEN, S. "Industrial Search for New Locations: An Empirical Analysis," Rev. Econ. Statist., May 1979, pp. 288-92.
PARKS, R. W. "Determinants of Scrapping Rate for Postwar Vintage Automobiles," Econometrica, July 1977, pp. 1099-1116.


_____. "On the Estimation of Multinomial Logit Models from Relative Frequency Data," J. Econometrics, Aug. 1980, pp. 293-304.
PARSONS, D. O. "The Decline in Male Labor Force Participation," J. Polit. Econ., Feb. 1980a, pp. 117-34.
_____. "Racial Trends in Male Labor Force Participation," Amer. Econ. Rev., Dec. 1980b, pp. 911-20.
PENCAVEL, J. H. "Market Work Decisions and Unemployment of Husbands and Wives in the Seattle and Denver Income Maintenance Experiments." Mimeographed. April, 1979.
PERLOFF, J. M. AND WACHTER, M. L. "The New Jobs Tax Credit: An Evaluation of the 1977-78 Wage Subsidy Program," Amer. Econ. Rev., May 1979, pp. 173-79.
PINDYCK, R. S. AND RUBINFELD, D. L. Econometric models and economic forecasts. Second edition. New York: McGraw-Hill, 1981.
PLACKETT, R. L. The analysis of categorical data. London: Griffin, 1974.
POWERS, J. A.; MARSH, L. C.; HUCKFELDT, R. R. et al. "A Comparison of Logit, Probit and Discriminant Analysis in Predicting Family Size," Amer. Statist. Assoc. Proceedings of the Soc. Statist. Section, Aug. 1978, pp. 693-97.
PRENTICE, R. L. "A Generalization of the Probit and Logit Methods for Dose Response Curves," Biometrics, Dec. 1976, pp. 761-68.
PRESS, S. J. AND WILSON, S. "Choosing Between Logistic Regression and Discriminant Analysis," J. Amer. Statist. Assoc., Dec. 1978, pp. 699-705.
QUANDT, R. E. "Estimation of Modal Splits," Transp. Res., 1968, pp. 41-50.
_____ AND BAUMOL, W. J. "The Demand for Abstract Transport Modes: Theory and Measurement," J. Reg. Sci., Winter 1966, pp. 13-26.
RADNER, R. AND MILLER, L. S. "Demand and Supply in U.S. Higher Education: A Progress Report," Amer. Econ. Rev., Papers and Proceedings, May 1970, pp. 326-34.
RAO, C. R. Linear statistical inference and its applications. Second edition. New York: Wiley, 1973.
ROSEN, H. S. AND ROSEN, K. T. "Federal Taxes and Homeownership: Evidence from Time Series," J. Polit. Econ., Feb. 1980, pp. 59-75.
SCHILLER, B. R. AND WEISS, R. D. "The Impact of Private Pensions on Firm Attachment," Rev. Econ. Statist., Aug. 1979, pp. 369-80.
SCHMIDT, P. AND STRAUSS, R. P. "The Prediction of Occupation Using Multiple Logit Models," Int. Econ. Rev., June 1975, pp. 471-86.
_____. "The Effects of Unions on Earnings and Earnings on Unions: A Mixed Logit Analysis," Int. Econ. Rev., Feb. 1976, pp. 204-12.
SILBERMAN, J. I. AND DURDEN, G. C. "Determining Legislative Preferences on the Minimum Wage: An Economic Approach," J. Polit. Econ., Apr. 1976, pp. 317-29.
_____ AND TALLEY, W. K. "N-Chotomous Dependent Variables: An Application to Regulatory Decision Making," Amer. Statist. Assoc. Proceedings of the Bus. Econ. Statist. Section, Aug. 1974, pp. 573-76.
SMITH, J. P. "The Distribution of Family Earnings," J. Polit. Econ., Oct. 1979, pp. S163-92.
TALVETIE, A. "Comparison of Probabilistic Modal-Choice Models: Estimation Methods and System Inputs," National Academy of Science, National Research Council, Highway Research Board, Record, No. 392, 1972, pp. 111-20.
THEIL, H. Principles of econometrics. New York: Wiley, 1971.
THURSTONE, L. "A Law of Comparative Judgement," Psych. Rev., 1927, 34, pp. 273-86.
TOBIN, J. "The Estimation of Relationships for Limited Dependent Variables," Econometrica, Jan. 1958, pp. 24-36.
TOLLEFSON, J. O. AND PICHLER, J. A. "A Comment on 'Right-to-Work' Laws: A Suggested Economic Rationale," J. Law Econ., Apr. 1974, pp. 193-96.
UHLER, R. S. "The Demand for Housing: An Inverse Probability Approach," Rev. Econ. Statist., Feb. 1968, pp. 129-34.
VAN MONTFORT, M. A. J. AND OTTEN, A. "Quantal Response Analysis: Enlargement of the Logistic Model with a Kurtosis Parameter," Biometrische Zeitschrift, 1976, 18(5), pp. 371-80.
WARNER, S. L. Stochastic choice of mode in urban travel-A study in binary choice. Evanston, Ill.: Northwestern University Press, 1962.
_____. "Multivariate Regression of Dummy Variates Under Normality Assumptions," J. Amer. Statist. Assoc., Dec. 1963, pp. 1054-63.
WARREN, R. S. AND STRAUSS, R. P. "A Mixed Logit Model of the Relationship Between Unionization and Right-to-Work Legislation," J. Polit. Econ., June 1979, pp. 648-55.
WATSON, P. L. AND WESTIN, R. B. "Transferability of Disaggregate Mode Choice Models," Reg. Sci. and Urban Econ., May 1975, pp. 227-49.
WESTIN, R. B. "Predictions from Binary Choice Models," J. Econometrics, May 1974, pp. 1-16.
WILENSKY, G. R. AND ROSSITER, L. F. "OLS and Logit Estimation in a Physician Location Study," Amer. Statist. Assoc. Proceedings of the Soc. Statist. Section, Aug. 1978, pp. 260-65.
WILLIS, R. J. AND ROSEN, S. "Education and Self-Selection," J. Polit. Econ., Oct. 1979, pp. S7-36.
WITTE, A. D. AND SCHMIDT, P. "An Analysis of the Type of Criminal Activity Using the Logit Model," J. Res. in Crime and Delinquency, Jan. 1979, pp. 164-79.
WU, D. M. "An Empirical Analysis of Household Durable Goods Expenditure," Econometrica, Oct. 1965, pp. 761-80.

