Confirmatory Factor Analysis Using Stata 12.1

Three Main Points:
1. The very basics of Stata CFA/SEM syntax
2. One Factor CFA
3. Two Factor CFA
To begin, we should start on a good note… There is, in my opinion, really good news: In terms of conducting most analyses, the syntax for CFA/SEM in Stata 12.1 is far, far simpler than that of LISREL. (Note: you cannot use earlier versions of Stata for SEM; Stata 12.1 is the first version that includes this method.) Now take your seat, buckle up, and get ready for another ride on the nerd bus.
Basic CFA/SEM Syntax Using Stata: Syntax Basics
The most basic language is that which specifies the relationship between the latent constructs and the observed variables. You do this simply by specifying a latent construct (using CAPITAL letters) that is comprised of observed variables that are in the data set. For example:

sem (DEPRESS -> x1 x2 x3)
or
sem (x1 x2 x3 <- MONKEY)
Please note that the order between the latent construct and the observed variables does not matter, and it does not matter what you name the latent construct (Note: while you can have fun with absurd names, it is best to make them short and sweet). However, you do need to be careful to have the "arrow" – which is simply comprised of a hyphen ("-") and a greater than/less than sign (">" or "<") – pointed in the right direction: away from the latent construct.
Another basic piece of syntax is the "cov()" option. This allows you to specify that the error terms for particular observed variables be allowed to covary. For example:

sem (DEPRESS -> x1 x2 x3), cov(e.x1*e.x2 e.x1*e.x3)
or
sem (DEPRESS -> x1 x2 x3), cov(e.x1*e.x2) cov(e.x1*e.x3)
Again, the syntax here is relatively flexible, as there are several ways to specify that the error terms of certain observed variables be allowed to correlate. These are the most basic concepts of Stata syntax for CFA/SEM. With these, we easily have enough information to look at an example and – in doing so – to become familiar with some more CFA/SEM syntax.
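Putting these pieces together, a minimal one-factor sketch might look like the lines below. The latent construct DEPRESS and the indicators x1–x4 are just the hypothetical names from the examples above (a fourth indicator is added so the model stays identified once an error covariance is included), and the stand option, which requests standardized coefficients, shows up again in the examples that follow:

* Hypothetical one-factor CFA: four indicators of DEPRESS,
* allowing the errors of x1 and x2 to covary,
* with results reported as standardized coefficients
sem (DEPRESS -> x1 x2 x3 x4), cov(e.x1*e.x2) stand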
One Factor CFA
The data for this example – indeed, for all the examples in this session – come from the National Survey of Drug Use and Health (NSDUH), a really interesting, nationally representative, cross-sectional survey of substance use among individuals ages 12 and up. This particular data set is a subset of the NSDUH comprised of truant adolescent males between the ages of 12 and 17 (M age = 15.3, SD = 0.057). As these data are coded, roughly three-fifths are White Non-Hispanic (57.71%) while the remaining two-fifths are youth of color (African-American/Hispanic) (42.29%). Let's open up the data set and then use the "set more off" command so that Stata does not drive us crazy by requiring us to click "more" every 5 seconds:

use "L:\TranCFA_red.dta"
set more off
We can start with an example of confirmatory factor analysis using five variables related to adolescent academic involvement:

s_felt     Felt about going to school
s_work     Frequency of school feeling meaningful
s_imp      How important things learned will be in future
s_int      Courses interesting
s_job      Teacher let you know you did a good job
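Before fitting the CFA, it can be useful to take a quick look at the items themselves. A minimal sketch, assuming the data set above is in memory (summarize and tab1 are standard Stata commands):

* Descriptive statistics and one-way tabulations for the five items
summarize s_felt s_work s_imp s_int s_job
tab1 s_felt s_work s_imp s_int s_job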
Just so we can visualize it, this is the path diagram for the proposed model:
[Path diagram: latent factor Academic Involvement with observed indicators Felt About School, School Meaningful, Important for Future, Courses Interesting, and Teacher: Good Job.]

Now that we have our proposed confirmatory factor model, let's write up the syntax and see what we get. I am going to use the "stand" option so that we get standardized coefficients:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job), stand
Endogenous variables
  Measurement:  s_felt s_work s_imp s_int s_job

Exogenous variables
  Latent:       INVOLVE

Fitting target model:
Iteration 0:   log likelihood = -4225.2004
Iteration 1:   log likelihood =   -4224.09
Iteration 2:   log likelihood = -4224.0826
Iteration 3:   log likelihood = -4224.0826

Structural equation model                       Number of obs      =       725
Estimation method  = ml
Log likelihood     = -4224.0826

 ( 1)  [s_felt]INVOLVE = 1
This first part of the output tells us a few things (see above). First, it identifies our five observed variables as "measurement" variables that contribute to the identification of the latent variable we have named INVOLVE. It also shows us how many iterations – using maximum likelihood – it took for Stata to fit the target model (3 is good! After all, this is a relatively simple model…). It also tells us our number of observations (725), the log likelihood value for the model, and that the factor loading for s_felt was fixed to 1.0. Next, we can look at the rest of the standardized output:
                                  OIM
Standardized       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Measurement
  s_felt <-
    INVOLVE     .6419818   .0271475    23.65   0.000     .5887737    .6951898
    _cons       3.101211   .0895102    34.65   0.000     2.925775    3.276648
  s_work <-
    INVOLVE     .6795142   .0269338    25.23   0.000     .6267248    .7323036
    _cons       3.335998   .0951546    35.06   0.000     3.149498    3.522498
  s_imp <-
    INVOLVE     .6139033   .0288177    21.30   0.000     .5574217    .6703849
    _cons       3.497442   .0990719    35.30   0.000     3.303265     3.69162
  s_int <-
    INVOLVE     .8010629   .0225112    35.59   0.000     .7569417     .845184
    _cons       3.039006   .0880265    34.52   0.000     2.866477    3.211534
  s_job <-
    INVOLVE     .4942271   .0326628    15.13   0.000     .4302092     .558245
    _cons        3.44226   .0977299    35.22   0.000     3.250713    3.633808

Variance
  e.s_felt      .5878594   .0348563                      .5233625    .6603046
  e.s_work      .5382605   .0366039                      .4710938    .6150035
  e.s_imp       .6231228   .0353825                       .557494    .6964774
  e.s_int       .3582982   .0360658                      .2941467    .4364408
  e.s_job       .7557396   .0322857                      .6950376     .821743
  INVOLVE              1          .                             .           .

LR test of model vs. saturated: chi2(5) =  35.30, Prob > chi2 = 0.0000
This output gives us the standardized factor loading values for each of the five observed variables, as well as their standard errors, significance, and confidence intervals. For example, the standardized factor loading for s_felt onto the latent construct INVOLVE was 0.64 with a standard error of 0.027. It was significant at p < .001 and had a 95% confidence interval that ranged from 0.59 to 0.69. All this looks good. The output also provides us with the chi-square value of 35.30, the degrees of freedom of 5, and the significance of the chi-square test (i.e. p < .001). These preliminary goodness of fit statistics suggest that the model may not fit the data all that well. However, Stata can provide us with more information. If you use the following syntax after running your CFA model, you will get additional goodness of fit statistics:

estat gof, stats(all)
Fit statistic              Value       Description

Likelihood ratio
  chi2_ms(5)              35.295       model vs. saturated
  p > chi2                 0.000
  chi2_bs(10)            953.586       baseline vs. saturated
  p > chi2                 0.000

Population error
  RMSEA                    0.091       Root mean squared error of approximation
  90% CI, lower bound      0.064
         upper bound       0.121
  pclose                   0.007       Probability RMSEA <= 0.05

Information criteria
  AIC                   8478.165       Akaike's information criterion
  BIC                   8546.958       Bayesian information criterion

Baseline comparison
  CFI                      0.968       Comparative fit index
  TLI                      0.936       Tucker-Lewis index

Size of residuals
  SRMR                     0.028       Standardized root mean squared residual
  CD                       0.811       Coefficient of determination
This provides us with some of the goodness of fit statistics we are familiar with from LISREL. For instance, we can see that the RMSEA value is 0.091, the CFI value is 0.968, and the TLI value is 0.936. The CD value of 0.811 provides information similar to the R-squared value you get using OLS and other forms of regression. In all, these goodness of fit statistics suggest that the fit of the model to the data may benefit from modification. At this point, it would be helpful to examine the modification indices and see if – purely in an empirical sense – any additional paths could be specified that might improve model fit. Is anyone else as excited as I am? Please don't raise your hand. To get the modification indices we type the following after having run the CFA model:

estat mindices
Modification indices

                                                             Standard
                                 MI   df   P>MI        EPC        EPC

Covariance
  e.s_felt with e.s_imp      13.420    1   0.00  -.0848231   -.174979
  e.s_felt with e.s_int      11.096    1   0.00   .0878639   .2376938
  e.s_work with e.s_imp      27.714    1   0.00   .1175849   .2650019
  e.s_work with e.s_int      19.436    1   0.00  -.1170219  -.3458601
  e.s_int with e.s_job        5.050    1   0.02   .0517828   .1280503

EPC = expected parameter change
These modification indices give us some important information about omitted paths in the fitted model. Two particularly salient points stand out. First, the pairs of error terms in the "Covariance" rows identify potential paths that could be added. Second, the numbers in the MI column represent the approximate decrease in the chi-square value that would result if the suggested path were added. For instance, if a covariance were added between e.s_imp and e.s_work, then the chi-square value would decrease by roughly 27.714. For pedagogical purposes, let's just say that we have a theoretical or conceptual reason to add a path between these error terms (i.e. perceived feelings of the importance and meaningfulness of school are conceptually akin, and thus it is conceptually plausible that the error terms might be correlated). In such a case, we would add this path using the following syntax:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job), cov(e.s_imp*e.s_work) stand
The output for this syntax is omitted; however, if you run this model you will see that the new chi-square value of 8.88 for the modified model is approximately 27 units lower than the initial chi-square value for the proposed model. If you examine the goodness of fit statistics again, you will also see a difference between the proposed and modified models:

estat gof, stats(all)
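If you want a formal test of that chi-square difference rather than just eyeballing the drop, one option (a sketch, not part of the original handout) is to fit both nested models, store the estimates, and let Stata compute the likelihood-ratio test; the stand option only changes how results are reported, so it can be dropped here:

* Proposed model (no correlated errors), stored for comparison
sem (INVOLVE -> s_felt s_work s_imp s_int s_job)
estimates store proposed

* Modified model with the e.s_imp/e.s_work covariance
sem (INVOLVE -> s_felt s_work s_imp s_int s_job), cov(e.s_imp*e.s_work)
estimates store modified

* Likelihood-ratio (chi-square difference) test of the nested models
lrtest proposed modified

Based on the fit statistics reported above, the difference should come out to roughly 35.30 − 8.88 ≈ 26.4 on 1 degree of freedom.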
Because I have nothing better to do on a Sunday afternoon (eek!), I created a table to show the difference between the goodness of fit statistics for the proposed model (i.e. the model without the path from e.s_imp to e.s_work) and the modified model.
Fit Indexes for Academic Involvement

                        χ² (df)          RMSEA    CFI     TLI     CD
  Proposed Model        35.295 (5)***    0.091    0.968   0.936   0.811
  Modified Model        8.876 (4)        0.041    0.995   0.987   0.812

Note: † p < .10, *p < .05, **p < .01, ***p < .001. Coefficients in bold are statistically significant at p < .05 or lower.
As you can see, the goodness of fit of the modified model is superior to that of the proposed model. Pretty simple, no? Here's what the path diagram looks like with the standardized values and error terms. You have to make this yourself using Microsoft Word or some other program. Stata won’t do it for you.
[Path diagram of the modified model: latent factor Academic Involvement with standardized loadings of .65*** on Felt About School (e1), .63*** on School Meaningful (e2), .56*** on Important for Future (e3), .84*** on Courses Interesting (e4), and .49*** on Teacher: Good Job (e5), plus a covariance of .23*** between the error terms of School Meaningful and Important for Future.]
Two Factor CFA
So, the first example looked at a confirmatory factor analysis for a single factor latent construct. Using the same basic syntax, we can do a very similar procedure for a two factor latent construct. This is a pretty simple procedure. Let's say we are interested in a latent construct of "school involvement" that is comprised of two latent factors: Academic Engagement (made up of the five observed variables that we examined above) AND Parental Involvement. In this case, parental involvement is made up of the following two variables:

p_checkwrk     Frequency of parents checking homework
p_homework     Frequency of parents helping with homework
Let's start by looking at a proposed path diagram that can help us to conceptualize what this two factor model looks and feels like.

[Path diagram: Academic Engagement with indicators Felt About School, School Meaningful, Important for Future, Courses Interesting, and Teacher: Good Job; Parental Involvement with indicators Check Homework and Help w/ Homework.]
The syntax for this model is very similar to that of a one factor confirmatory factor analysis. The only difference is that we add some additional syntax to specify that the observed variables hypothesized as loading onto the latent parental involvement variable should be included as well. Here it is:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job) (PARENT -> p_checkwrk p_homework), stand
Just looking at the measurement component of the output, we can see that the observed variables load reasonably well onto their corresponding latent constructs.
                                  OIM
Standardized       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Measurement
  s_felt <-
    INVOLVE      .639671    .027285    23.44   0.000     .5861934    .6931486
    _cons       3.099758   .0897856    34.52   0.000     2.923781    3.275734
  s_work <-
    INVOLVE     .6813196   .0266317    25.58   0.000     .6291224    .7335168
    _cons       3.329927   .0953372    34.93   0.000     3.143069    3.516784
  s_imp <-
    INVOLVE       .62111   .0285262    21.77   0.000     .5651997    .6770203
    _cons       3.505355   .0996087    35.19   0.000     3.310126    3.700585
  s_int <-
    INVOLVE      .796785   .0223727    35.61   0.000     .7529354    .8406347
    _cons       3.027399   .0880544    34.38   0.000     2.854816    3.199983
  s_job <-
    INVOLVE     .4991685   .0326156    15.30   0.000     .4352431    .5630939
    _cons       3.433842   .0978635    35.09   0.000     3.242033    3.625651
  p_checkwrk <-
    PARENT      .8637905   .1054045     8.20   0.000     .6572015     1.07038
    _cons       3.015601   .0877728    34.36   0.000     2.843569    3.187632
  p_homework <-
    PARENT       .573103   .0740109     7.74   0.000     .4280443    .7181618
    _cons       2.744652   .0813647    33.73   0.000      2.58518    2.904124
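One design note: by default, Stata's sem command allows exogenous latent variables such as INVOLVE and PARENT to covary, so their correlation is estimated automatically. If you wanted to compare this against a model that forces the two factors to be uncorrelated (a sketch, not something the handout itself does), you could constrain that covariance to zero:

* Two-factor model with the factor covariance constrained to zero
sem (INVOLVE -> s_felt s_work s_imp s_int s_job) ///
    (PARENT -> p_checkwrk p_homework), cov(INVOLVE*PARENT@0) stand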
Of course, as we did with the single factor model, we could also look at the goodness of fit statistics using the following command:

estat gof, stats(all)

And look at any modification indices using the following command:

estat mindices

Additionally, if you were concerned about missing values, you could also run the model using an approach that is conceptually akin to – but functionally simpler than – multiple imputation. This can be used for variables that are either missing at random or missing completely at random:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job) (PARENT -> p_checkwrk p_homework), method(mlmv) stand
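Before reaching for method(mlmv), it can also be worth checking how much missingness you actually have. A small sketch using Stata's misstable command (the variable list is simply the seven indicators from the model above):

* Summarize missing values and missing-data patterns for the seven indicators
misstable summarize s_felt s_work s_imp s_int s_job p_checkwrk p_homework
misstable patterns s_felt s_work s_imp s_int s_job p_checkwrk p_homework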
Other Fun Facts
There are several other post estimation commands that can provide you with more information about your CFA models.
Equation by Equation Goodness of Fit
Suppose you are dealing with severe insomnia and you decided that examining the goodness of fit for individual items might prove to be a useful and non-habit-forming sleep remedy. If this were the case, then you could try the following postestimation syntax:

estat eqgof
Equation-level goodness of fit

                            Variance
  depvars       fitted   predicted    residual   R-squared         mc        mc2

observed
  s_felt      .7920552    .3264381    .4656171    .4121406   .6419818   .4121406
  s_work      .7247448    .3346433    .3901015    .4617395   .6795142   .4617395
  s_imp       .8099424    .3052489    .5046936    .3768772   .6139033   .3768772
  s_int       .8190516    .5255869    .2934647    .6417018   .8010629   .6417018
  s_job       .7373622    .1801084    .5572538    .2442604   .4942271   .2442604

  overall                                         .8105315

mc  = correlation between depvar and its prediction
mc2 = mc^2 is the Bentler-Raykov squared multiple correlation coefficient
Above is the output for the one factor CFA model we examined at the outset. This output tells us several interesting things. First, it provides the r-squared values for each of the 5 observed items. Please note that this is simply the square of the standardized factor loading for each item (i.e. in the case of s_felt, with a standardized factor loading coefficient of 0.6419, squaring this gives us .412), but it might be useful to you. The "mc" and "mc2" values are the multiple correlation and the Bentler-Raykov squared multiple correlation; in recursive models these are identical to the standardized factor loading coefficient and the r-squared value, respectively. Additionally, the equation-level variance is decomposed into fitted, predicted, and residual variance for each item. Also, the coefficient of determination – which is basically like a full model r-squared – is provided as "overall". It is the same value we got above in the GOF output. Anyone feeling somnolent now?
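As a quick sanity check of the squaring relationship described above, you can let Stata do the arithmetic (display is simply Stata's calculator; the loading value comes from the output earlier in this handout):

* Square the standardized loading for s_felt; this reproduces the
* R-squared of .4121406 reported by estat eqgof
display .6419818^2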
Traditional SEM notation
Some of you may be missing the old, traditional SEM notation. You may want to call a friend and talk this over. Another option would be to simply use the following postestimation syntax, which gives you two pages of the "good stuff":

estat framework
Endogenous variables on endogenous variables

                                   observed
  Beta           s_felt    s_work     s_imp     s_int     s_job
observed
  s_felt              0
  s_work              0         0
  s_imp               0         0         0
  s_int               0         0         0         0
  s_job               0         0         0         0         0

Exogenous variables on endogenous variables

                   latent
  Gamma           INVOLVE
observed
  s_felt                1
  s_work          1.01249
  s_imp          .9670003
  s_int          1.268884
  s_job          .7427909

Covariances of error variables

                                   observed
  Psi          e.s_felt  e.s_work   e.s_imp   e.s_int   e.s_job
observed
  e.s_felt     .4656171
  e.s_work            0  .3901015
  e.s_imp             0         0  .5046936
  e.s_int             0         0         0  .2934647
  e.s_job             0         0         0         0  .5572538

Intercepts of endogenous variables

                                   observed
  alpha          s_felt    s_work     s_imp     s_int     s_job
  _cons            2.76      2.84  3.147586  2.750345  2.955862
Covariances of exogenous variables

                   latent
  Phi             INVOLVE
latent
  INVOLVE        .3264381

Means of exogenous variables

                   latent
  kappa           INVOLVE
  mean                  0
As you can see, you get a number of different estimation results in matrix format. Please note that there are several options that you can use to further specify this output. Type in the following command for more information:

help estat framework

In all, this is just a basic introduction to confirmatory factor analysis using Stata 12.1. Hopefully it has been helpful, but I strongly encourage you to consult the Stata 12.1 user's manual for further detail and a plethora of additional options. After you get the basics, the best way to learn this stuff is to simply run models over and over while consulting the user's manual, your colleagues, and the Statalist (http://www.stata.com/statalist/).