Confirmatory Factor Analysis Using Stata 12.1

Three Main Points:
1. The very basics of Stata CFA/SEM syntax
2. One Factor CFA
3. Two Factor CFA
To begin, we should start on a good note… There is, in my opinion, really good news: In terms of conducting most analyses, the syntax for CFA/SEM in Stata 12.1 is far, far simpler than that of LISREL. (Note: you cannot use earlier versions of Stata for SEM; Stata 12.1 is the first version that includes this method.) Now take your seat, buckle up, and get ready for another ride on the nerd bus.
Basic CFA/SEM Syntax Using Stata: Syntax Basics
The most basic language is that which specifies the relationship between the latent constructs and the observed variables. You do this simply by specifying a latent construct (using CAPITAL letters) that is comprised of observed variables that are in the data set. For example:

sem (DEPRESS -> x1 x2 x3)
or
sem (x1 x2 x3 <- MONKEY)
Please note that the order between the latent construct and the observed variables does not matter, and it does not matter what you name the latent construct (Note: while you can have fun with absurd names, it is best to make them short and sweet). However, you do need to be careful to have the "arrow" – which is simply comprised of a hyphen ("-") and a greater than/less than sign (">" or "<") – pointed in the right direction: away from the latent construct.
Another basic piece of syntax is the "cov()" option. This allows you to specify that the error terms for particular observed variables be allowed to covary. For example:

sem (DEPRESS -> x1 x2 x3), cov(e.x1*e.x2 e.x1*e.x3)
or
sem (DEPRESS -> x1 x2 x3), cov(e.x1*e.x2) cov(e.x1*e.x3)
Again, the syntax here is relatively flexible, as there are several ways to specify that the error terms of certain observed variables be allowed to correlate. These are the most basic concepts of Stata syntax for CFA/SEM. With these, we easily have enough information to look at an example and – in doing so – to become familiar with some more CFA/SEM syntax.
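Putting these pieces together, a minimal one-factor sketch might look like the lines below. The latent construct DEPRESS and the indicators x1–x4 are just the hypothetical names from the examples above (a fourth indicator is added so the model stays identified once an error covariance is included), and the stand option, which requests standardized coefficients, shows up again in the examples that follow:

* Hypothetical one-factor CFA: four indicators of DEPRESS,
* allowing the errors of x1 and x2 to covary,
* with results reported as standardized coefficients
sem (DEPRESS -> x1 x2 x3 x4), cov(e.x1*e.x2) stand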
One Factor CFA
The data for this example – indeed, for all the examples in this session – come from the National Survey of Drug Use and Health (NSDUH), a really interesting, nationally representative, cross-sectional survey of substance use among individuals ages 12 and up. This particular data set is a subset of the NSDUH comprised of truant adolescent males between the ages of 12 and 17 (M age = 15.3, SD = 0.057). As these data are coded, roughly three-fifths are White Non-Hispanic (57.71%) while the remaining two-fifths are youth of color (African-American/Hispanic) (42.29%). Let's open up the data set and then use the "set more off" command so that Stata does not drive us crazy by requiring us to click "more" every 5 seconds:

use "L:\TranCFA_red.dta"
set more off
We can start with an example of confirmatory factor analysis using five variables related to adolescent academic involvement:

s_felt     Felt about going to school
s_work     Frequency of school feeling meaningful
s_imp      How important things learned will be in future
s_int      Courses interesting
s_job      Teacher let you know you did a good job
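Before fitting the CFA, it can be useful to take a quick look at the items themselves. A minimal sketch, assuming the data set above is in memory (summarize and tab1 are standard Stata commands):

* Descriptive statistics and one-way tabulations for the five items
summarize s_felt s_work s_imp s_int s_job
tab1 s_felt s_work s_imp s_int s_job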
Just so we can visualize it, this is the path diagram for the proposed model:
[Path diagram: latent factor Academic Involvement with observed indicators Felt About School, School Meaningful, Important for Future, Courses Interesting, and Teacher: Good Job.]

Now that we have our proposed confirmatory factor model, let's write up the syntax and see what we get. I am going to use the "stand" option so that we get standardized coefficients:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job), stand
Endogenous variables
  Measurement:  s_felt s_work s_imp s_int s_job

Exogenous variables
  Latent:       INVOLVE

Fitting target model:
Iteration 0:   log likelihood = -4225.2004
Iteration 1:   log likelihood =   -4224.09
Iteration 2:   log likelihood = -4224.0826
Iteration 3:   log likelihood = -4224.0826

Structural equation model                       Number of obs      =       725
Estimation method  = ml
Log likelihood     = -4224.0826

 ( 1)  [s_felt]INVOLVE = 1
This first part of the output tells us a few things (see above). First, it identifies our five observed variables as "measurement" variables that contribute to the identification of the latent variable we have named INVOLVE. It also shows us how many iterations – using maximum likelihood – it took for Stata to fit the target model (3 is good! After all, this is a relatively simple model…). It also tells us our number of observations (725), the log likelihood value for the model, and that the factor loading for s_felt was fixed to 1.0. Next, we can look at the rest of the standardized output:
                                  OIM
Standardized       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Measurement
  s_felt <-
    INVOLVE     .6419818   .0271475    23.65   0.000     .5887737    .6951898
    _cons       3.101211   .0895102    34.65   0.000     2.925775    3.276648
  s_work <-
    INVOLVE     .6795142   .0269338    25.23   0.000     .6267248    .7323036
    _cons       3.335998   .0951546    35.06   0.000     3.149498    3.522498
  s_imp <-
    INVOLVE     .6139033   .0288177    21.30   0.000     .5574217    .6703849
    _cons       3.497442   .0990719    35.30   0.000     3.303265     3.69162
  s_int <-
    INVOLVE     .8010629   .0225112    35.59   0.000     .7569417     .845184
    _cons       3.039006   .0880265    34.52   0.000     2.866477    3.211534
  s_job <-
    INVOLVE     .4942271   .0326628    15.13   0.000     .4302092     .558245
    _cons        3.44226   .0977299    35.22   0.000     3.250713    3.633808

Variance
  e.s_felt      .5878594   .0348563                      .5233625    .6603046
  e.s_work      .5382605   .0366039                      .4710938    .6150035
  e.s_imp       .6231228   .0353825                       .557494    .6964774
  e.s_int       .3582982   .0360658                      .2941467    .4364408
  e.s_job       .7557396   .0322857                      .6950376     .821743
  INVOLVE              1          .                             .           .

LR test of model vs. saturated: chi2(5) =  35.30, Prob > chi2 = 0.0000
This output gives us the standardized factor loading values for each of the five observed variables, as well as their standard errors, significance, and confidence intervals. For example, the standardized factor loading for s_felt onto the latent construct INVOLVE was 0.64 with a standard error of 0.027. It was significant at p < .001 and had a 95% confidence interval that ranged from 0.59 to 0.69. All this looks good. The output also provides us with the chi-square value of 35.30, the degrees of freedom of 5, and the significance of the chi-square test (i.e. p < .001). These preliminary goodness of fit statistics suggest that the model may not fit the data all that well. However, Stata can provide us with more information. If you use the following syntax after running your CFA model, you will get additional goodness of fit statistics:

estat gof, stats(all)
Fit statistic              Value       Description

Likelihood ratio
  chi2_ms(5)              35.295       model vs. saturated
  p > chi2                 0.000
  chi2_bs(10)            953.586       baseline vs. saturated
  p > chi2                 0.000

Population error
  RMSEA                    0.091       Root mean squared error of approximation
  90% CI, lower bound      0.064
         upper bound       0.121
  pclose                   0.007       Probability RMSEA <= 0.05

Information criteria
  AIC                   8478.165       Akaike's information criterion
  BIC                   8546.958       Bayesian information criterion

Baseline comparison
  CFI                      0.968       Comparative fit index
  TLI                      0.936       Tucker-Lewis index

Size of residuals
  SRMR                     0.028       Standardized root mean squared residual
  CD                       0.811       Coefficient of determination
This provides us with some of the goodness of fit statistics we are familiar with from LISREL. For instance, we can see that the RMSEA value is 0.091, the CFI value is 0.968, and the TLI value is 0.936. The CD value of 0.811 provides information similar to the R-squared value you get using OLS and other forms of regression. In all, these goodness of fit statistics suggest that the fit of the model to the data may benefit from modification. At this point, it would be helpful to examine the modification indices and see if – purely in an empirical sense – any additional paths could be specified that might improve model fit. Is anyone else as excited as I am? Please don't raise your hand. To get the modification indices we type the following after having run the CFA model:

estat mindices
Modification indices

                                                             Standard
                                 MI   df   P>MI        EPC        EPC

Covariance
  e.s_felt with e.s_imp      13.420    1   0.00  -.0848231   -.174979
  e.s_felt with e.s_int      11.096    1   0.00   .0878639   .2376938
  e.s_work with e.s_imp      27.714    1   0.00   .1175849   .2650019
  e.s_work with e.s_int      19.436    1   0.00  -.1170219  -.3458601
  e.s_int with e.s_job        5.050    1   0.02   .0517828   .1280503

EPC = expected parameter change
These modification indices give us some important information about omitted paths in the fitted model. Two particularly salient points stand out. First, the pairs of error terms in the "Covariance" rows identify potential paths that could be added. Second, the numbers in the MI column represent the approximate decrease in the chi-square value that would result if the suggested path were added. For instance, if a covariance were added between e.s_imp and e.s_work, then the chi-square value would decrease by roughly 27.714. For pedagogical purposes, let's just say that we have a theoretical or conceptual reason to add a path between these error terms (i.e. perceived feelings of the importance and meaningfulness of school are conceptually akin, and thus it is conceptually plausible that the error terms might be correlated). In such a case, we would add this path using the following syntax:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job), cov(e.s_imp*e.s_work) stand
The output for this syntax is omitted; however, if you run this model you will see that the new chi-square value of 8.88 for the modified model is approximately 27 units lower than the initial chi-square value for the proposed model. If you examine the goodness of fit statistics again, you will also see a difference between the proposed and modified models:

estat gof, stats(all)
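If you want a formal test of that chi-square difference rather than just eyeballing the drop, one option (a sketch, not part of the original handout) is to fit both nested models, store the estimates, and let Stata compute the likelihood-ratio test; the stand option only changes how results are reported, so it can be dropped here:

* Proposed model (no correlated errors), stored for comparison
sem (INVOLVE -> s_felt s_work s_imp s_int s_job)
estimates store proposed

* Modified model with the e.s_imp/e.s_work covariance
sem (INVOLVE -> s_felt s_work s_imp s_int s_job), cov(e.s_imp*e.s_work)
estimates store modified

* Likelihood-ratio (chi-square difference) test of the nested models
lrtest proposed modified

Based on the fit statistics reported above, the difference should come out to roughly 35.30 − 8.88 ≈ 26.4 on 1 degree of freedom.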
Because I have nothing better to do on a Sunday afternoon (eek!), I created a table to show the difference between the goodness of fit statistics for the proposed model (i.e. the model without the path from e.s_imp to e.s_work) and the modified model.
Fit Indexes for Academic Involvement

                        χ² (df)          RMSEA    CFI     TLI     CD
  Proposed Model        35.295 (5)***    0.091    0.968   0.936   0.811
  Modified Model        8.876 (4)        0.041    0.995   0.987   0.812

Note: † p < .10, *p < .05, **p < .01, ***p < .001. Coefficients in bold are statistically significant at p < .05 or lower.
As you can see, the goodness of fit of the modified model is superior to that of the proposed model. Pretty simple, no? Here's what the path diagram looks like with the standardized values and error terms. You have to make this yourself using Microsoft Word or some other program. Stata won’t do it for you.
[Path diagram of the modified model: latent factor Academic Involvement with standardized loadings of .65*** on Felt About School (e1), .63*** on School Meaningful (e2), .56*** on Important for Future (e3), .84*** on Courses Interesting (e4), and .49*** on Teacher: Good Job (e5), plus a covariance of .23*** between the error terms of School Meaningful and Important for Future.]
Two Factor CFA
So, the first example looked at a confirmatory factor analysis for a single factor latent construct. Using the same basic syntax, we can do a very similar procedure for a two factor latent construct. This is a pretty simple procedure. Let's say we are interested in a latent construct of "school involvement" that is comprised of two latent factors: Academic Engagement (made up of the five observed variables that we examined above) AND Parental Involvement. In this case, parental involvement is made up of the following two variables:

p_checkwrk     Frequency of parents checking homework
p_homework     Frequency of parents helping with homework
Let's start by looking at a proposed path diagram that can help us to conceptualize what this two factor model looks and feels like.

[Path diagram: Academic Engagement with indicators Felt About School, School Meaningful, Important for Future, Courses Interesting, and Teacher: Good Job; Parental Involvement with indicators Check Homework and Help w/ Homework.]
The syntax for this model is very similar to that of a one factor confirmatory factor analysis. The only difference is that we add some additional syntax to specify that the observed variables hypothesized as loading onto the latent parental involvement variable should be included as well. Here it is:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job) (PARENT -> p_checkwrk p_homework), stand
Just looking at the measurement component of the output, we can see that the observed variables load reasonably well onto their corresponding latent constructs.
                                  OIM
Standardized       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Measurement
  s_felt <-
    INVOLVE      .639671    .027285    23.44   0.000     .5861934    .6931486
    _cons       3.099758   .0897856    34.52   0.000     2.923781    3.275734
  s_work <-
    INVOLVE     .6813196   .0266317    25.58   0.000     .6291224    .7335168
    _cons       3.329927   .0953372    34.93   0.000     3.143069    3.516784
  s_imp <-
    INVOLVE       .62111   .0285262    21.77   0.000     .5651997    .6770203
    _cons       3.505355   .0996087    35.19   0.000     3.310126    3.700585
  s_int <-
    INVOLVE      .796785   .0223727    35.61   0.000     .7529354    .8406347
    _cons       3.027399   .0880544    34.38   0.000     2.854816    3.199983
  s_job <-
    INVOLVE     .4991685   .0326156    15.30   0.000     .4352431    .5630939
    _cons       3.433842   .0978635    35.09   0.000     3.242033    3.625651
  p_checkwrk <-
    PARENT      .8637905   .1054045     8.20   0.000     .6572015     1.07038
    _cons       3.015601   .0877728    34.36   0.000     2.843569    3.187632
  p_homework <-
    PARENT       .573103   .0740109     7.74   0.000     .4280443    .7181618
    _cons       2.744652   .0813647    33.73   0.000      2.58518    2.904124
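One design note: by default, Stata's sem command allows exogenous latent variables such as INVOLVE and PARENT to covary, so their correlation is estimated automatically. If you wanted to compare this against a model that forces the two factors to be uncorrelated (a sketch, not something the handout itself does), you could constrain that covariance to zero:

* Two-factor model with the factor covariance constrained to zero
sem (INVOLVE -> s_felt s_work s_imp s_int s_job) ///
    (PARENT -> p_checkwrk p_homework), cov(INVOLVE*PARENT@0) stand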
Of course, as we did with the single factor model, we could also look at the goodness of fit statistics using the following command:

estat gof, stats(all)

And look at any modification indices using the following command:

estat mindices

Additionally, if you were concerned about missing values, you could also run the model using an approach that is conceptually akin to – but functionally simpler than – multiple imputation. This can be used for variables that are either missing at random or missing completely at random:

sem (INVOLVE -> s_felt s_work s_imp s_int s_job) (PARENT -> p_checkwrk p_homework), method(mlmv) stand
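Before reaching for method(mlmv), it can also be worth checking how much missingness you actually have. A small sketch using Stata's misstable command (the variable list is simply the seven indicators from the model above):

* Summarize missing values and missing-data patterns for the seven indicators
misstable summarize s_felt s_work s_imp s_int s_job p_checkwrk p_homework
misstable patterns s_felt s_work s_imp s_int s_job p_checkwrk p_homework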
Other Fun Facts
There are several other post estimation commands that can provide you with more information about your CFA models.
Equation by Equation Goodness of Fit
Suppose you are dealing with severe insomnia and you decided that examining the goodness of fit for individual items might prove to be a useful and non-habit-forming sleep remedy. If this were the case, then you could try the following postestimation syntax:

estat eqgof
Equation-level goodness of fit

                            Variance
  depvars       fitted   predicted    residual   R-squared         mc        mc2

observed
  s_felt      .7920552    .3264381    .4656171    .4121406   .6419818   .4121406
  s_work      .7247448    .3346433    .3901015    .4617395   .6795142   .4617395
  s_imp       .8099424    .3052489    .5046936    .3768772   .6139033   .3768772
  s_int       .8190516    .5255869    .2934647    .6417018   .8010629   .6417018
  s_job       .7373622    .1801084    .5572538    .2442604   .4942271   .2442604

  overall                                         .8105315

mc  = correlation between depvar and its prediction
mc2 = mc^2 is the Bentler-Raykov squared multiple correlation coefficient
Above is the output for the one factor CFA model we examined at the outset. This output tells us several interesting things. First, it provides the r-squared values for each of the 5 observed items. Please note that this is simply the square of the standardized factor loading for each item (i.e. in the case of s_felt, with a standardized factor loading coefficient of 0.6419, squaring this gives us .412), but it might be useful to you. The "mc" and "mc2" values are the multiple correlation and the Bentler-Raykov squared multiple correlation; in recursive models these are identical to the standardized factor loading coefficient and the r-squared value, respectively. Additionally, the equation-level variance is decomposed into fitted, predicted, and residual variance for each item. Also, the coefficient of determination – which is basically like a full model r-squared – is provided as "overall". It is the same value we got above in the GOF output. Anyone feeling somnolent now?
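As a quick sanity check of the squaring relationship described above, you can let Stata do the arithmetic (display is simply Stata's calculator; the loading value comes from the output earlier in this handout):

* Square the standardized loading for s_felt; this reproduces the
* R-squared of .4121406 reported by estat eqgof
display .6419818^2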
Traditional SEM notation
Some of you may be missing the old, traditional SEM notation. You may want to call a friend and talk this over. Another option would be to simply use the following postestimation syntax, which gives you two pages of the "good stuff":

estat framework
Endogenous variables on endogenous variables

                                   observed
  Beta           s_felt    s_work     s_imp     s_int     s_job
observed
  s_felt              0
  s_work              0         0
  s_imp               0         0         0
  s_int               0         0         0         0
  s_job               0         0         0         0         0

Exogenous variables on endogenous variables

                   latent
  Gamma           INVOLVE
observed
  s_felt                1
  s_work          1.01249
  s_imp          .9670003
  s_int          1.268884
  s_job          .7427909

Covariances of error variables

                                   observed
  Psi          e.s_felt  e.s_work   e.s_imp   e.s_int   e.s_job
observed
  e.s_felt     .4656171
  e.s_work            0  .3901015
  e.s_imp             0         0  .5046936
  e.s_int             0         0         0  .2934647
  e.s_job             0         0         0         0  .5572538

Intercepts of endogenous variables

                                   observed
  alpha          s_felt    s_work     s_imp     s_int     s_job
  _cons            2.76      2.84  3.147586  2.750345  2.955862
Covariances of exogenous variables

                   latent
  Phi             INVOLVE
latent
  INVOLVE        .3264381

Means of exogenous variables

                   latent
  kappa           INVOLVE
  mean                  0
As you can see, you get a number of different estimation results in matrix format. Please note that there are several options that you can use to further specify this output. Type in the following command for more information:

help estat framework

In all, this is just a basic introduction to confirmatory factor analysis using Stata 12.1. Hopefully it has been helpful, but I strongly encourage you to consult the Stata 12.1 user's manual for further detail and a plethora of additional options. After you get the basics, the best way to learn this stuff is to simply run models over and over while consulting the user's manual, your colleagues, and the Statalist (http://www.stata.com/statalist/).