Data Processing
with Stata 14.1 Cheat Sheet For more info see Stata’s reference manual (stata.com)
Basic Syntax
All Stata commands have the same format (syntax):
[by varlist1:] command [varlist2]
    [by varlist1:] applies the command across each unique combination of variables in varlist1
Useful Shortcuts
keyboard buttons
F2          describe data
Ctrl + 8    open the data editor
Ctrl + 9    open a new .do file
Ctrl + D    highlight text in a .do file, then Ctrl + D executes it in the command line
PgUp        scroll through previous commands
Tab         autocompletes variable name after typing part
cls         clear the console (where results are displayed)
clear       delete data in memory

Arithmetic
+    add (numbers)
+    combine (strings)
−    subtract
*    multiply
Set up
pwd
    print current (working) directory
cd "C:\Program Files (x86)\Stata13"
    change working directory
dir
    display filenames in working directory
fs *.dta
    list all Stata data in working directory
capture log close
    close the log on any existing do files (underlined parts are shortcuts: use "capture" or "cap")
log using "myDoFile.txt", replace
    create a new log file to record your work and results
search mdesc
    find the package mdesc to install; packages contain extra commands that expand Stata’s toolkit
ssc install mdesc
    install the package mdesc; needs to be done once
Import Data
sysuse auto, clear
    load system data (Auto data); for many examples, we use the auto dataset
use "yourStataFile.dta", clear
    load a dataset from the current directory
import excel "yourSpreadsheet.xlsx", /*
    */ sheet("Sheet1") cellrange(A2:H11) firstrow
    import an Excel spreadsheet
import delimited "yourFile.csv", /*
    */ rowrange(2:11) colrange(1:8) varnames(2)
    import a .csv file
webuse set "https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data"
webuse "wb_indicators_long"
    set web-based directory and load data from the web
[by varlist1:] command [varlist2] [=exp] [if exp] [in range] [weight] [using filename] [,options]
    command             function: what are you going to do to varlists?
    [=exp]              column to save output as a new variable
    [if exp]            condition: only apply the function if something is true
    [in range]          apply the command to specific rows
    [weight]            apply weights
    [using filename]    pull data from a file (if not loaded)
    [,options]          special options for command

bysort rep78 : summarize price if foreign == 0 & price <= 9000, detail
    In this example, we want a detailed summary with stats like kurtosis, plus mean and median

To find out more about any command, like what options it takes, type help command

AT COMMAND PROMPT
PgDn    scroll through previous commands
/    divide
^    raise to a power

Basic Data Operations
Logic
&           and
! or ~      not
|           or
==          tests if something is equal
=           assigns a value to a variable
==          equal
!= or ~=    not equal
<           less than
<=          less than or equal to
>           greater than
>=          greater or equal to

if foreign != 1 & price >= 10000
make             foreign    price     condition true?
Chevy Colt       0          3,984     no
Buick Riviera    0          10,372    yes
Honda Civic      1          4,499     no
Volvo 260        1          11,995    no

if foreign != 1 | price >= 10000
make             foreign    price     condition true?
Chevy Colt       0          3,984     yes
Buick Riviera    0          10,372    yes
Honda Civic      1          4,499     no
Volvo 260        1          11,995    yes
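The difference between & and | can be checked directly by counting rows. This is a minimal sketch on the auto dataset that ships with Stata; the local macro names n_and and n_or and the variable isDomestic are illustrative, not part of the cheat sheet. Run it from a .do file so the local macros persist between lines.

```stata
sysuse auto, clear

* & (and): domestic cars priced at $10,000 or more
count if foreign != 1 & price >= 10000
local n_and = r(N)

* | (or): all domestic cars, plus any car priced at $10,000 or more
count if foreign != 1 | price >= 10000
local n_or = r(N)

* every row that satisfies the "and" condition also satisfies the "or" condition
assert `n_and' <= `n_or'

* == tests equality, = assigns: store the result of a test in a new variable
generate byte isDomestic = foreign == 0
tabulate isDomestic foreign
```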
Explore Data
VIEW DATA ORGANIZATION
describe make price
    display variable type, format, and any value/variable labels
count
count if price > 5000
    number of rows (observations); can be combined with logic
ds, has(type string)
lookfor "in."
    search for variable types, variable name, or variable label
isid mpg
    check if mpg uniquely identifies the data

SEE DATA DISTRIBUTION
codebook make price
    overview of variable type, stats, number of missing/unique values
summarize make price mpg
    print summary statistics (mean, stdev, min, max) for variables
inspect mpg
    show histogram of data, number of missing or zero observations
histogram mpg, frequency
    plot a histogram of the distribution of a variable

BROWSE OBSERVATIONS WITHIN THE DATA
Missing values are treated as the largest positive number. To exclude missing values, use the !missing(varname) syntax.
browse (or Ctrl + 8)
    open the data editor
list make price if price > 10000 & !missing(price)
clist ... (compact form)
    list the make and price for observations with price > $10,000
display price[4]
    display the 4th observation in price; only works on single values
gsort price mpg (ascending)
gsort -price -mpg (descending)
    sort in order, first by price then miles per gallon
duplicates report
    finds all duplicate values in each variable
assert price!=.
    verify truth of claim
levelsof rep78
    display the unique values for rep78
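Because missing values sort above every number, a "greater than" condition silently includes them. The sketch below illustrates this with rep78 from the auto dataset, which has some missing values; the local macro names are illustrative. Run it from a .do file.

```stata
sysuse auto, clear

* rows with rep78 > 4, including rows where rep78 is missing
count if rep78 > 4
local withMissing = r(N)

* the same condition, excluding missing values
count if rep78 > 4 & !missing(rep78)
local withoutMissing = r(N)

* the gap between the two counts is the number of missing values
count if missing(rep78)
display `withMissing' - `withoutMissing' " rows were counted only because rep78 is missing"
```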
Tim Essam ([email protected]) • Laura Hughes ([email protected]) follow us @StataRGIS and @flaneuseks
inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets)
Change Data Types
Stata has 6 data types, and data can also be missing:
missing                     no data
byte                        true/false
int, long, float, double    numbers
string                      words

To convert between numbers & strings:
gen foreignString = string(foreign)            1 → "1"
tostring foreign, gen(foreignString)           1 → "1"
decode foreign, gen(foreignString)             1 → "foreign"
gen foreignNumeric = real(foreignString)       "1" → 1
destring foreignString, gen(foreignNumeric)    "1" → 1
encode foreignString, gen(foreignNumeric)      "foreign" → 1

recast double mpg
    generic way to convert between types
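A short round trip shows how the value-based and label-based conversions differ. This is a sketch on the auto dataset; the new variable names (foreignString, foreignLabel, foreignNumeric, foreignCode) are illustrative. Note that encode numbers the labels alphabetically starting at 1, so its codes need not equal the original 0/1 values.

```stata
sysuse auto, clear

* numeric -> string using the stored number (0 or 1)
generate foreignString = string(foreign)

* numeric -> string using the value label ("Domestic" or "Foreign")
decode foreign, generate(foreignLabel)

* string -> numeric: parse the digits back into a number
generate foreignNumeric = real(foreignString)

* labelled text -> numeric codes (1, 2, ...) with a new value label
encode foreignLabel, generate(foreignCode)

describe foreign foreignString foreignLabel foreignNumeric foreignCode
```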
Summarize Data
tabulate rep78, mi gen(repairRecord)
    one-way table: number of rows with each value of rep78
    mi: include missing values
    gen(repairRecord): create binary variable for every rep78 value in a new variable, repairRecord
tabulate rep78 foreign, mi
    two-way table: cross-tabulate number of observations for each combination of rep78 and foreign
bysort rep78: tabulate foreign
    for each value of rep78, apply the command tabulate foreign
tabstat price weight mpg, by(foreign) stat(mean sd n)
    create compact table of summary statistics
table foreign, contents(mean price sd price) f(%9.2fc) row
    create a flexible table of summary statistics
    f(%9.2fc): formats numbers
    row: displays stats for all data
collapse (mean) price (max) mpg, by(foreign)
    calculate mean price & max mpg by car type (foreign); replaces data
Create New Variables
generate mpgSq = mpg^2
gen byte lowPr = price < 4000
    create a new variable. Useful also for creating binary variables based on a condition (generate byte)
generate id = _n
bysort rep78: gen repairIdx = _n
    _n creates a running index of observations in a group
generate totRows = _N
bysort rep78: gen repairTot = _N
    _N creates a running count of the total observations per group
pctile mpgQuartile = mpg, nq(4)
    create quartiles of the mpg data
egen meanPrice = mean(price), by(foreign)
    calculate mean price for each group in foreign; see help egen for more options
geocenter.github.io/StataTraining
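The relationship between _n and _N inside a by-group is easiest to see side by side. A minimal sketch on the auto dataset, reusing the variable names from the commands above:

```stata
sysuse auto, clear

* _n restarts at 1 within each rep78 group; _N is that group's size
bysort rep78: generate repairIdx = _n
bysort rep78: generate repairTot = _N

* the running index can never exceed the group total
assert repairIdx <= repairTot

* egen repeats one group-level statistic on every row of the group
egen meanPrice = mean(price), by(foreign)

list rep78 repairIdx repairTot in 1/10, sepby(rep78)
```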
Disclaimer: we are not affiliated with Stata. But we like it.
updated January 2016 CC BY 4.0
Data Transformation
with Stata 14.1 Cheat Sheet
Select Parts of Data (Subsetting)
SELECT SPECIFIC COLUMNS
drop make
    remove the 'make' variable
keep make price
    opposite of drop; keep only variables 'make' and 'price'

FILTER SPECIFIC ROWS
drop if mpg < 20
drop in 1/4
    drop observations based on a condition (first) or rows 1-4 (second)
keep in 1/30
    opposite of drop; keep only rows 1-30
keep if inrange(price, 5000, 10000)
    keep values of price between $5,000 and $10,000 (inclusive)
keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru")
    keep the specified values of make
sample 25
    sample 25% of the observations in the dataset (use set seed # command for reproducible sampling)

webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data
webuse "coffeeMaize.dta"
    load demo dataset
MELT DATA (WIDE → LONG)
reshape long coffee@ maize@, i(country) j(year)
    convert a wide dataset to long
    coffee@ maize@: reshape variables starting with coffee and maize
    i(country): unique id variable (key)
    j(year): create new variable which captures the info in the column names

CAST DATA (LONG → WIDE)
reshape wide coffee maize, i(country) j(year)
    convert a long dataset to wide
    coffee maize: create new variables named coffee2011, maize2012...
    i(country): what will be the unique id variable (key)
    j(year): create new variables with the year added to the column name

WIDE (melt → LONG, cast ← LONG)
country    coffee2011    coffee2012    maize2011    maize2012
Malawi
Rwanda
Uganda

LONG (TIDY)
country    year    coffee    maize
Malawi     2011
Malawi     2012
Rwanda     2011
Rwanda     2012
Uganda     2011
Uganda     2012

TIDY DATASETS have each observation in its own row and each variable in its own column.
When datasets are tidy, they have a consistent, standard format that is easier to manipulate and analyze.
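The melt/cast round trip can be tried on a tiny hand-typed dataset. This is a sketch only: the numbers below are made up for illustration and are not the values in coffeeMaize.dta.

```stata
clear
* hypothetical wide data: one row per country, one column per crop-year
input str6 country coffee2011 coffee2012 maize2011 maize2012
"Malawi" 1 2 10 20
"Rwanda" 3 4 30 40
"Uganda" 5 6 50 60
end

* melt: one row per country-year; the year is parsed from the column names
reshape long coffee maize, i(country) j(year)
list, sepby(country)

* cast: back to one row per country, with coffee2011 ... maize2012 restored
reshape wide coffee maize, i(country) j(year)
list
```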
xpose, clear varname
    transpose rows and columns of data, clearing the data and saving old column names as a new variable called "_varname"
Replace Parts of Data
rename (rep78 foreign) (repairRecord carType)
    rename one or multiple variables

REPLACE MISSING VALUES
mvdecode _all, mv(9999)
    replace the number 9999 with missing value in all variables; useful for cleaning survey datasets
mvencode _all, mv(9999)
    replace missing values with the number 9999 for all variables; useful for exporting data

Label Data
Value labels map string descriptions to numbers. They allow the underlying data to be numeric (making logical tests simpler) while also connecting the values to human-understandable text.
label define myLabel 0 "US" 1 "Not US"
label values foreign myLabel
    define a label and apply it to the values in foreign
note: data note here
    place note in dataset
webuse coffeeMaize2.dta, clear
save coffeeMaize2.dta, replace
webuse coffeeMaize.dta, clear
    load demo data
CHANGE ROW VALUES
replace price = 5000 if price < 5000
    replace all values of price that are less than $5,000 with 5000
recode price (0 / 5000 = 5000)
    change all prices less than 5000 to be $5,000
recode foreign (0 = 2 "US")(1 = 1 "Not US"), gen(foreign2)
    change the values and value labels then store in a new variable, foreign2
append using "coffeeMaize2.dta", gen(filenum)
    add observations from "coffeeMaize2.dta" to current data and create variable "filenum" to track the origin of each observation; both datasets should contain the same variables (columns)
MERGING TWO DATASETS TOGETHER
The two datasets must contain a common variable (id).
ONE-TO-ONE:     (id, blue, pink) + (id, brown) = (id, blue, pink, brown, _merge)
MANY-TO-ONE:    (id, blue, pink) + (id, brown) = (id, blue, pink, brown, _merge)

_merge code
1 (master)    row only in ind2
2 (using)     row only in hh2
3 (match)     row in both
webuse ind_age.dta, clear
save ind_age.dta, replace
webuse ind_ag.dta, clear
merge 1:1 id using "ind_age.dta"
    one-to-one merge of "ind_age.dta" into the loaded dataset and create variable "_merge" to track the origin

webuse hh2.dta, clear
save hh2.dta, replace
webuse ind2.dta, clear
merge m:1 hid using "hh2.dta"
    many-to-one merge of "hh2.dta" into the loaded dataset and create variable "_merge" to track the origin
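The three _merge codes can be reproduced without downloading anything. The sketch below builds two small hypothetical datasets (individuals and households, with made-up values) and merges them many-to-one; it uses a tempfile, so run it from a .do file.

```stata
* hypothetical household-level data (the "using" dataset)
clear
input hid str5 region
1 "North"
2 "South"
3 "East"
end
tempfile households
save `households'

* hypothetical individual-level data (the "master" dataset)
clear
input id hid age
1 1 34
2 1 29
3 2 51
4 4 45
end

* many individuals can share one household id
merge m:1 hid using `households'

* household 4 has no household record (_merge == 1),
* household 3 has no individuals (_merge == 2), the rest match (_merge == 3)
tabulate _merge
list, sepby(hid)
```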
FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID
reclink (ssc install reclink)
    match records from different data sets using probabilistic matching
jarowinkler (ssc install jarowinkler)
    create distance measure for similarity between two strings
FIND MATCHING STRINGS
display strmatch("123.89", "1??.?9")
    return true (1) or false (0) if string matches pattern
display substr("Stata", 3, 5)
    return the substring that starts at character 3 and is up to 5 characters long (here "ata")
list make if regexm(make, "[0-9]")
    list observations where make matches the regular expression (here, records that contain a number)
list if regexm(make, "(Cad.|Chev.|Datsun)")
    return all observations where make contains "Cad.", "Chev." or "Datsun"
TRANSFORM STRINGS
Combine Data id
GET STRING PROPERTIES
display length("This string has 29 characters")
    return the length of the string
charlist make (* user-defined package)
    display the set of unique characters within a string
display strpos("Stata", "a")
    return the position in Stata where a is first found
list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")
    compare the given list against the first word in make; return all observations where the first word of the make variable is one of the listed words
ADDING (APPENDING) NEW DATA
CHANGE COLUMN NAMES
label list list all labels within the dataset
Manipulate Strings
Reshape Data
display regexr("My string", "My", "Your")
    replace string1 ("My") with string2 ("Your")
replace make = subinstr(make, "Cad.", "Cadillac", 1)
    replace first occurrence of "Cad." with Cadillac in the make variable
display stritrim(" Too much Space")
    replace consecutive spaces with a single space
display trim(" leading / trailing spaces ")
    remove extra spaces before and after a string
display strlower("STATA should not be ALL-CAPS")
    change string case; see also strupper, strproper
display strtoname("1Var name")
    convert string to Stata-compatible variable name
display real("100")
    convert string to a numeric or missing value
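A few of the string functions above can be checked on literal strings, with no dataset loaded. This is a minimal sketch; "Cad. Seville" and "Olds 98" are example strings chosen for illustration.

```stata
* position of the first "a" in "Stata"
display strpos("Stata", "a")

* substr(s, start, length): start at character 3, take up to 5 characters
display substr("Stata", 3, 5)

* replace only the first occurrence of "Cad."
display subinstr("Cad. Seville", "Cad.", "Cadillac", 1)

* regexm returns 1 when the pattern matches, 0 otherwise
display regexm("Olds 98", "[0-9]")

* assert stops a .do file if an expectation about a function is wrong
assert strpos("Stata", "a") == 3
assert substr("Stata", 3, 5) == "ata"
```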
Save & Export Data
compress
    compress data in memory
save "myData.dta", replace
saveold "myData.dta", replace version(12)
    save data in Stata format, replacing the data if a file with same name exists (saveold writes a Stata 12-compatible file)
export excel "myData.xls", /*
    */ firstrow(variables) replace
    export data as an Excel file (.xls) with the variable names as the first row
export delimited "myData.csv", delimiter(",") replace
    export data as a comma-delimited file (.csv)
updated March 2016 CC BY 4.0
Data Visualization
with Stata 14.1 Cheat Sheet

BASIC PLOT SYNTAX: graph, then the plot type, then
    variables: y first       y1 y2 … yn x [in] [if],
    plot-specific options    main plot-specific options; see help for complete set
    facet                    by(var)
    annotations              xline(xint) yline(yint) text(y x "annotation")
    titles                   title("title") subtitle("subtitle") xtitle("x-axis title") ytitle("y axis title")
    axes                     xscale(range(low high) log reverse off noline) yscale()
    custom appearance        scheme(s1mono) play(customTheme)
    plot size                xsize(5) ysize(4)
    save                     saving("myPlot.gph", replace)

ONE VARIABLE (sysuse auto, clear)
CONTINUOUS
histogram mpg, width(5) freq kdensity kdenopts(bwidth(5))
    histogram
    bin(#) • width(#) • density • fraction • frequency • percent • addlabels • addlabopts() • normal • normopts() • kdensity • kdenopts()
kdensity mpg, bwidth(3)
    smoothed histogram
    bwidth • kernel() • normal • normopts()
DISCRETE
graph bar (count), over(foreign, gap(*0.5)) intensity(*0.5)
    bar plot; graph hbar draws horizontal bar charts
    (asis) • (percent) • (count) • over() • cw • missing • nofill • allcategories • percentages • stack • bargap(#) • intensity(*#) • yalternate • xalternate
graph bar (percent), over(rep78) over(foreign)
    grouped bar plot; see also graph hbar ...
    (asis) • (percent) • (count) • over() • cw • missing • nofill • allcategories • percentages • stack • bargap(#) • intensity(*#) • yalternate • xalternate

DISCRETE X, CONTINUOUS Y
graph bar (median) price, over(foreign)
    bar plot; see also graph hbar ...
    (asis) • (percent) • (count) • (stat: mean median sum min max ...) • over() • cw • missing • nofill • allcategories • percentages • stack • bargap(#) • intensity(*#) • yalternate • xalternate
graph dot (mean) length headroom, over(foreign) m(1, ms(S))
    dot plot
    (asis) • (percent) • (count) • (stat: mean median sum min max ...) • over() • cw • missing • nofill • allcategories • percentages • linegap(#) • marker(#, ) • linetype(dot | line | rectangle) • dots() • lines() • rectangles() • rwidth
graph hbox mpg, over(rep78, descending) by(foreign) missing
    box plot; graph box draws vertical boxplots
    over() • missing • allcategories • intensity(*#) • boxgap(#) • medtype(line | line | marker) • medline() • medmarker()
vioplot price, over(foreign)
    violin plot (ssc install vioplot)
    over() • nofill • vertical • horizontal • obs • kernel() • bwidth(#) • barwidth(#) • dscale(#) • ygap(#) • ogap(#) • density() • bar() • median() • obsopts()

TWO+ CONTINUOUS VARIABLES
graph matrix mpg price weight, half
    scatter plot of each combination of variables
    half • jitter(#) • jitterseed(#) • diagonal • [aweights()]
twoway scatter mpg weight, jitter(7)
    scatter plot
    jitter(#) • jitterseed(#) • sort • cmissing(yes | no) • connect() • [aweight()]
twoway scatter mpg weight, mlabel(mpg)
    scatter plot with labelled values
    jitter(#) • jitterseed(#) • sort • cmissing(yes | no) • connect() • [aweight()]
twoway connected mpg price, sort(price)
    scatter plot with connected lines and symbols; see also line
    jitter(#) • jitterseed(#) • sort • connect() • cmissing(yes | no)
twoway area mpg price, sort(price)
    line plot with area shading
    sort • cmissing(yes | no) • vertical • horizontal • base(#)
twoway bar price rep78
    bar plot
    vertical • horizontal • base(#) • barwidth(#)
twoway dot mpg rep78
    dot plot
    vertical • horizontal • base(#) • ndots(#) • dcolor() • dfcolor() • dlcolor() • dsize() • dsymbol() • dlwidth() • dotextend(yes | no)
twoway dropline mpg price in 1/5
    dropped line plot
    vertical • horizontal • base(#)
twoway rcapsym length headroom price
    range plot (y1 ÷ y2) with capped lines; see also rcap
    vertical • horizontal
twoway rarea length headroom price, sort
    range plot (y1 ÷ y2) with area shading
    vertical • horizontal • sort • cmissing(yes | no)
twoway rbar length headroom price
    range plot (y1 ÷ y2) with bars
    vertical • horizontal • barwidth(#) • mwidth • msize()
twoway pcspike wage68 ttl_exp68 wage88 ttl_exp88
    Parallel coordinates plot (sysuse nlswide1)
    vertical • horizontal
twoway pccapsym wage68 ttl_exp68 wage88 ttl_exp88
    Slope/bump plot (sysuse nlswide1)
    vertical • horizontal • headlabel

THREE VARIABLES
twoway contour mpg price weight, level(20) crule(intensity)
    3D contour plot
    ccuts(#s) • levels(#) • minmax • crule(hue | chue | intensity) • scolor() • ecolor() • ccolors() • heatmap • interp(thinplatespline | shepard | none)
regress price mpg trunk weight length turn, nocons
matrix regmat = e(V)
ssc install plotmatrix
plotmatrix, mat(regmat) color(green)
    heatmap
    mat() • color() • freq

SUMMARY PLOTS
twoway mband mpg weight || scatter mpg weight
    plot median of the y values
    bands(#)
ssc install binscatter
binscatter weight mpg, line(none)
    plot a single value (mean or median) for each x value
    medians • nquantiles(#) • discrete • controls() • linetype(lfit | qfit | connect | none) • aweight[]

FITTING RESULTS
twoway lfitci mpg weight || scatter mpg weight
    calculate and plot linear fit to data with confidence intervals
    level(#) • stdp • stdf • nofit • fitplot() • ciplot() • range(# #) • n(#) • atobs • estopts() • predopts()
twoway lowess mpg weight || scatter mpg weight
    calculate and plot lowess smoothing
    bwidth(#) • mean • noweight • logit • adjust
twoway qfitci mpg weight, alwidth(none) || scatter mpg weight
    calculate and plot quadratic fit to data with confidence intervals
    level(#) • stdp • stdf • nofit • fitplot() • ciplot() • range(# #) • n(#) • atobs • estopts() • predopts()

REGRESSION RESULTS
regress price mpg headroom trunk length turn
ssc install coefplot
coefplot, drop(_cons) xline(0)
    Plot regression coefficients
    baselevels • b() • at() • noci • levels(#) • keep() • drop() • rename() • horizontal • vertical • generate()
regress mpg weight length turn
margins, eyex(weight) at(weight = (1800(200)4800))
marginsplot, noci
    Plot marginal effects of regression
    horizontal • noci

Plot Placement
JUXTAPOSE (FACET)
twoway scatter mpg price, by(foreign, norescale)
    total • missing • colfirst • rows(#) • cols(#) • holes() • compact • [no]edgelabel • [no]rescale • [no]yrescal • [no]xrescale • [no]iyaxes • [no]ixaxes • [no]iytick • [no]ixtick • [no]iylabel • [no]ixlabel • [no]iytitle • [no]ixtitle • imargin()
graph combine plot1.gph plot2.gph...
    combine 2+ saved graphs into a single plot
SUPERIMPOSE
    combine twoway plots using ||
scatter y3 y2 y1 x, marker(i o i) mlabel(var3 var2 var1)
    plot several y values for a single x value
graph twoway scatter mpg price in 27/74 || scatter mpg price /*
    */ if mpg < 15 & price > 12000 in 27/74, mlabel(make) m(i)

Laura Hughes ([email protected]) • Tim Essam ([email protected]) follow us @flaneuseks and @StataRGIS
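Superimposing (||) and faceting (by()) can be combined in one command. The sketch below draws a linear fit with confidence band under a scatter plot, once per value of foreign, on the auto dataset. The graph name fitByOrigin is illustrative; name() only keeps the graph in memory under that name.

```stata
sysuse auto, clear

* || superimposes the fit and the points; by() draws one panel per group
twoway lfitci mpg weight || scatter mpg weight, ///
    by(foreign, norescale) name(fitByOrigin, replace)

* confirm the graph exists in memory under the chosen name
graph describe fitByOrigin
```

Putting lfitci first draws the confidence band underneath the markers, so the points stay visible.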
updated February 2016
CC BY 4.0
Plotting in Stata 14.1
Customizing Appearance

ANATOMY OF A PLOT
plots contain many features: title, subtitle, annotation, y-axis, y-axis title, y-axis labels, y-line, x-axis, x-axis title, tick marks, grid lines, marker, marker label, line, legend, plot region, inner plot region, graph region, inner graph region

scatter price mpg, graphregion(fcolor("192 192 192") ifcolor("208 208 208"))
    specify the fill of the background (outer region, inner region) in RGB or with a Stata color
scatter price mpg, plotregion(fcolor("224 224 224") ifcolor("240 240 240"))
    specify the fill of the plot background in RGB or with a Stata color

SYNTAX
arguments for the plot objects (in green) go in the options portion of these commands (in orange)
    titles                                    title(...) subtitle(...) xtitle(...) ytitle(...)
    annotation                                text(...)
    axes                                      xscale(...) yscale(...)
    axis labels, tick marks, grid lines       xlabel(...) ylabel(...)
    line                                      xline(...) yline(...)
    legend                                    legend(...) legend(region(...))
for example: scatter price mpg, xline(20, lwidth(vthick))

SYMBOLS
COLOR
    marker    mcolor("145 168 208") mcolor(none)
        specify the fill and stroke of the marker in RGB or with a Stata color
    marker    mfcolor("145 168 208") mfcolor(none)
        specify the fill of the marker
SIZE / THICKNESS
    marker    msize(medium)
        specify the marker size: ehuge vhuge huge vlarge large medlarge medium medsmall small vsmall tiny vtiny
APPEARANCE
    marker    msymbol(Dh)
        specify the marker symbol: O D T S o d t s Oh Dh Th Sh oh dh th sh + X p none i
POSITION
    jitter(#)        randomly displace the markers
    jitterseed(#)    set seed

LINES / BORDERS
COLOR
    line          lcolor("145 168 208") lcolor(none)
        specify the stroke color of the line or border
    marker        mlcolor("145 168 208")
    tick marks    tlcolor("145 168 208")
    grid lines    glcolor("145 168 208")
SIZE / THICKNESS
    line          lwidth(medthick)
        specify the thickness (stroke) of a line: vvvthick vvthick vthick thick medthick medium medthin thin vthin vvthin vvvthin none
    marker        mlwidth(thin)
    tick marks    tlwidth(thin)
    grid lines    glwidth(thin)
APPEARANCE
    line, axes    lpattern(dash)
    grid lines    glpattern(dash)
        specify the line pattern: solid dash dot longdash longdash_dot shortdash shortdash_dot dash_dot blank
    axes          off (no axis/labels) • noline
    tick marks    noticks • tlength(2)
    grid lines    nogrid • nogmin • nogmax
POSITION
    tick marks    xlabel(#10, tposition(crossing))
        number of tick marks, position (outside | crossing | inside)

TEXT
COLOR
    titles, annotation    color("145 168 208") color(none)
        specify the color of the text
    marker label          mlabcolor("145 168 208")
    axis labels           labcolor("145 168 208")
SIZE / THICKNESS
    titles, annotation    size(medsmall)
        specify the size of the text
    marker label          mlabsize(medsmall)
    axis labels           labsize(medsmall)
    vhuge 28 pt. • huge 20 pt. • vlarge 16 pt. • large 14 pt. • medlarge 12 pt. • medium 11 pt. • medsmall 10 pt. • small 8 pt. • vsmall 6 pt. • tiny 4 pt. • half_tiny 2 pt. • third_tiny 1.3 pt. • quarter_tiny 1 pt. • minuscule 1 pt.
APPEARANCE
    marker label    mlabel(foreign)
        label the points with the values of the foreign variable
    axis labels     nolabels
        no axis labels
    axis labels     format(%12.2f)
        change the format of the axis labels
    legend          off
        turn off legend
    legend          label(# "label")
        change legend label text
POSITION
    marker label    mlabposition(5)
        label location relative to marker (clock position: 0 – 12)

Apply Themes
Schemes are sets of graphical parameters, so you don’t have to specify the look of the graphs every time.

USING A SAVED THEME
twoway scatter mpg price, scheme(customTheme)
help scheme entries
    see all options for setting scheme properties
Create custom themes by saving options in a .scheme file
adopath ++ "~//StataThemes"
    set path of the folder (StataThemes) where custom .scheme files are saved
set scheme customTheme, permanently
    change the theme and set it as the default scheme
net inst brewscheme, from("https://wbuchanan.github.io/brewscheme/") replace
    install William Buchanan’s package to generate custom schemes and color palettes (including ColorBrewer)

USING THE GRAPH EDITOR
twoway scatter mpg price, play(graphEditorTheme)
1. Select the Graph Editor
2. Click Record
3. Double click on symbols and areas on plot, or regions on sidebar to customize
4. Unclick Record
5. Save theme as a .grec file

Save Plots
graph twoway scatter y x, saving("myPlot.gph", replace)
    save the graph when drawing
graph save "myPlot.gph", replace
    save current graph to disk
graph combine plot1.gph plot2.gph...
    combine 2+ saved graphs into a single plot
graph export "myPlot.pdf", as(pdf)
    export the current graph as an image file; see options to set size and resolution
geocenter.github.io/StataTraining
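The appearance options above are combined inside one command: marker options go directly after the comma, while line and text options go inside the option for the object they style. A minimal sketch on the auto dataset; the graph name styled and the title text are illustrative.

```stata
sysuse auto, clear

twoway scatter price mpg,                                   ///
    mcolor("145 168 208") msymbol(Dh) msize(medium)         /// marker color, symbol, size
    xline(20, lwidth(vthick) lpattern(dash))                /// line thickness and pattern
    xlabel(#10, labsize(medsmall))                          /// about 10 ticks, smaller axis labels
    title("Price vs. mileage", size(medsmall))              /// title text size
    graphregion(fcolor("192 192 192"))                      /// background fill
    name(styled, replace)
```

The /// continuation markers only work in a .do file; at the command prompt the whole command has to be typed on one line.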
updated June 2016 CC BY 4.0
Data Analysis
with Stata 14.1 Cheat Sheet
Examples use auto.dta (sysuse auto, clear) unless otherwise noted
Results are stored as either r-class or e-class, marked (r) or (e) below. See Programming Cheat Sheet

Summarize Data
univar price mpg, boxplot (ssc install univar)
    calculate univariate summary, with box-and-whiskers plot
stem mpg
    return stem-and-leaf display of mpg
summarize price mpg, detail
    calculate a variety of univariate summary statistics
ci mean mpg price, level(99)
for Stata 13: ci mpg price, level(99)
    (r) compute standard errors and confidence intervals
correlate mpg price
    return correlation or covariance matrix
pwcorr price mpg weight, star(0.05)
    return all pairwise correlation coefficients with sig. levels
mean price mpg
    (e) estimates of means, including standard errors
proportion rep78 foreign
    estimates of proportions, including standard errors for categories identified in varlist
ratio
    estimates of ratio, including standard errors
total price
    estimates of totals, including standard errors

Statistical Tests
tabulate foreign rep78, chi2 exact expected
    tabulate foreign and repair record and return chi2 and Fisher’s exact statistic alongside the expected values
ttest mpg, by(foreign)
    (r) estimate t test on equality of means for mpg by foreign
prtest foreign == 0.5
    one-sample test of proportions
ksmirnov mpg, by(foreign) exact
    Kolmogorov-Smirnov equality-of-distributions test
ranksum mpg, by(foreign) exact
    equality tests on unmatched data (independent samples)
anova systolic drug (webuse systolic, clear)
    (e) analysis of variance and covariance
pwmean mpg, over(rep78) pveffects mcompare(tukey)
    estimate pairwise comparisons of means with equal variances; include multiple comparison adjustment

Declare Data
By declaring data type, you enable Stata to apply data munging and analysis functions specific to certain data types

TIME SERIES (webuse sunspot, clear)
tsset time, yearly
    declare sunspot data to be yearly time series
tsreport
    (r) report time series aspects of a dataset
generate lag_spot = L1.spot
    create a new variable of annual lags of sun spots
tsline spot
    plot time series of sunspots (tsline plot: number of sunspots by year)
arima spot, ar(1/2)
    (e) estimate an auto-regressive model with 2 lags

TIME SERIES OPERATORS
L.     lag                             x(t-1)
L2.    2-period lag                    x(t-2)
F.     lead                            x(t+1)
F2.    2-period lead                   x(t+2)
D.     difference                      x(t) - x(t-1)
D2.    difference of difference        x(t) - x(t-1) - (x(t-1) - x(t-2))
S.     seasonal difference             x(t) - x(t-1)
S2.    lag-2 (seasonal difference)     x(t) - x(t-2)

USEFUL ADD-INS
tscollap        compact time series into means, sums and end-of-period values
carryforward    carry non-missing values forward from one obs. to the next
tsspell         identify spells or runs in time series

SURVIVAL ANALYSIS (webuse drugtr, clear)
stset studytime, failure(died)
    (r) declare survival-time data
stsum
    summarize survival-time data
stcox drug age
    (e) estimate a Cox proportional hazard model

PANEL / LONGITUDINAL (webuse nlswork, clear)
xtset id year
    declare national longitudinal data to be a panel
xtdescribe
    (r) report panel aspects of a dataset
xtsum hours
    summarize hours worked, decomposing standard deviation into between and within components
xtline ln_wage if id <= 22, tlabel(#3)
    plot panel data as a line plot (xtline plot: wage relative to inflation, one panel per id)
xtreg ln_w c.age##c.age ttl_exp, fe vce(robust)
    (e) estimate a fixed-effects model with robust standard errors

SURVEY DATA (webuse nhanes2b, clear)
svyset psuid [pweight = finalwgt], strata(stratid)
    declare survey design for a dataset
svydescribe
    (r) report survey data details
svy: mean age, over(sex)
    estimate a population mean for each subpopulation
svy, subpop(rural): mean age
    (e) estimate a population mean for rural areas
svy: tabulate sex heartatk
    report two-way table with tests of independence
svy: reg zinc c.age##c.age female weight rural
    estimate a regression using survey weights

1 Estimate Models (stores results as e-class)
regress price mpg weight, robust
    estimate ordinary least squares (OLS) model of price on mpg and weight, apply robust standard errors
regress price mpg weight if foreign == 0, cluster(rep78)
    regress price only on domestic cars, cluster standard errors
rreg price mpg weight, genwt(reg_wt)
    estimate robust regression to eliminate outliers
probit foreign turn price, vce(robust)
    estimate probit regression with robust standard errors
logit foreign headroom mpg, or
    estimate logistic regression and report odds ratios
bootstrap, reps(100): regress mpg /*
    */ weight gear foreign
    estimate regression with bootstrapping
jackknife r(mean), double: sum mpg
    jackknife standard error of sample mean

ADDITIONAL MODELS
built-in Stata commands
    pca                  principal components analysis
    factor               factor analysis
    poisson • nbreg      count outcomes
    tobit                censored data
    ivregress • ivreg2   instrumental variables
user-written (ssc install ivreg2)
    diff                 difference-in-difference
    rd                   regression discontinuity
    xtabond • xtabond2   dynamic panel estimator
    psmatch2             propensity score matching
    synth                synthetic control analysis
    oaxaca               Blinder-Oaxaca decomposition

Estimation with Categorical & Factor Variables
more details at http://www.stata.com/manuals14/u25.pdf
CONTINUOUS VARIABLES     measure something
CATEGORICAL VARIABLES    identify a group to which an observation belongs
INDICATOR VARIABLES      denote whether something is true or false (T F)

OPERATOR    DESCRIPTION                        EXAMPLE
i.          specify indicators                 regress price i.rep78
                specify rep78 variable to be an indicator variable
ib.         specify base indicator             regress price ib(3).rep78
                set the third category of rep78 to be the base category
fvset       command to change base             fvset base frequent rep78
                set the base to most frequently occurring category for rep78
c.          treat variable as continuous       regress price i.foreign#c.mpg i.foreign
                treat mpg as a continuous variable and specify an interaction between foreign and mpg
o.          omit a variable or indicator       regress price io(2).rep78
                set rep78 as an indicator; omit observations with rep78 == 2
#           specify interactions               regress price mpg c.mpg#c.mpg
                create a squared mpg term to be used in regression
##          specify factorial interactions     regress price c.mpg##c.mpg
                create all possible interactions with mpg (mpg and mpg squared)

2 Diagnostics (not appropriate after robust cluster( ))
estat hettest
    (r) test for heteroskedasticity
estat ovtest
    test for omitted variable bias
estat vif
    report variance inflation factor
dfbeta length
    calculate measure of influence
rvfplot, yline(0)
    plot residuals against fitted values
avplots
    plot all partial-regression leverage plots in one graph
Type help regress postestimation plots for additional diagnostic plots

3 Postestimation (commands that use a fitted model)
regress price headroom length
    Used in all postestimation examples
display _b[length]
display _se[length]
    return coefficient estimate or standard error for length from most recent regression model
margins, dydx(length)
    (r) return the estimated marginal effect for length; returns e-class information when post option is used
margins, eyex(length)
    return the estimated elasticity for price
predict yhat if e(sample)
    create predictions for sample on which model was fit
predict double resid, residuals
    calculate residuals based on last fit model
test headroom = 0
    (r) test linear hypothesis that headroom estimate equals zero
lincom headroom - length
    test linear combination of estimates (headroom = length)
geocenter.github.io/StataTraining
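The postestimation commands all act on whichever model was fit last. The sketch below runs the cheat sheet's example model on the auto dataset, then reads coefficients, tests one of them, and computes residuals; the variable name resid is illustrative.

```stata
sysuse auto, clear

* fit the model used in the postestimation examples
regress price headroom length

* coefficients and standard errors stay available until the next estimation
display "b[length]  = " _b[length]
display "se[length] = " _se[length]

* Wald test that the headroom coefficient is zero
test headroom = 0

* residuals for the estimation sample only
predict double resid if e(sample), residuals
summarize resid
```

With a constant in the model, OLS residuals average to zero up to rounding, which makes summarize resid a quick sanity check.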
Disclaimer: we are not affiliated with Stata. But we like it.
updated June 2016 CC BY 4.0
Programming
with Stata 14.1
Cheat Sheet
1 Scalars
both r- and e-class results contain scalars
scalar x1 = 3
  create a scalar x1 storing the number 3
scalar a1 = "I am a string scalar"
  create a scalar a1 storing a string
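A short sketch of scalars used in later expressions. The scalar x2 is a hypothetical addition for illustration.

```stata
scalar x1 = 3
scalar a1 = "I am a string scalar"

* Scalars can be used in expressions; scalar() makes the reference explicit
scalar x2 = scalar(x1)^2 + 1
display scalar(x2)
display scalar(a1)
scalar list
```

Writing scalar(x1) rather than x1 avoids ambiguity if the dataset ever contains a variable with the same name.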
2 Matrices
Scalars can hold numeric values or arbitrarily long strings
DISPLAYING & DELETING BUILDING BLOCKS
[scalar | matrix | macro | estimates] [list | drop] b
  list contents of object b or drop (delete) object b
[scalar | matrix | macro | estimates] dir
  list all defined objects for that class
matrix list b
  list contents of matrix b
matrix dir
  list all matrices
scalar drop x1
  delete scalar x1
Macros: public or private variables storing text
GLOBALS: available through Stata sessions
LOCALS: available only in programs, loops, or .do files
R- AND E-CLASS: Stata stores calculation results in two* main classes:
r-class: return results from general commands such as summarize or tabulate
e-class: return results from estimation commands such as regress or mean
To assign values to individual variables use:
1 SCALARS: individual numbers or strings
2 MATRICES: rectangular array of quantities or expressions
3 MACROS: pointers that store text (global or local)
Loops: Automate Repetitive Tasks
ANATOMY OF A LOOP
Labels: x is the temporary variable used only within the loop; var1 var2 var3 are the objects to repeat over; `x' requires local macro notation
* there’s also s- and n-class
GLOBALS are PUBLIC; LOCALS are PRIVATE (available only in programs, loops, or .do files)
summarize price, detail
return list
  returns a list of scalars (r-class)
  scalars:
    r(N) = 74
    r(mean) = 6165.25...
    r(Var) = 8699525.97...
    r(sd) = 2949.49...
    ...
mean price
ereturn list
  returns list of scalars, macros, matrices and functions (e-class)
  scalars:
    e(df_r) = 73
    e(N_over) = 1
    e(N) = 74
    e(k_eq) = 1
    e(rank) = 1
Results are replaced each time an r-class / e-class command is called
generate p_mean = r(mean)
  create a new variable equal to average of price
generate meanN = e(N)
  create a new variable equal to obs. in estimation command
preserve
  create a temporary copy of active dataframe
restore
  restore temporary copy to original point
Set restore points to test code that changes data
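One common use of stored results is to standardize a variable, and preserve/restore lets you try destructive steps safely. A sketch, assuming the auto data; p_mean, p_sd and z_price are hypothetical names.

```stata
sysuse auto, clear

* Copy r() results into scalars before another command overwrites them
quietly summarize price
scalar p_mean = r(mean)
scalar p_sd   = r(sd)
generate double z_price = (price - scalar(p_mean)) / scalar(p_sd)

* Try a change that deletes data, then roll it back
preserve
drop if foreign == 1
count                                // domestic cars only
restore
count                                // full dataset again
```

After restore the dropped observations are back, and z_price is still present because it was created before preserve.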
ACCESSING ESTIMATION RESULTS
After you run any estimation command, the results of the estimates are stored in a structure that you can save, view, compare, and export
regress price weight
estimates store est1
  store previous estimation results est1 in memory for later use
Use estimates store to compile results for later use
ssc install estout
eststo est2: regress price weight mpg
eststo est3: regress price weight mpg foreign
  estimate two regression models and store estimation results
estimates table est1 est2 est3
  print a table of the three estimation results est1, est2 and est3
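A sketch of the same workflow using only built-in commands, so it runs without installing estout. It stores two models, prints them side by side, and then makes the first one active again.

```stata
sysuse auto, clear

regress price weight
estimates store est1                 // keep the first model

regress price weight mpg
estimates store est2                 // keep the second model

estimates table est1 est2, b se      // coefficients and standard errors

* Bring est1 back as the active estimation results
estimates restore est1
display e(df_m)                      // model degrees of freedom of est1
```

After estimates restore, e() refers to the restored model, so postestimation commands apply to est1 rather than to the most recently fitted regression.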
EXPORTING RESULTS
see also while
Stata has three options for repeating commands over lists or values: foreach, forvalues, and while. Though each has a different first line, the syntax is consistent:
foreach x of varlist var1 var2 var3 {
Many Stata commands store results in types of lists. To access these, use return or ereturn commands. Stored results can be scalars, macros, matrices or functions.
matrix ad1 = a \ d
  row bind matrices
matrix ad2 = a , d
  column bind matrices
matselrc b x, c(1 3)
  select columns 1 & 3 of matrix b & store in new matrix x (findit matselrc)
mat2txt, matrix(ad1) saving(textfile.txt) replace
  export a matrix to a text file (ssc install mat2txt)
global pathdata "C:/Users/SantasLittleHelper/Stata"
  define a global variable called pathdata
cd $pathdata
  change working directory by calling global macro; add a $ before calling a global macro
global myGlobal price mpg length
summarize $myGlobal
  summarize price mpg length using global
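Globals are often used to hold variable lists that are reused across commands. A sketch, assuming the auto data; the second global, controls, is hypothetical and not part of the original examples.

```stata
sysuse auto, clear

global myGlobal price mpg length
global controls weight foreign       // hypothetical second list

summarize $myGlobal
regress $myGlobal $controls          // expands to: regress price mpg length weight foreign
display "${myGlobal} ${controls}"    // braces mark where the macro name ends

* Globals persist for the session, so drop them when done
macro drop myGlobal controls
```

Because globals stay defined until dropped or until Stata closes, cleaning them up avoids surprises in later scripts.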
basic components of programming
4 Access & Save Stored r- and e-class Objects
e-class results are stored as matrices
matrix a = (4\ 5\ 6)
  create a 3 x 1 matrix
matrix b = (7, 8, 9)
  create a 1 x 3 matrix
matrix d = b'
  transpose matrix b; store in d
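The dimensions produced by binding and multiplying can be checked directly. A minimal sketch; the product matrix bp is a hypothetical addition.

```stata
matrix a = (4 \ 5 \ 6)               // 3 x 1
matrix b = (7, 8, 9)                 // 1 x 3
matrix d = b'                        // 3 x 1

matrix ad1 = a \ d                   // row bind: stacks to 6 x 1
matrix ad2 = a , d                   // column bind: side by side, 3 x 2
matrix list ad2

display rowsof(ad1) " x " colsof(ad1)
display rowsof(ad2) " x " colsof(ad2)

* 1 x 3 times 3 x 1 gives a 1 x 1 matrix: 7*4 + 8*5 + 9*6
matrix bp = b * a
display bp[1,1]
```

Row binding requires the same number of columns and column binding the same number of rows, which is why b is transposed before being combined with a.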
3 Macros
Building Blocks
The estout and outreg2 packages provide numerous, flexible options for making tables after estimation commands. See also putexcel command.
    command `x', option
    ...
}
open brace must appear on first line
command(s) you want to repeat can be one line or many
close brace must appear on final line by itself
FOREACH: REPEAT COMMANDS OVER STRINGS, LISTS, OR VARIABLES
foreach x in|of [ local, global, varlist, newlist, numlist ] {
    Stata commands referring to `x'
}
list types: objects over which the commands will be repeated
loops repeat the same command over different arguments:
STRINGS
foreach x in auto.dta auto2.dta {
    sysuse "`x'", clear
    tab rep78, missing
}
same as...
sysuse "auto.dta", clear
tab rep78, missing
sysuse "auto2.dta", clear
tab rep78, missing
LISTS
foreach x in "Dr. Nick" "Dr. Hibbert" {
    display length("`x'")
}
same as...
display length("Dr. Nick")
display length("Dr. Hibbert")
When calling a command that takes a string, surround the macro name with quotes.
VARIABLES
foreach x in mpg weight {
    summarize `x'
}
foreach x of varlist mpg weight {
    summarize `x'
}
  must define list type
same as...
summarize mpg
summarize weight
• foreach in takes any list as an argument with elements separated by spaces
• foreach of requires you to state the list type, which makes it faster
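Loops over variables become more useful when the body creates something per variable. A sketch, assuming the auto data; the _c suffix and the centered variables are hypothetical additions.

```stata
sysuse auto, clear

foreach x of varlist mpg weight {
    quietly summarize `x'
    * `x'_c expands to mpg_c, then weight_c
    generate double `x'_c = `x' - r(mean)
    label variable `x'_c "`x' centered at its mean"
}

summarize mpg_c weight_c
```

The macro can appear inside names and strings as well as on its own, which is what lets one loop body build a differently named variable on each pass.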
FORVALUES: REPEAT COMMANDS OVER LISTS OF NUMBERS
forvalues i = 10(10)50 {
    display `i'
}
i is the iterator; 10(10)50 gives the numeric values over which loop will run
same as...
display 10
display 20
...
DEBUGGING CODE
Use display command to show the iterator value at each step in the loop
ITERATORS
i = 10/50          10, 11, 12, ...
i = 10(10)50       10, 20, 30, ...
i = 10 20 to 50    10, 20, 30, ...
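A small sketch of forvalues accumulating a running total, with display used to show the iterator at each step. The local total is a hypothetical name for illustration.

```stata
local total = 0
forvalues i = 10(10)50 {
    display "i is `i'"               // show the iterator at each step
    local total = `total' + `i'
}
display "sum of iterator values: `total'"
```

The range 10(10)50 visits 10, 20, 30, 40 and 50, so the loop body runs five times.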
local myLocal price mpg length
  create local variable called myLocal with the strings price mpg and length
summarize `myLocal'
  summarize contents of local myLocal; add a ` before and a ' after local macro name to call
levelsof rep78, local(levels)
  create a sorted list of distinct values of rep78, store results in a local macro called levels
local varLab: variable label foreign
  store the variable label for foreign in the local varLab (can also do with value labels)

TEMPVARS & TEMPFILES
special locals for loops/programs
tempvar temp1
  initialize a new temporary variable called temp1
generate `temp1' = mpg^2
  save squared mpg values in temp1
summarize `temp1'
  summarize the temporary variable temp1
tempfile myAuto
save `myAuto'
  create a temporary file to be used within a program
see also tempname

esttab est1 est2, se star(* 0.10 ** 0.05 *** 0.01) label
  create summary table with standard errors and labels
esttab using "auto_reg.txt", replace plain se
  export summary table to a text file, include standard errors
outreg2 [est1 est2] using "auto_reg2.txt", see replace
  export summary table to a text file using outreg2 syntax

set trace on (off)
  trace the execution of programs for error checking
see also capture and scalar _rc

PUTTING IT ALL TOGETHER
sysuse auto, clear
generate car_make = word(make, 1)
  pull out the first word from the make variable
levelsof car_make, local(cmake)
  calculate unique groups of car_make and store in local cmake
local i = 1
  define the local i to be an iterator
local cmake_len : word count `cmake'
  store the length of local cmake in local cmake_len
foreach x of local cmake {
    display in yellow "Make group `i' is `x'"
    if `i' == `cmake_len' {
        display "The total number of groups is `i'"
    }
    local i = `++i'
}
  if tests the position of the iterator and executes contents in brackets when the condition is true
  local i = `++i' increments iterator by one

Additional Programming Resources
bit.ly/statacode
  download all examples from this cheat sheet in a .do file
adoupdate
  Update user-written .ado files
adolist
  List/copy user-written .ado files (ssc install adolist)
net install package, from (https://raw.githubusercontent.com/username/repo/master)
  install a package from a Github repository
https://github.com/andrewheiss/SublimeStataEnhanced
  configure Sublime text for Stata 11-14

Tim Essam • Laura Hughes
follow us @StataRGIS and @flaneuseks
geocenter.github.io/StataTraining