[QMM] Statistical formulas
1. Mean

The mean, or average, of a collection of numbers $x_1, x_2, \ldots, x_N$ is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i.$$
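As a quick illustration (not part of the original handout), a minimal Python version of this formula, with a made-up data set:

def mean(xs):
    # x-bar = (x1 + x2 + ... + xN) / N
    return sum(xs) / len(xs)

print(mean([2.0, 4.0, 9.0]))  # (2 + 4 + 9) / 3 = 5.0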
2. Standard deviation

The standard deviation is defined as
$$S = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_N - \bar{x})^2}{N - 1}} = \sqrt{\frac{1}{N - 1}\sum_{i=1}^{N}(x_i - \bar{x})^2}.$$
One may find in some textbooks an alternative version, with $N$ in the denominator. When the author wishes to distinguish between both versions, the '$N$' version is presented as the population standard deviation, while the '$N - 1$' version is the sample standard deviation.
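For illustration, a minimal Python sketch of both versions; the function names are ours:

import math

def sample_sd(xs):
    # 'N - 1' version: S = sqrt( sum (xi - x-bar)^2 / (N - 1) )
    xbar = sum(xs) / len(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

def population_sd(xs):
    # 'N' version: same sum of squares, divided by N
    xbar = sum(xs) / len(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / len(xs))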
3. The normal distribution

The normal density curve is given by a function of the form
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
In this formula, $\mu$ and $\sigma$ are two parameters which are different for each application of the model. A normal density curve has a bell shape (Figure 1). The parameter $\mu$, called the population mean, has a straightforward interpretation: the density curve peaks at $x = \mu$. The parameter $\sigma$, called the population standard deviation, measures the spread of the distribution: the higher $\sigma$, the flatter the bell. The case $\mu = 0$, $\sigma = 1$ is called the standard normal. Probabilities for the normal distribution are calculated as (numerical) integrals of the density. For most people, the only probability needed is
$$p\bigl(\mu - 1.96\,\sigma < X < \mu + 1.96\,\sigma\bigr) = 0.95.$$
This formula provides us with an interval which contains 95% of the population. The "tails" contain the remaining 5%.
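The 0.95 value can be checked numerically. A minimal Python sketch (names ours), using the standard identity $\Phi(z) = (1 + \mathrm{erf}(z/\sqrt{2}))/2$ for the normal CDF:

import math

def normal_density(x, mu, sigma):
    # f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    # P(X < x) via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 0.0, 1.0
# p(mu - 1.96 sigma < X < mu + 1.96 sigma): prints approximately 0.95
print(normal_cdf(mu + 1.96 * sigma, mu, sigma) - normal_cdf(mu - 1.96 * sigma, mu, sigma))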
4. Confidence limits for the mean

The formula for the 95% confidence limits for the mean is
$$\bar{x} \pm 1.96\,\frac{S}{\sqrt{N}}.$$
[Figure 1. Three normal density curves]
Here, $N$ is the number of data points, $\bar{x}$ the sample mean, and $S$ the sample standard deviation. Textbooks recommend replacing the factor 1.96, derived from the normal distribution, with a factor taken from the Student $t$ distribution, but the correction becomes irrelevant when $N$ is high.
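Combining the pieces above, a minimal Python sketch of these limits (function name ours):

import math

def confidence_limits_95(xs):
    # x-bar +/- 1.96 S / sqrt(N), with S the sample ('N - 1') standard deviation
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    half_width = 1.96 * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width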
5. Correlation

For two-dimensional data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, the (linear) correlation is
$$R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \,\sum_i (y_i - \bar{y})^2}}.$$
Always $-1 \le R \le 1$.
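A direct Python transcription of the formula (function name ours):

import math

def correlation(xs, ys):
    # R = sum (xi - x-bar)(yi - y-bar) / sqrt( sum (xi - x-bar)^2 * sum (yi - y-bar)^2 )
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs) * sum((y - ybar) ** 2 for y in ys))
    return num / den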
6. Coefficients of the regression line

Given $N$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, the regression line has an equation $y = b_0 + b_1 x$, in which $b_0$ and $b_1$ are the regression coefficients: $b_1$ is the slope, and $b_0$ the intercept. The formulas are
$$b_1 = R\,\frac{S_Y}{S_X}, \qquad b_0 = \bar{y} - b_1 \bar{x}.$$
$R$ is the linear correlation. $\bar{y}$ and $\bar{x}$ are the means of $Y$ and $X$, respectively. $S_Y$ and $S_X$ are the standard deviations.
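As a sketch (function name ours), both coefficients can be computed from sums of squares; note that the $(N - 1)$ factors in $S_Y / S_X$ cancel:

import math

def regression_coefficients(xs, ys):
    # b1 = R * S_Y / S_X, b0 = y-bar - b1 * x-bar
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    b1 = r * math.sqrt(syy / sxx)  # S_Y / S_X = sqrt(syy / sxx)
    b0 = ybar - b1 * xbar
    return b0, b1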
[Figure 2. Regression lines with R = 0.8 and R = −0.2]

7. R square statistic

In a linear regression equation, the $R^2$ statistic is the proportion of the total variability of the dependent variable explained by the equation:
$$R^2 = \frac{\text{Explained variability}}{\text{Total variability}}.$$
More explicitly, if $y_1, y_2, \ldots, y_N$ are the observed values of the dependent variable $Y$, with mean $\bar{y}$, and $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N$ are the values predicted by the equation,
$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}.$$
Always $0 \le R^2 \le 1$. In simple regression (a single independent variable), $R^2$ coincides with the square of the correlation $R$.
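A minimal Python sketch of this ratio (names ours), taking the observed and predicted values as inputs:

def r_square(ys, yhats):
    # R^2 = sum (yhat_i - y-bar)^2 / sum (yi - y-bar)^2
    ybar = sum(ys) / len(ys)
    explained = sum((yh - ybar) ** 2 for yh in yhats)
    total = sum((y - ybar) ** 2 for y in ys)
    return explained / total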
8. Adjusted R square

An adjusted $R^2$ statistic, defined as
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1},$$
is sometimes used to compare regression equations. $N$ is the number of data points and $p$ the number of independent variables in the equation. The adjustment becomes irrelevant when $N$ is high.
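A one-line Python sketch (name ours), with a worked example: for $R^2 = 0.64$, $N = 100$ and $p = 3$, the adjustment barely moves the value.

def adjusted_r_square(r2, n, p):
    # Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r_square(0.64, 100, 3))  # 0.62875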