CHAPTER 4
Descriptive Statistics

Chapter Contents
4.1 Numerical Description
4.2 Central Tendency
4.3 Dispersion
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Correlation
4.7 Grouped Data
4.8 Skewness and Kurtosis

Chapter Learning Objectives
When you finish this chapter you should be able to
• Explain the concepts of central tendency, dispersion, and shape.
• Use Excel to obtain descriptive statistics and visual displays.
• Calculate and interpret common descriptive statistics.
• Identify the properties of common measures of central tendency.
• Calculate and interpret common measures of dispersion.
• Transform a data set into standardized values.
• Apply the Empirical Rule and recognize outliers.
• Calculate quartiles and other percentiles.
• Make and interpret box plots.
• Calculate and interpret a correlation coefficient.
• Calculate the mean and standard deviation from grouped data.
• Explain the concepts of skewness and kurtosis.
4.1 NUMERICAL DESCRIPTION

The last chapter explained visual descriptions of data (e.g., histograms, dot plots, scatter plots). This chapter explains numerical descriptions of data. Descriptive measures derived from a sample (n items) are statistics, while for a population (N items or infinite) they are parameters. For a sample of numerical data, we are interested in three key characteristics: central tendency, dispersion, and shape. Table 4.1 summarizes the questions that we will be asking about the data.

TABLE 4.1 Characteristics of Numerical Data

| Characteristic | Interpretation |
|---|---|
| Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values? |
| Dispersion | How much variation is there in the data? How spread out are the data values? Are there unusual values? |
| Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal? |

EXAMPLE Vehicle Quality

Every year, J.D. Power and Associates issues its initial vehicle quality ratings. These ratings are of interest to consumers, dealers, and manufacturers. Table 4.2 shows defect rates for a sample of 37 vehicle brands. We will demonstrate how numerical statistics can be used to summarize a data set like this. The brands represented are a random sample that we will use to illustrate certain calculations.
Applied Statistics in Business and Economics
TABLE 4.2 Number of Defects per 100 Vehicles, 2006 Model Year JDPower

| Brand | Defects | Brand | Defects | Brand | Defects |
|---|---|---|---|---|---|
| Acura | 120 | Infiniti | 117 | Nissan | 121 |
| Audi | 130 | Isuzu | 191 | Pontiac | 133 |
| BMW | 142 | Jaguar | 109 | Porsche | 91 |
| Buick | 134 | Jeep | 153 | Saab | 163 |
| Cadillac | 117 | Kia | 136 | Saturn | 129 |
| Chevrolet | 124 | Land Rover | 204 | Scion | 140 |
| Chrysler | 120 | Lexus | 93 | Subaru | 146 |
| Dodge | 132 | Lincoln | 121 | Suzuki | 169 |
| Ford | 127 | Mazda | 150 | Toyota | 106 |
| GMC | 119 | Mercedes-Benz | 139 | Volkswagen | 171 |
| Honda | 110 | Mercury | 129 | Volvo | 133 |
| HUMMER | 171 | MINI | 150 | | |
| Hyundai | 102 | Mitsubishi | 135 | | |

Source: J.D. Power and Associates 2006 Initial Quality Study™. Used with permission.
NOTE: Ratings are intended for educational purposes only, and should not be used as a guide to consumer decisions.
Preliminary Analysis

Before calculating any statistics, we consider how the data were collected. A Web search reveals that J.D. Power and Associates is a well-established independent company whose methods are widely considered to be objective. Data on defects are obtained by inspecting randomly chosen vehicles for each brand, counting the defects, and dividing the number of defects by the number of vehicles inspected. J.D. Power multiplies the result by 100 to obtain defects per 100 vehicles, rounded to the nearest integer. However, the underlying measurement scale is continuous (e.g., if 4 defects were found in 3 Saabs, the defect rate would be 1.333333, or 133 defects per 100 vehicles). Defect rates would vary from year to year, and perhaps even within a given model year, so the timing of the study could affect the results. Since the analysis is based on sampling, we must allow for the possibility of sampling error. With these cautions in mind, we look at the data. The dot plot, shown in Figure 4.1, offers a visual impression of the data. A dot plot can also reveal extreme data values.
FIGURE 4.1 Dot Plot of J.D. Power Data (n = 37 brands) JDPower
[Dot plot; horizontal axis: Defects Per 100 Vehicles, 90 to 210]
Sorting
A good first step is to sort the data. Except for tiny samples, this would be done in Excel, as illustrated in Figure 4.2. Highlight the data array (including the headings), right click, choose Sort > Custom Sort, choose the column to sort on, and click OK. Table 4.3 shows the sorted data for all 37 brands.
Visual Displays

The sorted data in Table 4.3 provide insight into central tendency and dispersion. The values range from 91 (Porsche) to 204 (Land Rover) and the middle values seem to be around 130. The next visual step is a histogram, shown in Figure 4.3. Sturges’ Rule suggests about 6 bins for n = 37, but we use 7 and 12 bins to show more detail. Both histograms are roughly symmetric (maybe slightly right-skewed) with no extreme values. Both show modal classes near 130.
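The sorting step and the bin-count guideline can also be checked outside Excel. The following is a minimal Python sketch, not part of the original workflow; it assumes the common form of Sturges' Rule, k = 1 + log2(n), and the variable names are illustrative only.

```python
import math

# Defects per 100 vehicles for the 37 brands listed in Table 4.2.
defects = [120, 130, 142, 134, 117, 124, 120, 132, 127, 119, 110, 171, 102,
           117, 191, 109, 153, 136, 204, 93, 121, 150, 139, 129, 150, 135,
           121, 133, 91, 163, 129, 140, 146, 169, 106, 171, 133]

ranked = sorted(defects)          # lowest to highest, as in Table 4.3
n = len(ranked)
low, high = ranked[0], ranked[-1]

# Sturges' Rule (assumed form): k = 1 + log2(n), roughly 1 + 3.3*log10(n).
k_sturges = 1 + math.log2(n)

print(n, low, high)               # sample size, smallest and largest values
print(round(k_sturges, 2))        # suggested number of bins before rounding
```

The rule is only a starting point; using more bins than it suggests, as the text does, simply shows more detail.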
FIGURE 4.2 Sorting Data in Excel JDPower

TABLE 4.3 Number of Defects per 100 Vehicles (2006 Model Year) Ranked Lowest to Highest JDPower

| Brand | Defects | Brand | Defects | Brand | Defects |
|---|---|---|---|---|---|
| Porsche | 91 | Chevrolet | 124 | BMW | 142 |
| Lexus | 93 | Ford | 127 | Subaru | 146 |
| Hyundai | 102 | Mercury | 129 | Mazda | 150 |
| Toyota | 106 | Saturn | 129 | MINI | 150 |
| Jaguar | 109 | Audi | 130 | Jeep | 153 |
| Honda | 110 | Dodge | 132 | Saab | 163 |
| Cadillac | 117 | Pontiac | 133 | Suzuki | 169 |
| Infiniti | 117 | Volvo | 133 | HUMMER | 171 |
| GMC | 119 | Buick | 134 | Volkswagen | 171 |
| Acura | 120 | Mitsubishi | 135 | Isuzu | 191 |
| Chrysler | 120 | Kia | 136 | Land Rover | 204 |
| Lincoln | 121 | Mercedes-Benz | 139 | | |
| Nissan | 121 | Scion | 140 | | |

Source: J.D. Power and Associates 2006 Initial Quality Study™. Used with permission.
FIGURE 4.3 Histograms of J.D. Power Data (n = 37 brands) JDPower
[Two histograms of Defects Per 100 Vehicles, one with 7 bins and one with 12 bins; vertical axis: Frequency]
4.2 CENTRAL TENDENCY

When we speak of central tendency we are trying to describe the middle or typical values of a distribution. You can assess central tendency in a general way from a dot plot or histogram, but numerical statistics allow more precise statements. Table 4.4 lists six common measures of central tendency. Each has strengths and weaknesses. We need to look at several of them to obtain a clear picture of central tendency.
Mean

The most familiar statistical measure of central tendency is the mean. It is the sum of the data values divided by the number of data items. For a population we denote it µ, while for a sample we call it $\bar{x}$. We use equation 4.1 to calculate the mean of a population:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} \qquad \text{(population definition)} \tag{4.1}$$

Since we rarely deal with populations, the sample notation of equation 4.2 is more commonly seen:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \text{(sample definition)} \tag{4.2}$$
We calculate the mean by using Excel’s function =AVERAGE(Data) where Data is an array containing the data. So for the sample of n = 37 car brands:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{91 + 93 + 102 + \cdots + 171 + 191 + 204}{37} = \frac{4977}{37} = 134.51$$
Characteristics of the Mean

The arithmetic mean is the “average” with which most of us are familiar. The mean is affected by every sample item. It is the balancing point or fulcrum in a distribution if we view the X-axis as a lever arm and represent each data item as a physical weight, as illustrated in Figure 4.4 for the J.D. Power data.
TABLE 4.4 Six Measures of Central Tendency

| Statistic | Formula | Excel Formula | Pro | Con |
|---|---|---|---|---|
| Mean | $\frac{1}{n}\sum_{i=1}^{n} x_i$ | =AVERAGE(Data) | Familiar and uses all the sample information. | Influenced by extreme values. |
| Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist. | Ignores extremes and can be affected by gaps in data values. |
| Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range. | May not be unique, and is not helpful for continuous data. |
| Midrange | $\frac{x_{\min} + x_{\max}}{2}$ | =0.5*(MIN(Data)+MAX(Data)) | Easy to understand and calculate. | Influenced by extreme values and ignores most data values. |
| Geometric mean (G) | $\sqrt[n]{x_1 x_2 \cdots x_n}$ | =GEOMEAN(Data) | Useful for growth rates and mitigates high extremes. | Less familiar and requires positive data. |
| Trimmed mean | Same as the mean except omit highest and lowest k% of data values (e.g., 5%) | =TRIMMEAN(Data, Percent) | Mitigates effects of extreme values. | Excludes some data values that could be relevant. |
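To make the six measures in Table 4.4 concrete, here is a short Python sketch that computes each of them for the 37 defect rates in Table 4.2. It is an illustration rather than a reproduction of Excel's functions: the helper `trimmed_mean` is a hypothetical name, and it assumes that the number of values dropped from each tail is rounded down, which may differ from how a given package handles trimming.

```python
import math
import statistics

# Defects per 100 vehicles for the 37 brands listed in Table 4.2.
defects = [120, 130, 142, 134, 117, 124, 120, 132, 127, 119, 110, 171, 102,
           117, 191, 109, 153, 136, 204, 93, 121, 150, 139, 129, 150, 135,
           121, 133, 91, 163, 129, 140, 146, 169, 106, 171, 133]

def trimmed_mean(data, pct_each_tail):
    """Mean after dropping pct_each_tail percent of the sorted values from each end.

    Assumption: the count dropped per tail is rounded down to a whole number.
    """
    ranked = sorted(data)
    k = int(len(ranked) * pct_each_tail / 100)
    kept = ranked[k:len(ranked) - k]
    return sum(kept) / len(kept)

mean = sum(defects) / len(defects)
median = statistics.median(defects)
modes = statistics.multimode(defects)            # every value tied for most frequent
midrange = (min(defects) + max(defects)) / 2
geo_mean = math.exp(sum(math.log(x) for x in defects) / len(defects))
trim_5 = trimmed_mean(defects, 5)                # drops one value from each tail here

print(round(mean, 2), median, sorted(modes), midrange)
print(round(geo_mean, 2), round(trim_5, 2))
```

For these data the mode is not unique (several defect rates occur twice), which illustrates the drawback of the mode noted in the table.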
4.10 On Friday night, the owner of Chez Pierre in downtown Chicago noted the amount spent for dinner at 28 four-person tables. (a) Use Excel or MegaStat to find the mean, midrange, geometric mean, and 10 percent trimmed mean (i.e., dropping the first three and last three observations). (b) Do these measures of central tendency agree? Explain. Dinner

95 103 109 170 114 113 124
105 80 104 84 176 69 95
134 68 95 61 108 61 150
107 115 160 52 128 87 136

Answers: Mean = 107.25, Median = 106, Mode = 95, Midrange = 114, G. mean = 102.48, T. mean = 106
4.11 An executive’s telephone log showed the lengths of 65 calls initiated during the last week of July. (a) Use Excel to find the mean, midrange, geometric mean, and 10 percent trimmed mean (i.e., dropping the first seven and last seven observations). (b) Do the measures of central tendency agree? Explain. (c) Are the data symmetric or skewed? If skewed, which direction? (d) Note strengths or weaknesses of each statistic of central tendency for the data. CallLength

1 2 10 5 3 3 2 20 1 6 3 13 2
2 1 26 3 1 1 2 1 7 1 4 1 4
2 2 2 2 1 3 3 6 1 2 13 13 1
2 9 1 12 1 1 8 1 3 3 3 1 2
11 1 1 1 6 5 1 1 5 1 2 18 6

4.12 The number of Internet users in Latin America grew from 15.8 million in 2000 to 60.6 million in 2004. Use the geometric mean to find the mean annual growth rate. (Data are from George E. Belch and Michael A. Belch, Advertising and Promotion [Irwin, 2004], p. 488.)

Answer: G = .399 or 39.9%

4.3 DISPERSION
We can use a statistic such as the mean to describe the center of a distribution. But it is just as important to look at how individual data values are dispersed around the mean. For example, if two NYSE stocks A and B have the same mean return over the last 100 trading days, but A has more day-to-day variation, then the portfolio manager who wants a stable investment would prefer B. Consider possible sample distributions of study time spent by several college students taking an economics class:

[Three dot plots of Study Hours Per Week (scale 0 to 8), labeled Low Dispersion, More Dispersion, and High Dispersion]
Each diagram has the same mean, but they differ in dispersion around the mean. The problem is: how do we describe dispersion in a sample? Since different variables have different means and different units of measurement (dollars, pounds, yen) we are looking for measures of dispersion that can be applied to many situations. Histograms and dot plots tell us something about variation in a data set (the “spread” of data points about the center) but formal measures of dispersion are needed. Table 4.11 lists several common measures of dispersion. All formulas shown are for sample data sets.

Range

The range is the difference between the largest and smallest observation:

$$\text{Range} = x_{\max} - x_{\min} \tag{4.8}$$

For the P/E data the range is:

$$\text{Range} = 91 - 7 = 84$$
A drawback of the range is that it only considers the two extreme data values. It seems desirable to seek a broad-based measure of dispersion that is based on all the data values x1, x2, . . . , xn.
FIGURE 4.26 Possible Quartile Positions
[Diagram: Q1, Q2, and Q3 divide the sorted data into four groups: observations less than Q1; observations less than Q2 but greater than Q1; observations less than Q3 but greater than Q2; observations greater than Q3]
EXAMPLE Method of Medians

A financial analyst has a portfolio of 12 energy equipment stocks. She has data on their recent price/earnings (P/E) ratios. To find the quartiles, she sorts the data, finds Q2 (the median) halfway between the middle two data values, and then finds Q1 and Q3 (medians of the lower and upper halves, respectively) as illustrated in Figure 4.27.

FIGURE 4.27 Method of Medians

| Company | Sorted P/E |
|---|---|
| Maverick Tube | 7 |
| BJ Services | 22 |
| FMC Technologies | 25 |
| Nabors Industries | 29 |
| Baker Hughes | 31 |
| Varco International | 35 |
| National-Oilwell | 36 |
| Smith International | 36 |
| Cooper Cameron | 39 |
| Schlumberger | 42 |
| Halliburton | 46 |
| Transocean | 49 |

Q1 is between x3 and x4 so Q1 = (x3 + x4)/2 = (25 + 29)/2 = 27.0
Q2 is between x6 and x7 so Q2 = (x6 + x7)/2 = (35 + 36)/2 = 35.5
Q3 is between x9 and x10 so Q3 = (x9 + x10)/2 = (39 + 42)/2 = 40.5

Source: Data are from BusinessWeek, November 22, 2004, pp. 95–98.

Formula Method

Statistical software (e.g., Excel, MegaStat, MINITAB) will not use the method of medians, but instead will use a formula to calculate the quartile positions.* There are several possible ways of calculating the quartile positions. The two that you are most likely to use are:

| | Method A (MINITAB) | Method B (Excel or MegaStat) |
|---|---|---|
| Position of Q1 | 0.25n + 0.25 | 0.25n + 0.75 |
| Position of Q2 | 0.50n + 0.50 | 0.50n + 0.50 |
| Position of Q3 | 0.75n + 0.75 | 0.75n + 0.25 |

Q2 is the same using either method, but Q1 and Q3 generally are not. It depends on the gap between data values when interpolation is necessary. Most textbooks prefer MINITAB’s method, so it will be illustrated here.

*The quartiles (25th, 50th, and 75th percentiles) are a special case of percentiles. Method A defines the Pth percentile position as P(n + 1)/100 while Method B defines it as 1 + P(n − 1)/100. See Eric Langford, “Quartiles in Elementary Statistics,” Journal of Statistics Education 14, no. 3, November, 2006.
EXAMPLE Formula Method

Figure 4.28 illustrates the quartile calculations for the same sample of P/E ratios using Method A. The resulting quartiles are similar to those using the method of medians.

FIGURE 4.28 Formula Interpolation Method

| Company | Sorted P/E |
|---|---|
| Maverick Tube | 7 |
| BJ Services | 22 |
| FMC Technologies | 25 |
| Nabors Industries | 29 |
| Baker Hughes | 31 |
| Varco International | 35 |
| National-Oilwell | 36 |
| Smith International | 36 |
| Cooper Cameron | 39 |
| Schlumberger | 42 |
| Halliburton | 46 |
| Transocean | 49 |

Q1 is at observation 0.25n + 0.25 = (0.25)(12) + 0.25 = 3.25, so we interpolate between x3 and x4 to get Q1 = x3 + (0.25)(x4 − x3) = 25 + (0.25)(29 − 25) = 26.00
Q2 is at observation 0.50n + 0.50 = (0.50)(12) + 0.50 = 6.50, so we interpolate between x6 and x7 to get Q2 = x6 + (0.50)(x7 − x6) = 35 + (0.50)(36 − 35) = 35.50
Q3 is at observation 0.75n + 0.75 = (0.75)(12) + 0.75 = 9.75, so we interpolate between x9 and x10 to get Q3 = x9 + (0.75)(x10 − x9) = 39 + (0.75)(42 − 39) = 41.25

Source: Data are from BusinessWeek, November 22, 2004, pp. 95–98.
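The three quartile procedures can be compared side by side on the 12 sorted P/E ratios. The Python sketch below is illustrative only: the function names are hypothetical, the position formulas are the Method A and Method B expressions P(n + 1)/100 and 1 + P(n − 1)/100, and for odd n the method of medians is assumed to leave the middle value out of both halves.

```python
import statistics

def value_at(ranked, pos):
    """Interpolate in a sorted list at a 1-based position such as 3.25."""
    pos = min(max(pos, 1), len(ranked))      # keep the position inside the data
    whole = int(pos)                         # e.g. 3 for position 3.25
    frac = pos - whole                       # e.g. 0.25
    if whole == len(ranked):                 # position is the last observation
        return float(ranked[-1])
    return ranked[whole - 1] + frac * (ranked[whole] - ranked[whole - 1])

def quartiles_method_a(data):
    """Method A: percentile position P(n + 1)/100."""
    ranked, n = sorted(data), len(data)
    return [value_at(ranked, p * (n + 1) / 100) for p in (25, 50, 75)]

def quartiles_method_b(data):
    """Method B: percentile position 1 + P(n - 1)/100."""
    ranked, n = sorted(data), len(data)
    return [value_at(ranked, 1 + p * (n - 1) / 100) for p in (25, 50, 75)]

def quartiles_by_medians(data):
    """Method of medians: Q1 and Q3 are the medians of the lower and upper halves."""
    ranked, n = sorted(data), len(data)
    lower, upper = ranked[: n // 2], ranked[(n + 1) // 2:]
    return [statistics.median(lower), statistics.median(ranked), statistics.median(upper)]

pe = [7, 22, 25, 29, 31, 35, 36, 36, 39, 42, 46, 49]
print(quartiles_by_medians(pe))   # [27.0, 35.5, 40.5]
print(quartiles_method_a(pe))     # [26.0, 35.5, 41.25]
print(quartiles_method_b(pe))     # [28.0, 35.5, 39.75]
```

All three agree on Q2, and the differences in Q1 and Q3 come only from where each method places the quartile between x3 and x4 or between x9 and x10.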
Excel Quartiles

Excel’s function =QUARTILE(Data, k) returns the kth quartile, so =QUARTILE(Data, 1) would return Q1 and =QUARTILE(Data, 3) would return Q3. Excel treats quartiles as a special case of percentiles, so you could get the same results by using the function =PERCENTILE(Data, Percent). For example, =PERCENTILE(Data, 0.75) would return the 75th percentile or Q3.
EXAMPLE P/E Ratios and Quartiles

A financial analyst has a diversified portfolio of 68 stocks. Their recent P/E ratios are shown. She wants to use the quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile). PERatios2

7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

Using Excel’s method of interpolation (Method B), the quartile positions are:

Q1 position: 0.25(68) + 0.75 = 17.75 (interpolate between x17 and x18)
Q2 position: 0.50(68) + 0.50 = 34.50 (interpolate between x34 and x35)
Q3 position: 0.75(68) + 0.25 = 51.25 (interpolate between x51 and x52)

The quartiles are:

First quartile: Q1 = x17 + 0.75(x18 − x17) = 14 + 0.75(14 − 14) = 14
Second quartile: Q2 = x34 + 0.50(x35 − x34) = 19 + 0.50(19 − 19) = 19
Third quartile: Q3 = x51 + 0.25(x52 − x51) = 26 + 0.25(26 − 26) = 26
The median stock has a P/E ratio of 19. A stock with a P/E ratio below 14 is in the bottom quartile, while a stock with a P/E ratio above 26 is in the upper quartile. These statements are easy to understand, and convey an impression both of central tendency and dispersion in the sample. But notice that the quartiles do not provide clean cut-points between groups of observations because of clustering of identical data values on either side of the quartiles (a common occurrence). Since stock prices vary with the stage of the economic cycle, portfolio analysts must revise their P/E benchmarks continually and would actually use a larger sample (perhaps even all publicly traded stocks).
Tip
Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

Quartiles are robust statistics that generally resist outliers. However, quartiles do not always provide clean cutpoints in the sorted data, particularly in small samples or when there are repeating data values. For example:

Data Set A: 1, 2, 4, 4, 8, 8, 8, 8      Q1 = 3, Q2 = 6, Q3 = 8
Data Set B: 0, 3, 3, 6, 6, 6, 10, 15    Q1 = 3, Q2 = 6, Q3 = 8

These two data sets have identical quartiles, but are not really similar. Because of the small sample size and “gaps” in the data, the quartiles do not represent either data set well.
Box Plots

A useful tool of exploratory data analysis (EDA) is the box plot (also called a box-and-whisker plot) based on the five-number summary:

xmin, Q1, Q2, Q3, xmax

The box plot is displayed visually, like this.

[Diagram: a box extending from Q1 to Q3 with a line at Q2, and whiskers extending to Xmin and Xmax]

Below the box plot there is a well-labeled scale showing the values of X. A box plot shows central tendency (position of the median Q2). A box plot shows dispersion (width of the “box” defined by Q1 and Q3 and the range between xmin and xmax). A box plot shows shape (skewness if the whiskers are of unequal length and/or if the median is not in the center of the box). For example, the five-number summary for the 68 P/E ratios is:

7, 14, 19, 26, 91

Figure 4.29 shows a box plot of the P/E data. The vertical lines that define the ends of the box are located at Q1 and Q3 on the X-axis. The vertical line within the box is the median (Q2). The “whiskers” are the horizontal lines that connect each side of the box to xmin and xmax and their length suggests the length of each tail of the distribution. The long right whisker suggests right-skewness in the P/E data, a conclusion also suggested by the fact that the median is to the left of the center of the box (the center of the box is the average of Q1 and Q3).

FIGURE 4.29 Simple Box Plot of P/E Ratios (n = 68 stocks) (Visual Statistics) PERatios2
[Box plot; horizontal axis: P/E Ratio, 0 to 100]
Fences and Unusual Data Values

We can use the quartiles to identify unusual data points. The idea is to detect data values that are far below Q1 or far above Q3. The fences are based on the interquartile range Q3 − Q1:

| | Inner fences | Outer fences | |
|---|---|---|---|
| Lower fence: | Q1 − 1.5(Q3 − Q1) | Q1 − 3.0(Q3 − Q1) | (4.18) |
| Upper fence: | Q3 + 1.5(Q3 − Q1) | Q3 + 3.0(Q3 − Q1) | (4.19) |
Observations outside the inner fences are unusual while those outside the outer fences are outliers. For the P/E data:

| | Inner fences | Outer fences |
|---|---|---|
| Lower fence: | 14 − 1.5(26 − 14) = −4 | 14 − 3.0(26 − 14) = −22 |
| Upper fence: | 26 + 1.5(26 − 14) = +44 | 26 + 3.0(26 − 14) = +62 |

In this example, we can ignore the lower fences (since P/E ratios can’t be negative) but in the right tail there are three unusual P/E values (45, 48, 55) that lie above the inner fence and two P/E values (68, 91) that are outliers because they exceed the outer fence. Unusual data points are shown on a box plot by truncating the whisker at the fences and displaying the unusual data points as dots or asterisks, as in Figure 4.30.

FIGURE 4.30 Box Plot with Fences (MegaStat) PERatios2
[Box plot with fences; horizontal axis: P/E Ratio, 0 to 100]
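The five-number summary and the fence calculations for the 68 P/E ratios can be reproduced with a few lines of code. The following Python sketch is an illustration, not the MegaStat procedure itself; it assumes the Method B (Excel-style) quartile position 1 + P(n − 1)/100, and the helper name `quartile_b` is hypothetical.

```python
# The 68 P/E ratios from the example (PERatios2), already sorted.
pe_ratios = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
             14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
             19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
             26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

def quartile_b(ranked, p):
    """Percentile p by Method B: interpolate at position 1 + p(n - 1)/100."""
    pos = 1 + p * (len(ranked) - 1) / 100
    whole = int(pos)
    frac = pos - whole
    if whole >= len(ranked):
        return float(ranked[-1])
    return ranked[whole - 1] + frac * (ranked[whole] - ranked[whole - 1])

ranked = sorted(pe_ratios)
q1, q2, q3 = (quartile_b(ranked, p) for p in (25, 50, 75))
five_number = (ranked[0], q1, q2, q3, ranked[-1])

iqr = q3 - q1                                  # interquartile range Q3 - Q1
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)       # inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)       # outer fences

# Unusual: outside the inner fences but not beyond the outer fences.
unusual = [x for x in ranked
           if (x < inner[0] or x > inner[1]) and outer[0] <= x <= outer[1]]
# Outliers: beyond the outer fences.
outliers = [x for x in ranked if x < outer[0] or x > outer[1]]

print(five_number)        # (7, 14.0, 19.0, 26.0, 91)
print(inner, outer)       # (-4.0, 44.0) (-22.0, 62.0)
print(unusual, outliers)  # [45, 48, 55] [68, 91]
```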
Midhinge

Quartiles can be used to define an additional measure of central tendency that has the advantage of not being influenced by outliers. The midhinge is the average of the first and third quartiles:

$$\text{Midhinge} = \frac{Q_1 + Q_3}{2} \tag{4.20}$$

The name “midhinge” derives from the idea that, if the “box” were folded at its halfway point, it would resemble a hinge:

[Diagram: a box plot from Xmin to Xmax with the midhinge marked halfway between Q1 and Q3]
Since the midhinge is always exactly halfway between Q1 and Q3 while the median Q2 can be anywhere within the “box,” we have a new way to describe skewness:
Q2 < Midhinge ⇒ Skewed right (longer right tail)
Q2 ≅ Midhinge ⇒ Symmetric (tails roughly equal)
Q2 > Midhinge ⇒ Skewed left (longer left tail)

SECTION EXERCISES

4.22 CitiBank recorded the number of customers to use a downtown ATM during the noon hour on 32 consecutive workdays. (a) Use Excel to find the quartiles. What do they tell you? (b) Find the midhinge. What does it tell you? (c) Make a box plot and interpret it. CitiBank

25 37 23 26 30 40 25 26
39 32 21 26 19 27 32 25
18 26 34 18 31 35 21 33
9 16 32 35 42 15 33 24

Answers: Q1 = 22.50, Q2 = 26.00, Q3 = 33.00
4.23 An executive’s telephone log showed the lengths of 65 calls initiated during the last week of July. (a) Use Excel to find the quartiles. What do they tell you? (b) Find the midhinge. What does it tell you? (c) Make a box plot and interpret it. CallLength

1 2 10 5 3 3 2 20 1 6 3 13 2
2 1 26 3 1 1 2 1 7 1 4 1 4
2 2 2 2 1 3 3 6 1 2 13 13 1
2 9 1 12 1 1 8 1 3 3 3 1 2
11 1 1 1 6 5 1 1 5 1 2 18 6
Mini Case 4.6
Airline Delays UnitedAir

In 2005, United Airlines announced that it would award 500 frequent flier miles to every traveler on flights that arrived more than 30 minutes late on all flights departing from Chicago O’Hare to seven other hub airports (see The Wall Street Journal, June 14, 2005). What is the likelihood of such a delay? On a randomly chosen day (Tuesday, April 26, 2005) the Bureau of Transportation Statistics Web site (www.bts.gov) showed 278 United Airlines departures from O’Hare. The mean arrival delay was −7.45 minutes (i.e., flights arrived early, on average). The quartiles were Q1 = −19 minutes, Q2 = −10 minutes, and Q3 = −3 minutes. While these statistics show that most of the flights arrive early, we must look further to estimate the probability of a frequent flier bonus. In the box plot with fences (Figure 4.31) the “box” is entirely below zero. In the right tail, one flight was slightly above the inner fence (unusual) and eight flights were above the outer fence (outliers). An empirical estimate of the probability of a frequent flier award is 8/278 or about a 3% chance. A longer period of study might alter this estimate (e.g., if there were days of bad winter weather or traffic congestion).

FIGURE 4.31 Box Plot of Flight Arrival Delays
[Box plot with fences; horizontal axis: Arrival Delay (minutes), −60 to 210]

The dot plot (Figure 4.32) shows that the distribution of arrival delays is rather bell-shaped, except for the unusual values in the right tail. This is consistent with the view that “normal” flight operations are predictable, with only random variation around the mean. While it is impossible for flights to arrive much earlier than planned, unusual factors could delay them by a lot.

FIGURE 4.32 Dot Plot of Flight Arrival Delays
[Dot plot; horizontal axis: Arrival Delay (minutes), −60 to 210]
4.6 CORRELATION
You often hear the term “significant correlation” in casual use, often imprecisely or incorrectly. Actually, the sample correlation coefficient is a well-known statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y. The data set consists of n pairs (xi, yi) that are usually displayed on a scatter plot (review Chapter 3 if you need to refresh your memory about making scatter plots). The formula for the sample correlation coefficient is:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{4.21}$$

Its range is −1 ≤ r ≤ +1. When r is near 0 there is little or no linear relationship between X and Y. An r value near +1 indicates a strong positive relationship, while an r value near −1 indicates a strong negative relationship.

[Scale: Strong Negative Correlation (−1.00), No Correlation (0.00), Strong Positive Correlation (+1.00)]
Excel’s formula =CORREL(XData,YData) will return the sample correlation coefficient for two columns (or rows) of paired data. In fact, many cheap scientific pocket calculators will calculate r. The diagrams in Figure 4.33 will give you some idea of what various correlations look like. The correlation coefficient is a measure of the linear relationship—so take special note of the last scatter plot, which shows a relationship but not a linear one.
FIGURE 4.33 Illustration of Correlation Coefficients
[Six scatter plots of Y against X, labeled r = +.90, r = +.50, r = .00, r = −.90, r = −.50, and r = .00 (a nonlinear relationship)]
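Formula 4.21 translates directly into code. The Python sketch below is a minimal illustration; the function name `sample_correlation` and both small data sets are hypothetical and are not taken from the text. The second data set shows why r measures only linear association: Y is an exact function of X, yet r is zero.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r, following formula 4.21."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a moderately strong positive linear relationship.
x_lin = [1, 2, 3, 4, 5]
y_lin = [2, 4, 5, 4, 5]
r_linear = sample_correlation(x_lin, y_lin)

# Hypothetical data with an exact but nonlinear (quadratic) relationship.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [x * x for x in x_curve]
r_curve = sample_correlation(x_curve, y_curve)

print(round(r_linear, 4))   # 0.7746
print(r_curve)              # 0.0
```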
Figure 4.34 shows a scatter plot and correlation coefficient (r = 0.9890) for X = gross leasable area (millions of square feet) and Y = total retail sales (billions of dollars) for retail shopping malls in 28 states. Clearly, this is a very strong linear relationship, as you would expect (more square feet of shopping area would imply more sales). In Chapter 12, you will learn how to determine when a correlation is “significant” in a statistical sense (i.e., significantly different from zero) but for now it is enough to recognize the correlation coefficient as a descriptive statistic.

FIGURE 4.34 Shopping Center Area and Sales (n = 28 states) RetailSales
[Scatter plot titled “Mall Area and Sales Correlation”; horizontal axis: Gross Leasable Area (million sq. ft.), 0 to 800; vertical axis: Retail Sales (billions), 0 to 200; r = 0.9890]
Source: U.S. Census Bureau, Statistical Abstract of the United States, 2007, p. 660.
SECTION EXERCISES

4.24 For each X-Y data set (n = 12) make a scatter plot and find the sample correlation coefficient. Is there a linear relationship between X and Y? If so, describe it. Note: Use Excel or MegaStat or MINITAB if your instructor permits. XYDataSets

Data Set (a)
X: 64.7 25.9 65.6 49.6 50.3 26.7 39.5 56.0 90.8 35.9 39.9 64.1
Y: 5.8 18.1 10.6 11.9 11.4 14.6 15.7 4.4 2.2 15.4 14.7 9.9

Data Set (b)
X: 55.1 59.8 72.3 86.4 31.1 41.8 40.7 36.8 42.7 28.9 24.8 16.2
Y: 15.7 17.5 15.2 20.6 7.3 8.2 9.8 8.2 13.7 11.2 7.5 4.5

Data Set (c)
X: 53.3 18.1 49.8 43.8 68.3 30.4 18.6 45.8 34.0 56.7 60.3 29.3
Y: 10.2 6.9 14.8 13.4 16.8 9.5 16.3 16.4 1.5 11.4 10.9 19.7

Answers: a. r = −0.8841 (strong negative) b. r = 0.9087 (strong positive) c. r = 0.1704 (little or none)
4.25 Make a scatter plot of the following data on X = home size and Y = selling price (thousands of dollars) for new homes (n = 20) in a suburb of an Eastern city. Find the sample correlation coefficient. Is there a linear relationship between X and Y? If so, describe it. Note: Use Excel or MegaStat or MINITAB if your instructor permits. HomePrice

| Square Feet | Selling Price (thousands) | Square Feet | Selling Price (thousands) |
|---|---|---|---|
| 3,570 | 861 | 3,460 | 737 |
| 3,410 | 740 | 3,340 | 806 |
| 2,690 | 563 | 3,240 | 809 |
| 3,260 | 698 | 2,660 | 639 |
| 3,130 | 624 | 3,160 | 778 |
| 3,460 | 737 | 3,310 | 760 |
| 3,340 | 806 | 2,930 | 729 |
| 3,240 | 809 | 3,020 | 720 |
| 2,660 | 639 | 2,320 | 575 |
| 3,160 | 778 | 3,130 | 785 |
4.7 GROUPED DATA

Nature of Grouped Data

Sometimes we must work with observations that have been grouped. When a data set is tabulated into bins, we lose information but gain clarity of presentation, because grouped data are often easier to display than raw data. As long as the bin limits are given, we can estimate the mean and standard deviation. The accuracy of the grouped estimates will depend on the number of bins, distribution of data within bins, and bin frequencies.
Grouped Mean and Standard Deviation

Table 4.19 shows a frequency distribution for prices of Lipitor®, a cholesterol-lowering prescription drug, for three cities (see Mini Case 4.2). The observations are classified into bins of equal width 5. When calculating a mean or standard deviation from grouped data, we treat all observations within a bin as if they were located at the midpoint. For example, in the third class (70 but less than 75) we pretend that all 11 prices were equal to $72.50 (the interval midpoint). In reality, observations may be scattered within each interval, but we hope that on average they are located at the class midpoint.

TABLE 4.19 Worksheet for Grouped Lipitor® Data (n = 47) LipitorGrp

| From | To | $f_j$ | $m_j$ | $f_j m_j$ | $m_j - \bar{x}$ | $(m_j - \bar{x})^2$ | $f_j (m_j - \bar{x})^2$ |
|---|---|---|---|---|---|---|---|
| 60 | 65 | 6 | 62.5 | 375.0 | −10.42553 | 108.69172 | 652.15029 |
| 65 | 70 | 11 | 67.5 | 742.5 | −5.42553 | 29.43640 | 323.80036 |
| 70 | 75 | 11 | 72.5 | 797.5 | −0.42553 | 0.18108 | 1.99185 |
| 75 | 80 | 13 | 77.5 | 1,007.5 | 4.57447 | 20.92576 | 272.03486 |
| 80 | 85 | 5 | 82.5 | 412.5 | 9.57447 | 91.67044 | 458.35220 |
| 85 | 90 | 0 | 87.5 | 0.0 | 14.57447 | 212.41512 | 0.00000 |
| 90 | 95 | 1 | 92.5 | 92.5 | 19.57447 | 383.15980 | 383.15980 |
| Sum | | 47 | | 3,427.5 | | Sum | 2,091.48936 |
| | | | Mean ($\bar{x}$) | 72.925532 | | Std Dev (s) | 6.74293408 |
Each interval j has a midpoint $m_j$ and a frequency $f_j$. We calculate the estimated mean by multiplying the midpoint of each class by its class frequency, taking the sum over all k classes, and dividing by sample size n.

$$\bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3{,}427.5}{47} = 72.925532 \tag{4.22}$$

We then estimate the standard deviation by subtracting the estimated mean from each class midpoint, squaring the difference, multiplying by the class frequency, taking the sum over all classes to obtain the sum of squared deviations about the mean, dividing by n − 1, and taking the square root. Avoid the common mistake of “rounding off” the mean before subtracting it from each midpoint.

$$s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\frac{2{,}091.48936}{47 - 1}} = 6.74293 \tag{4.23}$$

Once we have the mean and standard deviation, we can estimate the coefficient of variation in the usual way:

$$CV = 100(s/\bar{x}) = 100(6.74293/72.925532) = 9.2\%$$
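The grouped calculations in formulas 4.22 and 4.23 can be verified with a short script. This Python sketch uses the class midpoints and frequencies from the Lipitor worksheet; the variable names are illustrative, and the unrounded mean is carried through the deviations, as the text advises.

```python
import math

# Class midpoints and frequencies from the grouped Lipitor worksheet.
midpoints = [62.5, 67.5, 72.5, 77.5, 82.5, 87.5, 92.5]
freqs = [6, 11, 11, 13, 5, 0, 1]

n = sum(freqs)                                            # sample size

# Formula 4.22: weight each midpoint by its class frequency.
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Formula 4.23: frequency-weighted squared deviations about the unrounded mean.
sum_sq = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
std_dev = math.sqrt(sum_sq / (n - 1))

cv = 100 * std_dev / mean                                 # coefficient of variation, percent

print(n, round(mean, 6))          # 47 72.925532
print(round(sum_sq, 5))           # 2091.48936
print(round(std_dev, 5))          # 6.74293
print(round(cv, 1))               # 9.2
```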
Accuracy Issues

How accurate are grouped estimates of $\bar{x}$ and s? Typically, we would have no way of knowing how much information was lost due to grouping. To the extent that observations are not evenly spaced within the bins, accuracy would be lost. Unless there is systematic skewness (say, clustering at the low end of each class) the effects of uneven distributions within bins will tend to average out. Accuracy tends to improve as the number of bins increases. If the first or last class is open-ended, there will be no class midpoint, and therefore no way to estimate the mean. For nonnegative data (e.g., GPA) we can assume a lower limit of zero when the first class is open-ended, although this assumption may make the first class too wide. Such an assumption may occasionally be possible for an open-ended top class (e.g., the upper limit of people’s ages could be assumed to be 100) but many variables have no obvious upper limit (e.g., income). It is usually possible to estimate the median and quartiles from grouped data even with open-ended classes (see LearningStats Unit 04, which gives the formulas and illustrates grouped quartile calculations).

4.8 SKEWNESS AND KURTOSIS

FIGURE 4.35 Skewness Prototype Populations
[Three density curves: Skewed Left (Skewness < 0); Normal (Skewness = 0); Skewed Right (Skewness > 0)]

Skewness

In a general way, skewness (as shown in Figure 4.35) may be judged by looking at the sample histogram, or by comparing the mean and median. However, this comparison is imprecise and does not take account of sample size. When more precision is needed, we look at the sample’s skewness coefficient provided by Excel and MINITAB:

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3 \tag{4.24}$$
This unit-free statistic can be used to compare two samples measured in different units (say, dollars and yen) or to compare one sample with a known reference distribution such as the symmetric normal (bell-shaped) distribution. The skewness coefficient is obtained from Excel’s Tools > Data Analysis > Descriptive Statistics or by the function =SKEW(Data). Table 4.20 shows the expected range within which the sample skewness coefficient would be expected to fall 90 percent of the time if the population being sampled were normal. A sample skewness statistic within the 90 percent range may be attributed to random variation, while coefficients outside the range would suggest that the sample came from a nonnormal population. As n increases, the range of chance variation narrows.
TABLE 4.20 90 Percent Range for Excel’s Sample Skewness Coefficient

Skewed Left (below Lower Limit) | Symmetric (within limits) | Skewed Right (above Upper Limit)

n      Lower Limit   Upper Limit
25     −0.726        0.726
30     −0.673        0.673
40     −0.594        0.594
50     −0.539        0.539
60     −0.496        0.496
70     −0.462        0.462
80     −0.435        0.435
90     −0.411        0.411
100    −0.391        0.391
150    −0.322        0.322
200    −0.281        0.281
300    −0.230        0.230
400    −0.200        0.200
500    −0.179        0.179

Source: Adapted from E. S. Pearson and H. O. Hartley, Biometrika Tables for Statisticians, 3rd ed. (Oxford University Press, 1970), pp. 207–8. Used with permission.
Note: Table and formula employ an adjustment for bias.
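As a rough illustration of formula 4.24, the following Python sketch computes the adjusted sample skewness coefficient directly from its definition. The function name and the two small samples are hypothetical and are not taken from the chapter's data sets.

```python
import math

def sample_skewness(data):
    """Adjusted sample skewness coefficient, following formula 4.24."""
    n = len(data)
    if n < 3:
        raise ValueError("formula 4.24 needs at least 3 observations")
    mean = sum(data) / n
    # Sample standard deviation with the usual n - 1 divisor
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

symmetric = [1, 2, 3, 4, 5]    # hypothetical symmetric sample
right_tail = [1, 2, 3, 4, 20]  # hypothetical sample with one high extreme

print(sample_skewness(symmetric))   # essentially zero
print(sample_skewness(right_tail))  # positive (skewed right)
```

With samples this small the coefficient cannot be compared against Table 4.20, which starts at n = 25; the sketch only shows how the sign of the coefficient follows the direction of the longer tail.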
Applied Statistics in Business and Economics
Kurtosis
Kurtosis refers to the relative length of the tails and the degree of concentration in the center. A normal bell-shaped population is called mesokurtic and serves as a benchmark (see Figure 4.36). A population that is flatter than a normal (i.e., has thinner tails) is called platykurtic, while one that is more sharply peaked than a normal (i.e., has heavier tails) is leptokurtic. Kurtosis is not the same thing as dispersion, although the two are easily confused.

FIGURE 4.36 Kurtosis Prototype Shapes: Platykurtic (flatter peak, Kurtosis < 0), Mesokurtic (normal peak, Kurtosis = 0), Leptokurtic (sharper peak, Kurtosis > 0)
A histogram is an unreliable guide to kurtosis because its scale and axis proportions may vary, so a numerical statistic is needed. Excel and MINITAB use this statistic:
\[ \text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} \tag{4.25} \]
The sample kurtosis coefficient is obtained from Excel’s function =KURT(Data). Table 4.21 shows the range within which sample kurtosis coefficients would be expected to fall 90 percent of the time if the population is normal. A sample coefficient within the ranges shown may be attributed to chance variation, while a coefficient outside this range would suggest that the sample differs from a normal population. As sample size increases, the chance range narrows. Unless you have at least 50 observations, inferences about kurtosis are risky.
TABLE 4.21 90 Percent Range for Excel’s Sample Kurtosis Coefficient

Platykurtic (below Lower Limit) | Mesokurtic (within limits) | Leptokurtic (above Upper Limit)

n      Lower Limit   Upper Limit
50     −0.81         1.23
75     −0.70         1.02
100    −0.62         0.87
150    −0.53         0.71
200    −0.47         0.62
300    −0.40         0.50
400    −0.35         0.43
500    −0.32         0.39

Source: Adapted from E. S. Pearson and H. O. Hartley, Biometrika Tables for Statisticians, 3rd ed. (Oxford University Press, 1970), pp. 207–8. Used with permission.
Note: Table and formula employ an adjustment for bias and subtract 3 so the statistic is centered at 0.
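Formula 4.25 can be sketched the same way. The Python function below is a minimal illustration with a hypothetical name and a hypothetical ten-value sample; it applies the bias adjustment and subtracts the final term so that a normal population is centered at 0.

```python
import math

def sample_kurtosis(data):
    """Adjusted sample kurtosis coefficient, following formula 4.25."""
    n = len(data)
    if n < 4:
        raise ValueError("formula 4.25 needs at least 4 observations")
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction

sample = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # hypothetical sample
print(sample_kurtosis(sample))  # slightly negative
```

As the text cautions, a sample this small says little about the population; the example only demonstrates the arithmetic.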
Mini Case 4.7: Stock Prices
StockPrices
An investor is tracking four stocks, two in the computer data services sector (IBM and EDS) and two in the consumer appliance sector (Maytag and Whirlpool). The analyst chose a two-month period of observation and recorded the closing price of each stock (42 trading days). Figure 4.37 shows MINITAB box plots for the stock prices (note that each has a different price scale).
FIGURE 4.37 Box Plots for Prices of Four Stocks: Stock Closing Prices of Four Companies (n = 42 days). Panels: IBM, EDS, Maytag, Whirlpool.
Looking at the numerical statistics (Table 4.22), EDS’s skewness coefficient (1.04) suggests a right-skewed distribution (for n = 40, the normal skewness range is −.594 to +.594). This conclusion is supported by the EDS box plot, with its long right whisker and an outlier in the right tail. Maytag’s kurtosis coefficient (−1.40) suggests a flatter-than-normal distribution (for n = 50, the kurtosis coefficient range is −0.81 to +1.23), although the sample size is too small to assess kurtosis reliably. Maytag’s coefficient of variation (CV) and coefficient of quartile variation (CQV) are also high. The Maytag box plot supports the view of high relative variation. In addition to patterns in price variation, an investor would consider many other factors (e.g., prospects for growth, dividends, stability, etc.) in evaluating a portfolio.

TABLE 4.22 Four Companies’ Stock Prices (September–October 2004)

Statistic            IBM      EDS      Maytag   Whirlpool
Mean                 86.40    19.86    18.39    59.80
Standard deviation   1.70     0.59     1.59     1.94
Skewness             0.51     1.04     0.23     −0.48
Kurtosis             −0.62    0.72     −1.40    −0.16
CV (%)               2.0%     3.0%     8.6%     3.2%
Quartile 1           84.97    19.40    17.07    58.53
Quartile 2           86.25    19.80    18.10    60.20
Quartile 3           87.37    20.13    20.11    61.28
CQV (%)              1.4%     1.8%     8.2%     2.3%

Source: Data are from the Center for Research in Security Prices (CRSP®), a financial research center at the University of Chicago Graduate School of Business. Example is for statistical education only, and not as a guide to investment decisions.
Excel Hints

Hint 1: Formats When You Copy Data from Excel
Excel’s dollar format (e.g., $214.07) or comma format (e.g., 12,417) will cause many statistical packages (e.g., MINITAB or Visual Statistics) to interpret the pasted data as text (because “$” and “,” are not numbers). For example, in MINITAB a column heading C1-T indicates that the data column is text. Text cannot be analyzed numerically, so you can’t get means, medians, etc. Check the format before you copy and paste.

Hint 2: Decimals When You Copy Data from Excel
Suppose you have adjusted Excel’s decimal cell format to display 2.4 instead of 2.35477. When you copy this cell and paste it into MINITAB, the pasted cell contains 2.4 (not 2.35477). Thus, Excel’s statistical calculations (based on 2.35477) will not agree with MINITAB (likewise for Visual Statistics). If you copy several columns of data (e.g., for a regression model), the differences can be serious.
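Both hints can be illustrated with a short Python sketch. It is not a model of how MINITAB or Excel handle pasted data; the helper name is hypothetical, and apart from 2.35477 the data values are made up for illustration.

```python
def parse_formatted_number(text):
    """Strip '$' and ',' so a formatted string such as '$214.07' becomes a number."""
    return float(text.replace("$", "").replace(",", ""))

# Hint 1: formatted values arrive as text and must be cleaned before analysis
print(parse_formatted_number("$214.07"))  # 214.07
print(parse_formatted_number("12,417"))   # 12417.0

# Hint 2: statistics computed from displayed (rounded) values can differ
full_precision = [2.35477, 2.44912, 2.25103]      # hypothetical cell contents
displayed = [round(x, 1) for x in full_precision]  # what a one-decimal display shows

mean_full = sum(full_precision) / len(full_precision)
mean_displayed = sum(displayed) / len(displayed)
print(mean_full, mean_displayed)  # the two means do not agree
```

The gap between the two means is small here, but it grows when rounded values feed into calculations involving several columns.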
CHAPTER SUMMARY

The mean and median describe a sample’s central tendency and also indicate skewness. The mode is useful for discrete data with a small range. The trimmed mean eliminates extreme values. The geometric mean mitigates high extremes but fails when zeros or negative values are present. The midrange is easy to calculate but is sensitive to extremes. Dispersion is typically measured by the standard deviation while relative dispersion is given by the coefficient of variation for nonnegative data. Standardized data reveal outliers or unusual data values, and the Empirical Rule offers a comparison with a normal distribution. In measuring dispersion, the mean absolute deviation or MAD is easy to understand, but lacks nice mathematical properties. Quartiles are meaningful even for fairly small data sets, while percentiles are used only for large data sets. Box plots show the quartiles and data range. The correlation coefficient measures the degree of linearity between two variables. We can estimate many common descriptive statistics from grouped data. Sample coefficients of skewness and kurtosis allow more precise inferences about the shape of the population being sampled instead of relying on histograms.

KEY TERMS

Central Tendency: geometric mean, mean, median, midhinge, midrange, mode, trimmed mean

Dispersion: Chebyshev’s Theorem, coefficient of variation, Empirical Rule, mean absolute deviation, outliers, population variance, range, sample variance, standard deviation, standardized variable, two-sum formula

Shape: bimodal distribution, kurtosis, leptokurtic, mesokurtic, multimodal distribution, negatively skewed, platykurtic, positively skewed, skewed left, skewed right, skewness, skewness coefficient, symmetric data

Other: box plot, five-number summary, interquartile range, method of medians, quartiles, sample correlation coefficient
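Commonly Used Formulas in Descriptive Statistics

Sample mean: \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)

Geometric mean: \( G = \sqrt[n]{x_1 x_2 \cdots x_n} \)

Range: \( \text{Range} = x_{\max} - x_{\min} \)

Midrange: \( \text{Midrange} = \frac{x_{\min} + x_{\max}}{2} \)

Sample standard deviation: \( s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \)

Coefficient of variation: \( CV = 100 \times \frac{s}{\bar{x}} \)

Standardized variable: \( z_i = \frac{x_i - \mu}{\sigma} \)

Midhinge: \( \text{Midhinge} = \frac{Q_1 + Q_3}{2} \)

Sample correlation coefficient: \( r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \)

Grouped mean: \( \bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} \)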
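Several of the formulas above translate almost line for line into Python. The sketch below is one possible rendering, with hypothetical function names; it is checked against the raisin counts from exercise 4.30, whose mean and midrange are given in that exercise's margin answer.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Fails, as the chapter notes, when zeros or negative values are present
    if any(x <= 0 for x in xs):
        raise ValueError("geometric mean needs strictly positive data")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def midrange(xs):
    return (min(xs) + max(xs)) / 2

def sample_std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def coefficient_of_variation(xs):
    return 100 * sample_std(xs) / mean(xs)

def standardize(x, mu, sigma):
    return (x - mu) / sigma

def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))

raisins = [23, 33, 44, 36, 29, 42, 31, 33, 61, 36, 34, 23, 24]  # exercise 4.30
print(round(mean(raisins), 2))  # 34.54
print(midrange(raisins))        # 42.0
print(coefficient_of_variation(raisins))
```

The geometric mean is computed through logarithms rather than a direct product so that long data sets do not overflow; that is a design choice of this sketch, not something the chapter prescribes.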
CHAPTER REVIEW

1. What are descriptive statistics? How do they differ from visual displays of data?
2. Explain each concept: (a) central tendency, (b) dispersion, and (c) shape.
3. (a) Why is sorting usually the first step in data analysis? (b) Why is it useful to begin a data analysis by thinking about how the data were collected?
4. List strengths and weaknesses of each measure of central tendency and write its Excel function: (a) mean, (b) median, and (c) mode.
5. (a) Why must the deviations around the mean sum to zero? (b) What is the position of the median in the data array when n is even? When n is odd? (c) Why is the mode of little use in continuous data? For what type of data is the mode most useful?
6. (a) What is a bimodal distribution? (b) Explain two ways to detect skewness.
7. List strengths and weaknesses of each measure of central tendency and give its Excel function (if any): (a) midrange, (b) geometric mean, and (c) 10 percent trimmed mean.
8. (a) What is dispersion? (b) Name five measures of dispersion. List the main characteristics (strengths, weaknesses) of each measure.
9. (a) Which standard deviation formula (population, sample) is used most often? Why? (b) When is the coefficient of variation useful? When is it useless?
10. (a) To what kind of data does Chebyshev’s Theorem apply? (b) To what kind of data does the Empirical Rule apply? (c) What is an outlier? An unusual data value?
11. (a) In a normal distribution, approximately what percent of observations are within 1, 2, and 3 standard deviations of the mean? (b) In a sample of 10,000 observations, about how many observations would you expect beyond 3 standard deviations of the mean?
12. (a) Write the mathematical formula for a standardized variable. (b) Write the Excel formula for standardizing a data value in cell F17 from an array with mean Mu and standard deviation Sigma.
13. (a) Why is it dangerous to delete an outlier? (b) When might it be acceptable to delete an outlier?
14. (a) Explain how quartiles can measure both centrality and dispersion. (b) Why don’t we calculate percentiles for small samples?
15. (a) Explain the method of medians for calculating quartiles. (b) Write the Excel formula for the first quartile of an array named XData.
16. (a) What is a box plot? What does it tell us? (b) What is the role of fences in a box plot? (c) Define the midhinge and midspread (interquartile range).
17. What does a correlation coefficient measure? What is its range?
18. (a) Why is some accuracy lost when we estimate the mean or standard deviation from grouped data? (b) Why do open-ended classes in a frequency distribution make it impossible to estimate the mean and standard deviation? (c) When would grouped data be presented instead of the entire sample of raw data?
19. Optional (a) What is the skewness coefficient of a normal distribution? A uniform distribution? (b) Why do we need a table for sample skewness coefficients that is based on sample size?
20. Optional (a) What is kurtosis? (b) Sketch a platykurtic population, a leptokurtic population, and a mesokurtic population. (c) Why can’t we rely on a histogram to assess kurtosis?
CHAPTER EXERCISES

Note: Unless otherwise instructed, you may use any desired statistical software for calculations and graphs in the following problems.

DESCRIBING DATA

4.26 Below are monthly rents paid by 30 students who live off campus. (a) Find the mean, median, mode, standard deviation, and quartiles. (b) Describe the “typical” rent paid by a student. (c) Do the measures of central tendency agree? Explain. (d) Sort and standardize the data. (e) Are there outliers or unusual data values? (f) Using the Empirical Rule, do you think the data could be from a normal population? Rents

730   730   690   1,030   730   930   560   740   650   660
850   930   600   620     760   690   710   500   730   800
820   840   720   700     740   700   620   570   720   670

Margin answer: a. Mean = 724.67, Median = 720, Mode = 730, S.D. = 114.3
4.27 How many days in advance do travelers purchase their airline tickets? Below are data showing the advance days for a sample of 28 passengers on United Airlines Flight 815 from Chicago to Los Angeles. (a) Prepare a dot plot and discuss it. (b) Calculate the mean, median, mode, and midrange. (c) Calculate the quartiles, midhinge, and coefficient of quartile variation. (d) Why can’t you use the geometric mean for this data set? (e) Which is the best measure of central tendency? Why? (Data are from The New York Times, April 12, 1998.) Days

11   7    11   4    15   14   71   29   8    7    16   28   17   249
0    20   77   18   14   3    15   52   20   0    9    9    21   3

4.28 In a particular week, the cable channel TCM (Turner Classic Movies) showed seven movies rated ****, nine movies rated ***, three movies rated **, and one movie rated *. Which measure of central tendency would you use to describe the “average” movie rating on TCM (assuming that week was typical)? (Data are from TV Guide.)

Margin answer: Mode (not ratio or interval data)

4.29 The “expense ratio” is a measure of the cost of managing the portfolio. Investors prefer a low expense ratio, all else equal. Below are expense ratios for 23 randomly chosen stock funds and 21 randomly chosen bond funds. (a) Calculate the mean, median, and mode for each sample. (b) Succinctly compare central tendency in expense ratios for stock funds and bond funds. (c) Calculate the standard deviation and coefficient of variation for each sample. Which type of fund has more variability? Explain. (d) Calculate the quartiles and midhinge. What do they tell you? (Data are from Money 32, no. 2 [February 2003]. Stock funds were selected from 1,699 funds by taking the 10th fund on each page in the list. Bond funds were selected from 499 funds by taking the 10th, 20th, and 30th fund on each page in the list.) Funds
23 Stock Funds 1.12
1.44
1 .2 7
1 .7 5
0 .9 9
1 .4 5
1 .1 9
1 .2 2
0 .9 9
3 .1 8
1 .2 1
0.60
2.10
0 .7 3
0 .9 0
1 .7 9
1 .3 5
1 .0 8
1 .2 8
1 .2 0
1 .6 8
0 .1 5
0 .9 4
0 .7 5
1 .8 9
21 Bond Funds 1.96
0.51
1 .1 2
0 .6 4
0 .6 9
0 .2 0
1 .4 4
0 .6 8
0 .4 0
0.93
1.25
0 .8 5
0 .9 9
0 .9 5
0 .3 5
0 .6 4
0 .4 1
0 .9 0
1 .7 7
4.30 Statistics students were asked to fill a one-cup measure with raisin bran, tap the cup lightly on the counter three times to settle the contents, if necessary add more raisin bran to bring the contents exactly to the one-cup line, spread the contents on a large plate, and count the raisins. The 13 students who chose Kellogg’s Raisin Bran obtained the results shown below. (a) Use Excel to calculate the mean, median, mode, and midrange. (b) Which is the best measure of central tendency, and why? (c) Calculate the standard deviation and coefficient of variation. (d) Why is there variation in the number of raisins in a cup of raisin bran? Why might it be difficult for Kellogg to reduce variation? Raisins

23   33   44   36   29   42   31   33   61   36   34   23   24

Margin answer: a. Mean = 34.54, Median = 33.0, Modes = 23 and 33, Midrange = 42
4.31 The table below shows estimates of the total cost of repairs in four bumper tests (full frontal, front corner, full rear, rear corner) on 17 cars. (a) Make a dot plot for the data. (b) Calculate the mean and median. (c) Would you say the data are skewed? (d) Why is the mode not useful for these data? (Data are from Insurance Institute for Highway Safety, Detroit Free Press, March 1, 2007, p. 2E.) CrashDamage

Car Tested           Damage
Chevrolet Malibu     6,646
Chrysler Sebring     7,454
Ford Fusion          5,030
Honda Accord         8,010
Hyundai Sonata       7,565
Kia Optima           5,735
Mazda 6              4,961
Mitsubishi Galant    4,277
Nissan Altima        6,459
Nissan Maxima        9,051
Pontiac G6           8,919
Saturn Aura          6,374
Subaru Legacy        7,448
Toyota Camry         4,911
Volkswagen Jetta     9,020
Volkswagen Passat    8,259
Volvo S40            5,600
4.32 Salt-sensitive people must be careful of sodium content in foods. The sodium content (milligrams) b. in a 3-tablespoon serving of 33 brands of peanut butter is shown below. (a) Prepare a dot plot and discuss it. (b) Calculate the mean, median, mode, and midrange. (c) Which is the best measure of central tendency? The worst? Why? (d) Why would the geometric mean not work here? (e) Sort and standardize the data. (f) Are there outliers? Unusual data values? (Data are fromConsumer Reports 67, no. 5.) Sodium 98
22 5
2 25
225
23
0
225
8
3 75
225
27 0
2 85
165
16 5
1 80
180
0
3 00
210
0
180
21 0
2 10 1 80
225
210
195
16 5
195
180
18 8
Mean 179.9 Median 195.0 Mode 225 Midrange 187.5
= = =
=
240
173
4.33 Below are the lengths (in yards) of 27 18-hole golf courses in Oakland County, Michigan. (a) Prepare a dot plot and discuss it. (b) Calculate the mean, median, mode, and midrange. (c) Which is the best measure of central tendency? The worst? Why? (d) Why would the geometric mean pose a problem for this data set? (Data are from Detroit Free Press, April 13, 1995.) Golf

5646   5767   5800   5820   6005   6078   6100   6110   6179
6186   6306   6366   6378   6400   6470   6474   6494   6500
6500   6554   6555   6572   6610   6620   6647   6845   7077
4.34 A false positive occurs when a radiologist who interprets a mammogram concludes that cancer is present, but a biopsy subsequently shows no breast cancer. False positives are unavoidable because mammogram results often are ambiguous. Below are false positive rates (percent) for 24 radiologists who interpreted a total of 8,734 mammograms. (a) Prepare a dot plot and interpret it. (b) Calculate the mean, median, mode, and midrange. (c) Which is the best measure of central tendency? The worst? Why? (Data are from Joann G. Elmore et al., “Screening Mammograms by Community Radiologists: Variability in False Positive Rates,” Journal of the National Cancer Institute 94, no. 18 [September 18, 2002], p. 1376.)

8.5   4.9   12.5   2.6   7.6   15.9   4.0    6.9   6.0    6.7    6.5    9.5
5.6   2.7   9.0    5.3   9.0   4.4    10.8   3.5   10.2   11.9   12.2   4.2

Margin answer: b. Mean = 7.52, Median = 6.80, Mode = 9.0, Midrange = 9.25

4.35 The table below shows percentiles of height (in cm) for 20-year-old males and females. (a) Calculate the midhinge and coefficient of quartile variation (CQV). Why are these statistics appropriate to measure centrality and dispersion in this situation? (b) Choose a 20 year old whose height you know and describe that person’s height (Note: 1 in. = 2.54 cm) in comparison with these percentiles. (c) Do you suppose that height percentiles change over time for the population of a specified nation? Explain. (Data are from the National Center for Health Statistics, www.fedstats.gov.)

Selected Percentiles for Heights of 20 Year Olds (cm)

Gender   5%    25%   50%   75%   95%
Male     165   172   177   182   188
Female   153   159   163   168   174

4.36 Grace took a random sample of the number of steps per minute from the electronic readout of her aerobic climbing machine during a 1-hour workout. (a) Calculate the mean, median, and mode. (b) Which is the best measure of central tendency? The worst? Why? (Data are from a project by Grace Obringer, MBA student.) Steps

90   110   97   144   54   60   156   86   82   64   100   47   80   164   93

Margin answer: a. Mean = 95.1, Median = 90.0, Mode = none
4.37 How much revenue does it take to maintain a cricket club? The following table shows annual income for 18 first-class clubs that engage in league play. (a) Calculate the mean, median, and mode. Show your work carefully. (b) Describe a “typical” cricket club’s income. (Data are from The Economist 367, no. 8329 [June 21, 2003], p. 47.) Cricket

Annual Income of First-Class Cricket Clubs in England

Club               Income (£000)
Lancashire         5,366
Surrey             6,386
Derbyshire         2,088
Middlesex          2,280
Somerset           2,544
Nottinghamshire    3,669
Kent               2,894
Leicestershire     2,000
Sussex             2,477
Durham             3,009
Worcestershire     2,446
Gloucestershire    2,688
Northamptonshire   2,416
Glamorgan          2,133
Essex              2,417
Warwickshire       4,272
Yorkshire          2,582
Hampshire          2,557

4.38 A plumbing supplier’s mean monthly demand for vinyl washers is 24,212 with a standard deviation of 6,053. The mean monthly demand for steam boilers is 6.8 with a standard deviation of 1.7. Compare the dispersion of these distributions. Which demand pattern has more relative variation? Explain.

Margin answer: Both CVs are the same at 25%.

4.39 The table below shows average daily sales of Rice Krispies in the month of June in 74 Noodles & Company restaurants. (a) Make a histogram for the data. Would you say the distribution is skewed? (b) Calculate the mean, median, and mode. Which best describes central tendency, and why? (c) Find the standard deviation. (d) Are there any outliers? RiceKrispies

32   8    14   20   16   29   11   34   28   31
19   18   22   17   27   24   49   25   18   25
21   15   16   20   21   29   14   25   10   21
28   27   26   12   24   18   19   24   16   17
20   23   13   17   17   19   36   16   34   25
15   16   13   20   13   13   23   17   22   11
17   17   9    15   37   8    31   12   16   12
16   16   11   19

4.40 Analysis of portfolio returns over the period 1981–2000 showed the statistics below. (a) Calculate and compare the coefficients of variation. (b) Why would we use a coefficient of variation? Why not just compare the standard deviations? (c) What do the data tell you about risk and return at that time period? Returns

Comparative Returns on Four Types of Investments

Investment                 Mean Return   Standard Deviation   Coefficient of Variation
Venture funds (adjusted)   19.2          14.0                 72.9
All common stocks          15.6          14.0                 89.7
Real estate                11.5          16.8                 146.1
Federal short term paper   6.7           1.9                  28.4

Source: Dennis D. Spice and Stephen D. Hogan, “Venture Investing and the Role of Financial Advisors,” Journal of Financial Planning 15, no. 3 (March 2002), p. 69. These statistics are for educational use only and should not be viewed as a guide to investing.

Margin answer: c. Real estate has highest CV and Fed funds lowest.
4.41 Analysis of annualized returns over the period 1991–2001 showed that prepaid tuition plans had a mean return of 6.3 percent with a standard deviation of 2.7 percent, while the Standard & Poor’s 500 stock index had a mean return of 12.9 percent with a standard deviation of 15.8 percent. (a) Calculate and compare the coefficients of variation. (b) Why would we use a coefficient of variation? Why not just compare the standard deviations? (c) What do the data say about risk and return of these investments at that time? (Data are from Mark C. Neath, “Section 529 Prepaid Tuition Plans: A Low Risk Investment with Surprising Applications,” Journal of Financial Planning 15, no. 4 [April 2002], p. 94.)

4.42 Caffeine content in a 5-ounce cup of brewed coffee ranges from 60 to 180 mg, depending on brew time, coffee bean type, and grind. (a) Use the midrange to estimate the mean. (b) Why is the assumption of a normal, bell-shaped distribution important in making these estimates? (c) Why might caffeine content of coffee not be normal? (Data are from Popular Science 254, no. 5 [May 1999].)

Margin answer: Mean ≈ 120, S.D. ≈ 20 if normal

4.43 Chlorine is added to all city water to kill bacteria. In 2001, chlorine content in water from the Lake Huron Water Treatment plant ranged from 0.79 ppm (parts per million) to 0.92 ppm. (a) Use the midrange to estimate the mean. (b) Why is it reasonable to assume a normal distribution in this case? (Data are from City of Rochester Hills, Michigan, 2002 Water Quality Report.)

THINKING ABOUT DISTRIBUTIONS

4.44 At the Midlothian Independent Bank, a study shows that the mean ATM transaction takes 74 seconds, the median 63 seconds, and the mode 51 seconds. (a) Sketch the distribution, based on these statistics. (b) What factors might cause the distribution to be like this?

4.45 At the Eureka library, the mean time a book is checked out is 13 days, the median is 10 days, and the mode is 7 days. (a) Sketch the distribution, based on these statistics. (b) What factors might cause the distribution to be like this?

4.46 On Professor Hardtack’s last cost accounting exam, the mean score was 71, the median was 77, and the mode was 81. (a) Sketch the distribution, based on these statistics. (b) What factors might cause the distribution to be like this?

Margin answer: Left-skewed

4.47 (a) Sketch the histogram you would expect for the number of DVDs owned by n randomly chosen families. (b) Describe the expected relationship between the mean, median, and mode. How could you test your ideas about these data?

4.48 (a) Sketch the histogram you would expect for the price of regular gasoline yesterday at n service stations in your area. (b) Guess the range and median. (c) How could you test your ideas about these data?

Margin answer: Answers will vary

4.49 The median life span of a mouse is 118 weeks. (a) Would you expect the mean to be higher or lower than 118? (b) Would you expect the life spans of mice to be normally distributed? Explain. (Data are from Science News 161, no. 3 [January 19, 2002].)

4.50 The median waiting time for a liver transplant in the U.S. is 1,154 days for patients with type O blood (the most common blood type). (a) Would you expect the mean to be higher or lower than 1,154 days? Explain. (b) If someone dies while waiting for a transplant, how should that be counted in the average? (Data are from The New York Times, September 21, 2003.)

Margin answer: a. Mean > Median. Right-skewed.

4.51 A small suburban community agreed to purchase police services from the county sheriff’s department. The newspaper said, “In the past, the charge for police protection from the Sheriff’s Department has been based on the median cost of the salary, fringe benefits, etc. That is, the cost per deputy was set halfway between the most expensive deputy and the least expensive.” (a) Is this the median? If not, what is it? (b) Which would probably cost the city more, the midrange or the median? Why?

4.52 A company’s contractual “trigger” point for a union absenteeism penalty is a certain distance above the mean days missed by all workers. Now the company wants to switch the trigger to a certain distance above the median days missed for all workers. (a) Visualize the distribution of missed days for all workers (symmetric, skewed left, skewed right). (b) Discuss the probable effect on the trigger point of switching from the mean to the median. (c) What position would the union be likely to take on the company’s proposed switch?

Margin answer: c. Mean > Median. Union would oppose the change.

EXCEL PROJECTS
4.53 (a) Use Excel functions to calculate the mean and standard deviation for weekend occupancy rates (percent) in nine resort hotels during the off-season. (b) What conclusion would a casual observer draw about centrality and dispersion, based on your statistics? (c) Now calculate the median for each sample. (d) Make a dot plot for each sample. (e) What did you learn from the medians and dot plots that was not apparent from the means and standard deviations? Occupancy

Observation   Week 1   Week 2   Week 3   Week 4
1             32       33       38       37
2             41       35       39       42
3             44       45       39       45
4             47       50       40       46
5             50       52       56       47
6             53       54       57       48
7             56       58       58       50
8             59       59       61       67
9             68       64       62       68

4.54 (a) Enter the Excel function =ROUND(NORMINV(RAND(),70,10),0) in cells B1:B100. This will create 100 random data points from a normal distribution using parameters µ = 70 and σ = 10. Think of these numbers as exam scores for 100 students. (b) Use the Excel functions =AVERAGE(B1:B100) and =STDEV(B1:B100) to calculate the sample mean and standard deviation for your data array. (c) Every time you press F9 you will get a new sample. Watch the sample statistics and compare them with the desired parameters µ = 70 and σ = 10. Do Excel’s random samples have approximately the desired characteristics? (d) Use Excel’s =MIN(B1:B100) and =MAX(B1:B100) to find the range of your samples. Do the sample ranges look as you would expect from the Empirical Rule?

Margin answer: Sample is large enough that statistics will be close.
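For readers working outside Excel, the simulation in exercise 4.54 can be approximated in Python. This is a sketch under stated assumptions rather than an exact replica of the Excel functions: the function name and the seed argument are hypothetical, random.gauss stands in for NORMINV(RAND(), ...), and Python's round treats exact .5 ties differently from Excel's ROUND (such ties are vanishingly rare with continuous random values).

```python
import random
import statistics

def simulate_scores(n=100, mu=70, sigma=10, seed=None):
    """Draw n rounded values from a normal distribution with mean mu and s.d. sigma."""
    rng = random.Random(seed)
    return [round(rng.gauss(mu, sigma)) for _ in range(n)]

scores = simulate_scores(seed=1)  # change or omit the seed for a new sample
print(statistics.mean(scores), statistics.stdev(scores))  # compare with 70 and 10
print(min(scores), max(scores))                            # compare with the Empirical Rule
```

Calling simulate_scores() repeatedly without a seed plays the role of pressing F9 in the worksheet.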
GROUPED DATA

Note: In each of the following tables, the upper bin limit is excluded from that bin, but is included as the lower limit of the next bin.

4.55 This table shows the fertility rate (children born per woman) in 191 world nations. (a) From the grouped data, calculate the mean, standard deviation, and coefficient of variation for each year. Show your calculations clearly in a worksheet. (b) Write a concise comparison of central tendency and dispersion for fertility rates in the two years. (c) What additional information would you have gained by having the raw data? (d) What benefit arises from having only a summary table? Fertile

Fertility Rates in World Nations

From    To      1990   2000
1.00    2.00    39     54
2.00    3.00    35     44
3.00    4.00    27     23
4.00    5.00    26     22
5.00    6.00    24     26
6.00    7.00    30     16
7.00    8.00    9      5
8.00    9.00    1      1
Total           191    191

Source: World Health Organization.
4.56 This table shows the distribution of winning times in the Kentucky Derby (a horse race) over 74 years. Times are recorded to the nearest 1/5 second (e.g., 121.4). (a) From the grouped data, calculate the mean. Show your calculations clearly in a worksheet. (b) What additional information would you have gained by having the raw data? (c) Do you think it likely that the distribution of times within each interval might not be uniform? Why would that matter? Derby

Kentucky Derby Winning Times (seconds)

From   To     f
119    120    1
120    121    5
121    122    16
122    123    22
123    124    12
124    125    8
125    126    5
126    127    3
127    128    2
Total         74

Source: Sports Illustrated 2004 Sports Almanac.

Margin answer: a. Mean = 123.0
4.57 This table shows the number of heating degree-days in December in 35 U.S. cities. A heating degree-day is the sum over all days in the month of the difference between 65 degrees Fahrenheit and the daily mean temperature of each city. (a) From the grouped data, calculate the mean. Show your calculations clearly in a worksheet. (b) What additional information would you have gained by having the raw data? (c) Why do you suppose that unequal class intervals were used in this table? Does it affect the calculations you did? HeatGrp

December Heating Degree-Days in 35 Cities

From    To      f
0       250     2
250     500     5
500     1,000   14
1,000   2,000   14
Total           35

Source: U.S. Bureau of the Census, Statistical Abstract of the United States, 1986, p. 219.
4.58 This table shows the life expectancy at birth (in years) in 153 world nations. (a) From the grouped data, calculate the mean, standard deviation, and coefficient of variation. Show your calculations clearly in a worksheet. (b) Write a concise summary of central tendency and dispersion for life expectancy in world nations. (c) What additional information would you have gained by having the raw data? (d) What is the advantage of having only a summary table? Life

Life Expectancy at Birth in Large Nations

From   To    f
20     30    1
30     40    9
40     50    20
50     60    17
60     70    36
70     80    67
80     90    3
Total        153

Source: U.S. Central Intelligence Agency, The World Factbook, 2003. Omits nations with population under 1 million.

Margin answer: a. Mean = 64.02, S.D. = 13.37, CV = 20.9%

4.59 The self-reported number of hours worked per week by 204 top executives is given below. (a) Estimate the mean, standard deviation, and coefficient of variation, using an Excel worksheet to organize your calculations. (b) Do the unequal class sizes hamper your calculations? Why do you suppose that was done? Work

Weekly Hours of Work by Top Executives

From   To     f
40     50     12
50     60     116
60     80     74
80     100    2
Total         204

Source: Lamalie Associates, Lamalie Report on Top Executives of the 1990s, p. 11.
a. Mean 14.25 S.D. 13.67 CV 95.9%
= = =
4.60 The table below shows the self-reported number of books read annually by 204 top executives. (a) Estimate the mean, standard deviation, and coefficient of variation, using an Excel worksheet to organize your calculations. (b) Do the unequal class sizes hamper your calculations? Why do you suppose that was done?
Books Read Annually by Top Executives (data file: Books)
From     To       f
   0      3      23
   3      6      46
   6     10      37
  10     20      44
  20     30      31
  30     50      23
Total           204
Source: Lamalie Associates, Lamalie Report on Top Executives of the 1990s, p. 12.
4.61 (a) Which sample statistics, if any, can you obtain from the following data on farm size? Explain. (b) Why were unequal class intervals and open-end classes used?
Distribution of U.S. Farms by Size (acres) (data file: FarmSize)
From       To          Number of Farms (000)
    0      10            154
   10      50            411
   50      100           295
  100      180           298
  180      260           165
  260      500           238
  500      1,000         176
1,000      2,000         101
2,000      and over       75
Total                  1,913
Source: U.S. Bureau of the Census, Statistical Abstract of the United States, 2002.
4.62 How long does it take to fly from Denver to Atlanta on Delta Airlines? The table below shows 56 observations on flight times (in minutes) for the first week of March 2005. (a) Use the grouped data formula to estimate the mean and standard deviation. (b) Using the ungrouped data (not shown) the ungrouped sample mean is 161.63 minutes and the ungrouped standard deviation is 8.07 minutes. How close did your grouped estimates come? (c) Why might flight times not be uniformly distributed within the second and third class intervals? (Source: www.bts.gov.)
Flight Times DEN to ATL (minutes) (data file: DeltaAir)
From     To     frequency
 140    150       1
 150    160      25
 160    170      24
 170    180       4
 180    190       2
Total            56
Answer:   Grouped   Ungrouped
Mean      161.61    161.63
SD        7.855     7.926
DO-IT-YOURSELF SAMPLING
4.63 (a) Record the points scored by the winning team in 50 college football games played last weekend (if it is not football season, do the same for basketball or another sport of your choice). If you can’t find 50 scores, do the best you can. (b) Make a dot plot. What does it tell you? (c) Make a frequency distribution and histogram. Describe the histogram. (d) Calculate the mean, median, and mode. Which is the best measure of central tendency? Why? (e) Calculate the standard deviation and coefficient of variation. (f) Standardize the data. Are there any outliers? (g) Find the quartiles. What do they tell you? (h) Make a box plot. What does it tell you?
4.64 (a) Record the length (in minutes) of 50 movies chosen at random from a movie guide (e.g., Leonard Maltin’s Movie and Video Guide). Include the name of each movie. (b) Make a dot plot. What does it tell you? (c) Make a frequency distribution and histogram. Describe the histogram. (d) Calculate the mean, median, and mode. Which is the best measure of central tendency? Why? (e) Calculate the standard deviation and coefficient of variation. (f) Standardize the data. Are there any outliers? (g) Find the quartiles and coefficient of quartile variation. What do they tell you? (h) Make a box plot. What does it tell you?
SCATTER PLOTS AND CORRELATION
Note: Exercises 4.65 and 4.66 refer to data sets on the CD.
4.65 (a) Make an Excel scatter plot of X = 1990 assault rate per 100,000 population and Y = 2004 assault rate per 100,000 population for the 50 U.S. states. (b) Use Excel’s =CORREL function to find the correlation coefficient. (c) What do the graph and correlation coefficient say about assault rates by state for these two years? (d) Use MegaStat or Excel to find the mean, median, and standard deviation of assault rates in these two years. What does this comparison tell you? (data file: Assault)
4.66 (a) Make an Excel scatter plot of X = airspeed (nautical miles per hour) and Y = cockpit noise level (decibels) for 61 aircraft flights. (b) Use Excel’s =CORREL function to find the correlation coefficient. (c) What do the graph and correlation coefficient say about the relationship between airspeed and cockpit noise? Why might such a relationship exist? Optional: Fit an Excel trend line to the scatter plot and interpret it. (data file: CockpitNoise)
Answer: Noise = 0.0765 Airspeed + 64.23 (r = .9459)
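For readers working outside Excel, the sample correlation coefficient that =CORREL returns can be sketched from its definition. The Python function below is an illustration only. The function name correl and the five (x, y) pairs are hypothetical and are not taken from the Assault or CockpitNoise data files.

```python
import math


def correl(x, y):
    """Sample correlation coefficient r for paired data.

    r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correl(x, y)
print(f"r = {r:.4f}")
```

The result always lies between −1 and +1. Its sign gives the direction of the linear relationship, and its magnitude gives the strength.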
MINI-PROJECTS
4.67 (a) Choose a data set and prepare a brief, descriptive report. You may use any computer software you wish (e.g., Excel, MegaStat, Visual Statistics, MINITAB). Include relevant worksheets or graphs in your report. If some questions do not apply to your data set, explain why not. (b) Discuss any possible weaknesses in the data. (c) Sort the data. (d) Make a dot plot. What does it tell you? (e) Make a histogram. Describe its shape. (f) Calculate the mean, median, and mode(s) and use them to describe central tendency for this data set. (g) Calculate the standard deviation and coefficient of variation. (h) Standardize the data and check for outliers. (i) Compare the data with the Empirical Rule. Discuss. (j) Calculate the quartiles and interpret them. (k) Make a box plot. Describe its appearance.
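Steps (g), (h), and (j) of such a report can be sketched in Python with the standard library, for anyone not using Excel or MegaStat. The ten data values below are hypothetical and are not drawn from any data set in this chapter. The sketch assumes a common convention: a standardized value of 2 or more in absolute value is flagged as unusual, and one of 3 or more as an outlier. Quartile conventions differ among textbooks and software, so the "inclusive" method used here may not reproduce the results of other tools exactly.

```python
import statistics

# Hypothetical sample, for illustration only
data = [12, 15, 15, 17, 18, 19, 21, 22, 24, 60]

mean = statistics.mean(data)
sd = statistics.stdev(data)          # sample standard deviation (n - 1)
cv = 100 * sd / mean                 # coefficient of variation, in percent

# Standardize: z = (x - mean) / s
z = [(x - mean) / sd for x in data]

# Assumed convention: |z| >= 2 is unusual, |z| >= 3 is an outlier
unusual = [x for x, zi in zip(data, z) if abs(zi) >= 2]
outliers = [x for x, zi in zip(data, z) if abs(zi) >= 3]

# Quartiles by linear interpolation between sorted data values
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(f"mean = {mean:.2f}, s = {sd:.2f}, CV = {cv:.1f}%")
print("unusual:", unusual, "outliers:", outliers)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

In this sample the value 60 is flagged as unusual but not as an outlier. With only ten observations, no standardized value can reach 3, so the outlier check is more informative in larger samples.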
DATA SET A
Advertising Dollars as Percent of Sales in Selected Industries (n = 30) (data file: Ads)
Industry                              Percent
Accident and health insurance           0.9
Apparel and other finished products     5.5
Beverages                               7.4
Cable and pay TV services               1.3
Computer data processing                1.1
Computer storage devices                1.8
Cookies and crackers                    3.5
Drug and proprietary stores             0.9
Electric housewares and fans            6.4
Equipment rental and leasing            2.0
Footwear except rubber                  4.5
Greeting cards                          3.5
Grocery stores                          1.1
Hobby, toy, and games shops             3.0
Ice cream and frozen desserts           2.0
Jewelry stores                          4.6
Management services                     1.0
Millwork, veneer, and plywood           3.5
Misc. furniture and fixtures            2.3
Mortgage bankers and loans              4.9
Motorcycles, bicycles, and parts        1.7
Paints, varnishes, lacquers             3.1
Perfume and cosmetics                  11.9
Photographic equipment                  4.3
Racing and track operations             2.5
Real estate investment trusts           3.8
Shoe stores                             3.0
Steel works and blast furnaces          1.9
Tires and inner tubes                   1.8
Wine, brandy, and spirits              11.3
Source: George E. Belch and Michael A. Belch, Advertising and Promotion, pp. 219–220. Copyright © 2004 Richard D. Irwin. Used with permission of McGraw-Hill Companies, Inc.
DATA SET B
Maximum Rate of Climb for Selected Piston Aircraft (n = 54) (data file: ClimbRate)
Manufacturer/Model                 Year    Climb (ft./min.)
AMD CH 2000                        2000      820
Beech Baron 58                     1984    1,750
Beech Baron 58P                    1984    1,475
Beech Baron D55                    1968    1,670
Beech Bonanza B36TC                1982    1,030
Beech Duchess                      1982    1,248
Beech Sierra                       1972      862
Bellanca Super Viking              1973    1,840
Cessna 152                         1978      715
Cessna 170B                        1953      690
Cessna 172R Skyhawk                1997      720
Cessna 172RG Cutlass               1982      800
Cessna 1825 Skylane                1997      865
Cessna 182Q Skylane                1977    1,010
Cessna 310R                        1975    1,662
Cessna 337G Skymotor II            1975    1,100
Cessna 414A                        1985    1,520
Cessna 421B                        1974    1,850
Cessna Cardinal                    1970      840
Cessna P210                        1982      945
Cessna T210K                       1970      930
Cessna T303 Crusader               1983    1,480
Cessna Turbo Skylane RG            1979    1,040
Cessna Turbo Skylane T182T         2001    1,060
Cessna Turbo Stationair TU206      1981    1,010
Cessna U206H                       1998    1,010
Cirrus SR20                        1999      946
Diamond C1 Eclipse                 2002    1,000
Extra Extra 400                    2000    1,400
Lancair Columbia 300               1998    1,340
Liberty XL-2                       2003    1,150
Maule Comet                        1996      920
Mooney 231                         1982    1,080
Mooney Eagle M205                  1999    1,050
Mooney M20C                        1965      800
Mooney Ovation 2 M20R              2000    1,150
OMF Aircraft Symphony              2002      850
Piper 125 TriPacer                 1951      810
Piper Archer III                   1997      667
Piper Aztec F                      1980    1,480
Piper Dakota                       1979      965
Piper Malibu Mirage                1998    1,218
Piper Malibu Mirage                1989    1,218
Piper Saratoga II TC               1998      818
Piper Saratoga SP                  1980    1,010
Piper Seneca III                   1982    1,400
Piper Seneca V                     1997    1,455
Piper Seneca V                     2002    1,455
Piper Super Cab                    1975      960
Piper Turbo Lance                  1979    1,000
Rockwell Commander 114             1976    1,054
Sky Arrow 650 TC                   1998      750
Socata TB20 Trinidad               1999    1,200
Tiger AG-5B                        2002      850
Source: Flying Magazine (various issues from 1997 to 2002).
DATA SET C
December Heating Degree-Days for Selected U.S. Cities (n = 35) (data file: Heating)
City                 Degree-Days
Albuquerque              911
Baltimore                884
Bismarck               1,538
Buffalo                1,122
Charleston               871
Charlotte                694
Cheyenne               1,107
Chicago                1,156
Cleveland              1,051
Concord                1,256
Detroit                1,132
Duluth                 1,587
El Paso                  639
Hartford               1,113
Honolulu                   0
Indianapolis           1,039
Jackson                  513
Los Angeles              255
Miami                     42
Mobile                   382
Nashville                747
New Orleans              336
Norfolk                  667
Oklahoma City            778
Omaha                  1,172
Philadelphia             915
Phoenix                  368
Providence             1,014
Salt Lake City         1,076
San Francisco            490
Seattle                  744
Sioux Falls            1,404
St. Louis                955
Washington, D.C.         809
Wichita                  949
Source: U.S. Bureau of the Census, Statistical Abstract of the United States.
Note: A degree-day is the sum over all days in the month of the difference between 65 degrees Fahrenheit and the daily mean temperature of each city.
DATA SET D
Commercial Bank Profit as Percent of Revenue, 2003 (data file: Banks)
Bank                             Percent
AmSouth Bancorp                     21
Bank of America Corp.               22
Bank of New York Co.                18
Bank One Corp.                      16
BankNorth Group                     22
BB&T Corp.                          17
Charter One Financial               22
Citigroup                           19
Comerica                            20
Compass Bancshares                  19
Fifth Third Bancorp                 27
First Tenn. Natl. Corp.             18
FleetBoston                         18
Hibernia Corp.                      20
Huntington Bancshares               16
J.P. Morgan Chase                   15
KeyCorp                             16
M&T Bank Corp.                      19
Marshall & Ilsley Corp.             20
MBNA                                20
Mellon Financial Group              15
National City Corp.                 22
National Commerce Finan             20
North Fork Bancorp                  31
Northern Trust Corp.                16
PNC Financial Services Group        17
Popular                             18
Provident Financial Group            6
Providian Financial                  7
Regions Financial                   18
SouthTrust Corp.                    23
State St. Corp.                     13
SunTrust Banks                      19
Synovus Financial Corp.             16
U.S. Bancorp                        24
Union Planters Corp.                21
Wachovia Corp.                      19
Wells Fargo                         20
Zions Bancorp                       18
Source: Fortune 149, no. 7 (April 5, 2004). Copyright © 2004 Time Inc. All rights reserved.
Note: These banks are in the Fortune 1000 companies.
DATA SET E
Caffeine Content of Randomly Selected Beverages (n = 32) (data file: Caffeine)
Company/Brand                      mg/oz.
Barq’s Root Beer                    1.83
Coca-Cola Classic                   2.83
Cool from Nestea                    1.33
Cool from Nestea Rasberry Cooler    0.50
Diet A&W Cream Soda                 1.83
Diet Ale 8                          3.67
Diet Code Red                       4.42
Diet Dr. Pepper                     3.42
Diet Inca Kola                      3.08
Diet Mountain Dew                   4.58
Diet Mr. Pibb                       3.33
Diet Pepsi-Cola                     3.00
Inca Kola                           3.08
KMX (Blue)                          0.00
Mello Yello Cherry                  4.25
Mello Yello Melon                   4.25
Mountain Dew                        4.58
Mr. Pibb                            3.33
Nestea Earl Grey                    4.17
Nestea Peach                        1.33
Nestea Rasberry                     1.33
Nestea Sweet                        2.17
Pepsi One                           4.58
RC Edge                             5.85
Royal Crown Cola                    3.60
Snapple Diet Peach Tea              2.63
Snapple Lemon Tea                   2.63
Snapple Lightning (Black Tea)       1.75
Snapple Sun Tea                     0.63
Snapple Sweet Tea                   1.00
Sunkist Orange Soda                 3.42
Vanilla Coke                        2.83
Source: National Soft Drink Association (www.nsda.org).
DATA SET F
Super Bowl Scores 1967–2007 (n = 41 games) (data file: SuperBowl)
Year    Teams and Scores
1967    Green Bay 35, Kansas City 10
1968    Green Bay 33, Oakland 14
1969    NY Jets 16, Baltimore 7
1970    Kansas City 23, Minnesota 7
1971    Baltimore 16, Dallas 13
1972    Dallas 24, Miami 3
1973    Miami 14, Washington 7
1974    Miami 24, Minnesota 7
1975    Pittsburgh 16, Minnesota 6
1976    Pittsburgh 21, Dallas 17
1977    Oakland 32, Minnesota 14
1978    Dallas 27, Denver 10
1979    Pittsburgh 35, Dallas 31
1980    Pittsburgh 31, LA Rams 19
1981    Oakland 27, Philadelphia 10
1982    San Francisco 26, Cincinnati 21
1983    Washington 27, Miami 17
1984    LA Raiders 38, Washington 9
1985    San Francisco 38, Miami 16
1986    Chicago 46, New England 10
1987    NY Giants 39, Denver 20
1988    Washington 42, Denver 10
1989    San Francisco 20, Cincinnati 16
1990    San Francisco 55, Denver 10
1991    NY Giants 20, Buffalo 19
1992    Washington 37, Buffalo 24
1993    Dallas 52, Buffalo 17
1994    Dallas 30, Buffalo 13
1995    San Francisco 49, San Diego 26
1996    Dallas 27, Pittsburgh 17
1997    Green Bay 35, New England 21
1998    Denver 31, Green Bay 24
1999    Denver 34, Atlanta 19
2000    St. Louis 23, Tennessee 16
2001    Baltimore 34, New York 7
2002    New England 20, St. Louis 17
2003    Tampa Bay 48, Oakland 21
2004    New England 32, Carolina 29
2005    New England 24, Philadelphia 21
2006    Pittsburgh 21, Seattle 10
2007    Indianapolis 29, Chicago 17
Source: Sports Illustrated 2004 Sports Almanac, Detroit Free Press, and www.cbs.sportsline.com.
DATA SET G
Property Crimes Per 100,000 Residents (n = 68 cities) (data file: Crime)
City and State            Crime
Albuquerque, NM           8,515
Anaheim, CA               2,827
Anchorage, AK             4,370
Arlington, TX             5,615
Atlanta, GA              10,759
Aurora, CO                5,079
Austin, TX                6,406
Birmingham, AL            7,030
Boston, MA                4,986
Buffalo, NY               5,791
Charlotte, NC             7,484
Chicago, IL               6,333
Cincinnati, OH            5,694
Cleveland, OH             5,528
Colorado Springs, CO      4,665
Columbus, OH              8,247
Corpus Christi, TX        6,266
Dallas, TX                8,201
Denver, CO                4,685
Detroit, MI               8,163
El Paso, TX               5,106
Fort Worth, TX            6,636
Fresno, CA                6,145
Honolulu, HI              4,671
Houston, TX               6,084
Indianapolis, IN          4,306
Jacksonville, FL          6,118
Kansas City, MO           9,882
Las Vegas, NV             4,520
Lexington, KY             5,315
Long Beach, CA            3,407
Los Angeles, CA           3,306
Louisville, KY            5,102
Memphis, TN               6,958
Mesa, AZ                  5,590
Miami, FL                 8,619
Milwaukee, WI             6,886
Minneapolis, MN           7,247
Nashville, TN             7,276
New Orleans, LA           6,404
New York, NY              2,968
Newark, NJ                6,068
Oakland, CA               6,820
Oklahoma City, OK         8,464
Omaha, NE                 5,809
Philadelphia, PA          5,687
Phoenix, AZ               6,888
Pittsburgh, PA            5,246
Portland, OR              6,897
Raleigh, NC               6,327
Riverside, CA             3,610
Sacramento, CA            5,859
San Antonio, TX           6,232
San Diego, CA             3,405
San Francisco, CA         4,859
San Jose, CA              2,363
Santa Ana, CA             3,008
Seattle, WA               8,397
St. Louis, MO            11,765
St. Paul, MN              6,215
Stockton, CA              5,638
Tampa, FL                 8,675
Toledo, OH                6,721
Tucson, AZ                8,079
Tulsa, OK                 6,234
Virginia Beach, VA        3,438
Washington, DC            6,434
Wichita, KS               5,733
Source: Statistical Abstract of the United States, 2002.
DATA SET H
Size of Whole Foods Stores (n = 171) (data file: WholeFoods)
Location (Store Name)              Sq. Ft.
Albuquerque, NM (Academy)          33,000
Alexandria, VA (Annandale)         29,811
Ann Arbor, MI (Washtenaw)          51,300
…                                  …
Winter Park, FL                    20,909
Woodland Hills, CA                 28,180
Wynnewood, PA                      14,000
Source: www.wholefoodsmarket.com/stores/. Note: 165 stores omitted for brevity; see data file.
RELATED READING
Barnett, Vic; and Toby Lewis. Outliers in Statistical Data. 3rd ed. John Wiley and Sons, 1994.
Blyth, C. R. “Minimizing the Sum of Absolute Deviations.” The American Statistician 44, no. 4 (November 1990), p. 329.
Freund, John E.; and Benjamin M. Perles. “A New Look at Quartiles of Ungrouped Data.” The American Statistician 41, no. 3 (August 1987), pp. 200–203.
Hoaglin, David C.; Frederick Mosteller; and John W. Tukey. Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, 1983.
Pukelsheim, Friedrich. “The Three Sigma Rule.” The American Statistician 48, no. 2 (May 1994), pp. 88–91.
Little, Roderick J. A.; and Donald B. Rubin. Statistical Analysis with Missing Data. 2nd ed. John Wiley and Sons, 2002.
Tukey, John W. Exploratory Data Analysis. Addison-Wesley, 1977.
LearningStats Unit 04 Describing Data
LearningStats Unit 04 uses interesting data sets, samples, and simulations to illustrate the tools of data analysis. Your instructor may assign a specific project, but you can work on the others if they sound interesting.
Topic                     LearningStats Modules
Overview                  Describing Data; Using MegaStat; Using Visual Statistics; Using MINITAB
Descriptive statistics    Basic Statistics; Quartiles; Box Plots; Coefficient of Variation; Grouped Data; Stacked Data; Skewness and Kurtosis
Case studies              Brad’s Bowling Scores; Aircraft Cockpit Noise; Batting Averages; Bridget Jones’s Diary; Sample Variation; Sampling NYSE Stocks
Sampling methods          Simple Random Sampling; Systematic Sampling; Cluster Sampling
Student projects          College Tuition; Per Capita Income
Formulas                  Table of Formulas; Significant Digits
Key: PowerPoint, Word, Excel
Visual Statistics
Visual Statistics is a software tool that is included on your CD, to be installed on your own computer. The CD will guide you through the installation process. Visual Statistics consists of 21 learning modules. Its purpose is to help you learn concepts on your own, through experimentation, individual learning exercises, and team projects. It consists of software with graphical displays, customized experiments, well-indexed help files (definitions, formulas, examples), and a complete textbook (in .PDF format). Each chapter has learning exercises (basic, intermediate, advanced), learning projects (individual, team), a self-evaluation quiz, a glossary of terms, and solutions.
Visual Statistics modules 1, 3, and 6 are designed to help you
• Recognize and interpret different types of histograms (frequency, cumulative, relative).
• Realize how histogram setup can affect one’s perception of the data.
• Be able to visualize common shape measures (centrality, dispersion, skewness, kurtosis).
• Recognize discrete and continuous random variables.
• Learn to infer a population’s shape, mean, and standard deviation from a sample.
• See how outliers affect histograms.
Visual Statistics Modules on Describing Data
Module    Module Name
1         Univariate Data Analysis
3         Shapes of Distributions
6         Random Samples
MegaStat for Excel by J. B. Orris of Butler University is an Excel add-in that is included on the CD, to be installed on your own computer. The CD will guide you through the installation process. MegaStat goes beyond Excel’s built-in statistical functions to offer a full range of statistical tools to help you analyze data, create graphs, and perform calculations. MegaStat examples are shown throughout this textbook.
After installing MegaStat, you should see MegaStat appear on the left when you click the Add-Ins tab on the top menu bar (right side in the illustration above). If not, click the Office icon in the upper left and then click the Excel Options button (yellow highlight in the illustration below). At the bottom, find Manage Excel Add-Ins, click the Go button, and check the MegaStat add-in box. The CD includes further instructions.
EXAM REVIEW QUESTIONS FOR CHAPTERS 1–4
1. Which type of statistic (descriptive, inferential) is each of the following?
a. Estimating the default rate on all U.S. mortgages from a random sample of 500 loans.
b. Reporting the percent of students in your statistics class who use Verizon.
c. Using a sample of 50 iPhones to predict the average battery life in typical usage.
2. Which is not an ethical obligation of a statistician? Explain.
a. To know and follow accepted procedures.
b. To ensure data integrity and accurate calculations.
c. To support client wishes in drawing conclusions from the data.
3. “Driving without a seat belt is not risky. I’ve done it for 25 years without an accident.” This best illustrates which fallacy?
a. Unconscious bias.
b. Conclusion from a small sample.
c. Post hoc reasoning.
4. Which data type (categorical, numerical) is each of the following?
a. Your current credit card balance.
b. Your college major.
c. Your car’s odometer mileage reading today.
5. Give the type of measurement (nominal, ordinal, interval, ratio) for each variable.
a. Length of time required for a randomly-chosen vehicle to cross a toll bridge.
b. Student’s ranking of five cell phone service providers.
c. The type of charge card used by a customer (Visa, MasterCard, AmEx, Other).
6. Tell if each variable is continuous or discrete.
a. Tonnage carried by an oil tanker at sea.
b. Wind velocity at 7 o’clock this morning.
c. Number of text messages you received yesterday.
7. To choose a sample of 12 students from a statistics class of 36 students, which type of sample (simple random, systematic, cluster, convenience) is each of these?
a. Picking every student who was wearing blue that day.
b. Using Excel’s =RANDBETWEEN(1,36) to choose students from the class list.
c. Selecting every 3rd student starting from a randomly-chosen position.
8. Which of the following is not a reason for sampling? Explain.
a. The destructive nature of some tests.
b. High cost of studying the entire population.
c. Bias inherent in Excel’s random numbers.
9. Which statement is correct? Why not the others?
a. Likert scales are interval if scale distances are meaningful.
b. Cross-sectional data are measured over time.
c. A census is always preferable to a sample.
10. Which statement is false? Explain.
a. Sampling error can be reduced by using appropriate data coding.
b. Selection bias means that respondents are not typical of the target population.
c. Simple random sampling requires a list of the population.
11. The management of a theme park obtained a random sample of the ages of 36 riders of its Space Adventure Simulator. (a) Make a nice histogram. (b) Did your histogram follow Sturges’ Rule? If not, why not? (c) Describe the distribution of sample data. (d) Make a dot plot of the data. (e) What can be learned from each display (dot plot and histogram)?
39  46  15  38  39  47  50  61  17  40  54  36
16  18  34  42  10  16  16  13  38  14  16  56
17  18  24  17  12  21   8  18  13  13  53  10
12. Which one of the following is true? Why not the others?
a. Histograms are useful for visualizing correlations.
b. Pyramid charts are generally preferred to bar charts.
c. A correlation coefficient can be negative.
13. Which data would be most suitable for a pie chart? Why not the others?
a. Presidential vote in the last election by party (Democratic, Republican, Other).
b. Retail prices of six major brands of color laser printers.
c. Labor cost per vehicle for ten major world automakers.
14. Find the mean, standard deviation, and coefficient of variation for X = 5, 10, 20, 10, 15.
15. Here are the ages of a random sample of 20 CEOs of Fortune 500 U.S. corporations. (a) Find the mean, median, and mode. (b) Discuss advantages and disadvantages of each of these measures of central tendency for this data set. (c) Find the quartiles and interpret them. (d) Sketch a boxplot and describe it. Source: http://www.forbes.com
57  56  58  46  70  62  55  60  59  64
62  67  61  55  53  58  63  51  52  77
16. A consulting firm used a random sample of 12 CIOs (Chief Information Officers) of large businesses to examine the relationship (if any) between salary and years of service in the firm. (a) Make a scatter plot and describe it. (b) Calculate a correlation coefficient and interpret it. Years (X)
4
Salary (Y)
1 33
15 12 9
15
8 14 3
11 13 2
5
5
144
61
8 128
10 79
1 140
6
17 116
88
170
17. Which statement is true? Why not the others?
a. We expect the median to exceed the mean in positively-skewed data.
b. The geometric mean is not helpful when there are negative data values.
c. The midrange is resistant to outliers.
18. Which statement is false? Explain.
a. If µ = 52 and σ = 15, then X = 81 would be an outlier.
b. If the data are from a normal population, about 68% of the values will be within µ ± σ.
c. If µ = 640 and σ = 128 then the coefficient of variation is 20 percent.
19. Which is not a characteristic of using a log scale to display time series data? Explain.
a. A log scale helps if we are comparing changes in two time series of dissimilar magnitude.
b. General business audiences find it easier to interpret a log scale.
c. If you display data on a log scale, equal distances represent equal ratios.