Introducing R to Demographers with Applications for Survival Analysis

Roland Rau [1]
Max Planck Institute for Demographic Research
Konrad Zuse Str. 1, D-18057 Rostock, Germany

1st February 2005
[1] Tel. +49-381-2081109; email: [email protected]
Copyright © 2004 Roland Rau
Contents

Part I: Learning the Basics of R by Examples

0  Preliminaries
1  Aim of this Document
2  What is R?
   2.1  Is it worth learning a new language / stats package?
   2.2  The pros of R
   2.3  The cons of R
3  First Interactive Steps
   3.1  A Powerful Pocket Calculator
   3.2  Assignments
   3.3  Functions
   3.4  Graphs
   3.5  Getting help
   3.6  Using an Editor
   3.7  Exercises
4  The Representation of Data in R
   4.1  Basic Units: Vectors
        4.1.1  Generating Data & Types of Data
        4.1.2  Referencing Elements of A Vector
   4.2  Matrices, Arrays, Dataframes, Lists
        4.2.1  Matrices
        4.2.2  Arrays
        4.2.3  Dataframes
        4.2.4  Lists
   4.3  Missing values and other odd numerical values
   4.4  Requesting Information About Data Structures
   4.5  Exercises
5  Distributions in R
   5.1  Introduction
   5.2  (Cumulative) Distribution Function p
   5.3  Density Function d
   5.4  Quantile Function q
   5.5  Generation of Random Numbers r
        5.5.1  Digression
   5.6  Overview of Functions
   5.7  The Gompertz Distribution
        5.7.1  Introduction
        5.7.2  Hazard Function
        5.7.3  Survivor Function
        5.7.4  Cumulative Hazard Function
        5.7.5  Density Function
        5.7.6  Cumulative Distribution Function
        5.7.7  Quantile Function
        5.7.8  Generation of Random Numbers
   5.8  Generating Random Numbers from Given Data — sample
   5.9  Further Reading
   5.10 Exercises
6  Data Input/Output
   6.1  Reading Data
        6.1.1  Introduction
        6.1.2  Setting the Working Directory
        6.1.3  Reading Text Data
        6.1.4  Reading Binary Data
   6.2  Data Editor
   6.3  Writing Data
   6.4  Further Reading
   6.5  Exercises
7  Programming with R
   7.1  Introduction
   7.2  Grouping Expressions
   7.3  Flow-Control
        7.3.1  Conditional Execution
        7.3.2  Repetitive Execution
   7.4  Writing Functions
   7.5  Problems with loops
   7.6  Further Reading
   7.7  Exercises
8  Graphing Data
   8.1  Introduction
   8.2  Basic plotting function: plot
   8.3  Histograms — hist
   8.4  Barplot — barplot
   8.5  Making boxplots — boxplot
   8.6  QQ-Plots — qqplot / qqnorm
   8.7  Further Plotting Commands
   8.8  Further Reading
   8.9  Exercises
9  Simple Statistical Models
   9.1  Introduction
   9.2  Plotting the Data
   9.3  Estimating the Linear Model and Accessing the Components of the Result
   9.4  Regression Diagnostics
   9.5  Technical Digression: using matrix language to estimate the coefficients
   9.6  Further Reading
   9.7  Exercises
10 More Useful Functions
   10.1 Crosstabulations using table
   10.2 How many different values exist? — unique
   10.3 Splitting Data — split and cut
   10.4 Sorting Data — sort and order
        10.4.1 Sorting by One Variable
        10.4.2 Sorting by More Than One Variable
11 Further Reading
   11.1 Background on R
   11.2 Introductions
   11.3 Graphics
   11.4 Programming in R/S
   11.5 ...

Part II: Using R for Survival Analysis
List of Figures

3.1  A first (nonsense) plot of points
3.2  An example of a lineplot
3.3  An example of a histogram with ~100 bins for X ~ N(µ = 0, σ = 1)
3.4  How to access the R manuals and other helpful tools
4.1  Understanding and Referencing an Array
5.1  Density of A Standard Normal Distribution
5.2  Overlapping Gompertz Random Numbers with the Corresponding Density
8.1  Constructing a Plot Piece By Piece
8.2  How to plot a histogram
8.3  Barplot for Women and Men Dying From Lung Diseases in the UK
8.4  Constructing a Boxplot for Weights of Chickens (left) and Ages at Death of the Swedish Birth Cohort from 1890 (right)
8.5  An Example of a QQ-Plot
8.6  Displaying Data in Histogram-like Plot
8.7  Boxplots
8.8  Relating Histograms with Boxplots
9.1  Is there a linear relationship between height and weight? A first glance at the data
9.2  Regression Diagnostics

List of Tables

5.1  Overview of Built-In Distributions
Part I Learning the Basics of R by Examples
Chapter 0
Preliminaries

The current version of this document represents a first draft; the introduction for the first part still has to be written, and the (existing) notes for Survival Analysis (second part) have to be put together and included. In addition, I am also planning to improve the existing chapters. If you have any suggestions or comments, I am grateful for any feedback. You can reach me via email at [email protected]

Please note that the electronic version of this document has been prepared in a way that one can navigate through chapters, figures and tables just by clicking on them with the mouse. This extra visualization is only present in the electronic document; in print, the clickable cross-references cannot be distinguished from other parts of the text. This document has been prepared using:
> version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R
Chapter 1
Aim of this Document

The document is divided into two parts.

• The first part presents a smooth introduction to R in which the main features of the R language are presented. It starts with some first interactive steps to give a first impression of the language. Each subsequent chapter then gives an overview of a specific topic: representation of data, distributions in R, reading and writing data, basic programming, and graphing data. The last chapter of the first part introduces how statistical analyses are performed in R, using simple examples and linear regression.

• The second part still has to be written. It will heavily rely on the weekly lab sessions I have given during the Winter Semester 2004/2005 at the "International Max Planck Research School for Demography" to supplement the course "Survival Analysis" taught by Jutta Gampe and Francesco Lagona. For each lab session, I have already written roughly 8–15 pages. The topics which will be covered are (not in final order):
  – defining survival objects
  – setting up the likelihood functions
  – non-parametric estimation (Kaplan-Meier)
  – comparing survival curves (logrank test)
  – the Cox model
  – time-varying covariates
  – parametric survival models (Weibull and Gompertz; incl. writing down the likelihood function for data which are left-truncated and right-censored; optimizing the function; finding standard errors for the parameter estimates and estimating confidence bands)
  – regression diagnostics: Cox-Snell residuals, Martingale residuals; checking the fit of the model, checking the functional form of the covariates, checking the proportionality assumption
  – discrete-time survival models (logistic regression, embedded into generalized linear models (GLMs))
  – accelerated failure time (AFT) models (something still has to be written about these)
Chapter 2
What is R?

2.1 Is it worth learning a new language / stats package?

2.2 The pros of R

2.3 The cons of R
Chapter 3
First Interactive Steps

3.1 A Powerful Pocket Calculator
R is most likely a new environment for most of you. So we want to make the start relatively smooth. You can think of R as a powerful pocket calculator. You simply enter arithmetic expressions and R will return the result for you.
> 2 + 3
[1] 5
> 4 * 15
[1] 60
> 2 - 3
[1] -1
> 4^5
[1] 1024
> 1e+06 - 4^3
[1] 1e+06
> (log(27) - (23^75))/(-34.6)
[1] 3.895e+100
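Two further arithmetic operators that come in handy are integer division (%/%) and the modulo operator (%%):

> 17 %/% 5
[1] 3
> 17 %% 5
[1] 2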
Actually, R also allows you to work with complex numbers. You enter them as shown in the following example, but working with complex numbers will not be required in our course.
> 27 + (0+5i)
[1] 27+5i
3.2 Assignments
So far, R is just an overgrown calculator. If this was all, R would be far from being the powerful program it actually is. We will therefore introduce more and more features in the next sections. One important feature of any programming language is the possibility to assign values to variables. Of course, this is also possible in R. Indeed, there are two syntactically correct (and equivalent) ways to assign a value to a variable: either you use = like in most other programming languages, or you type <-. If you then enter the name of the variable, its value is returned. Unlike SPSS or SAS, R places no limit on the length of variable names. It is recommended that you use only alphanumeric symbols for your variable names (A-Z, a-z, 0-9) to avoid confusion with symbols which have a special meaning in R, like ! or $. In contrast to most other programs on Windows, R is case-sensitive: it distinguishes between "Hello", "hello", and "HELLO". A few examples will clarify the use of assignments.
> n = 10
> N <- 100
> n
[1] 10
> N
[1] 100
> m = 81
> c = 299792458
> E = m * c^2
> E
[1] 7.28e+18

It should be noted for all "immigrants" from S-Plus that the assignment operator _ does not work anymore in R (since release 1.8.0). The assignment myvalue_4 is hence invalid.
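Since R is case-sensitive, differently capitalized names refer to different variables. A small illustration (the variable names are arbitrary):

> hello <- 1
> HELLO <- 2
> hello + HELLO
[1] 3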
3.3 Functions
R is strongly related to the “functional programming” paradigm. This means that the basic building blocks of the language are functions which are evaluated. The basic idea is that you give the name of a function and then in parentheses the respective arguments. An introductory example is the usage of the runif-function. This function returns as many uniformly distributed random numbers between 0 and 1 as you request with the given argument.
> runif(5)
[1] 0.9363 0.3241 0.3239 0.9250 0.7919
> runif(10)
 [1] 0.86220 0.37548 0.58685 0.03471 0.24584 0.14294
 [7] 0.55058 0.91449 0.52005 0.17364

R is not restricted to one parameter: many parameters can be passed to a function. Although the sequence of the arguments is enough for the computer to find out which argument has which meaning, it is better for you to give the names of the arguments, especially when you look at the code again at a later point in time. If you do that, you can even change the order of the given arguments.
> runif(10, 1, 100)
 [1] 72.021  1.091 40.617 18.549  5.403 52.899 48.013
 [8] 45.676 95.145 45.116
> runif(10, min = 1, max = 50)
 [1] 33.919 16.578 47.236 23.250 13.579 12.450 16.515
 [8] 11.909  5.535 16.857
> runif(10, max = 50, min = 1)
 [1] 13.781 26.049 28.122 26.058  2.002 14.964  1.857
 [8] 11.002 20.158 25.070
In those three examples above we always asked for 10 uniformly distributed random numbers. In the first case, the minimum is 1, the maximum is 100. In the second case, we asked again for 10 uniformly distributed random numbers. The range in this instance was from 1 to 50 as indicated by the two arguments min= and max=.
The third example requests the same as example 2. By giving the names of the arguments, we were able to change their order. Please note that R allows you to nest functions, as shown in the following example where 100,000 uniformly distributed random numbers are generated and their mean is calculated.
> mean(runif(1e+05))
[1] 0.4999
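The same nested call can of course be written in two steps by storing the intermediate result in a variable; this is often easier to read for longer chains of functions:

> x <- runif(1e+05)
> mean(x)          # again close to 0.5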
3.4 Graphs
One of the major advantages of R is its outstanding capability of displaying data in graphs. Basically, you have control over every little tick-mark you are setting. In these first few steps only some very basic plotting commands are introduced. They should simply show how you can make plots of your data. Figure 3.1 shows a first plot of uniformly distributed random numbers. This plot of points has been produced by the following code. Admittedly, it does not make much sense.
> plot(runif(100))

Also the next plot is not too meaningful (Fig. 3.2). Instead of a point plot we produced a line plot in red by adding two arguments.
> plot(runif(100), type = "l", col = "red")

More meaningful is probably a histogram for such data. Besides the new hist function, another new function is introduced: rnorm. Without any additional arguments, it simply draws as many random numbers as requested from a standard normal distribution (X ~ N(µ = 0, σ = 1)).
> hist(rnorm(1e+05), breaks = 100)
Figure 3.1: A first (nonsense) plot of points (runif(100) plotted against its index)
Figure 3.2: An example of a lineplot (runif(100) plotted against its index)
Figure 3.3: An example of a histogram with ~100 bins for X ~ N(µ = 0, σ = 1) (x axis: rnorm(1e+05), y axis: Frequency)
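The plotting commands used above accept many further arguments for annotation. A minimal sketch (main, xlab and ylab are standard arguments of plot()):

> plot(runif(100), type = "l", col = "red",
+     main = "100 uniform random numbers",
+     xlab = "Index", ylab = "Value")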
3.5 Getting help

    "When I was younger, so much younger than today,
    I never needed anybody's help in any way. ..."
This small tutorial cannot cover all aspects of the R language. Hence, it is necessary that you are able to find the things you are looking for yourself. It should be stressed that R is extremely user friendly; it just may seem in the beginning that it is very selective about who its friends are. But after a while you will agree with the statement that hardly any statistical package provides as much help as R does. The next few paragraphs show you how to proceed if you are looking for something in R.

• If you already know the name of the command (which is a function in R), you simply ask for help by typing:
> help(mean)

• If you do not know the respective function, you can search across all functions:
> help.search("median")

With this output you know that there is a function called median in the package stats which calculates the median value. Then you can proceed by loading the package (actually, the stats package is loaded automatically) and use the help function again.
Figure 3.4: How to access the R manuals and other helpful tools
> library(stats)
> help(median)

• An important source of information is the included documentation. R ships with several manuals which are extremely useful. You can access them if you click in RGui (GUI = R Graphical User Interface) on "Help" and then on "Manuals", as indicated in Figure 3.4. Among them, the manual "An Introduction to R" by Venables, Smith and the R Development Core Team is especially useful for beginners.

• I would also like to point out two more valuable sources of information in the same "Help" menu: "FAQ on R" and "FAQ on R for Windows" answer Frequently Asked Questions.
• R also has a help mailing list. You can subscribe to that list via: https://stat.ethz.ch/mailman/listin This mailing list is very fast and also very helpful. But it can also be very unfriendly if you do not read the posting guide first. You can read the posting guide at http://www.r-project.org/posting-guide.html. The main point is: "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response [...]."
  – Do help.search("keyword") with different keywords (type this at the R prompt).
  – Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt).
  – If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it.
  – Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org).
  – Read at least the relevant section in An Introduction to R.
  – If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting.
  Very helpful in that respect is also the essay "How To Ask Questions The Smart Way" written by Eric S. Raymond. It is available online at http://www.catb.org/esr/faqs/smart-questions.html.

• One point which is not included there but turned out to be very useful for me are the searchable help archives. They are located at http://maths.newcastle.edu.au/rking/R/. The main lesson I have learned there was: my problem is not original. In at least 9 out of 10 cases, somebody else has already asked the question before.

• In addition, there is, of course, also the possibility to ask Jutta Gampe, Francesco Lagona, me, or other R users at the MPIDR. But please go through the recommended help, help.search, ... procedures before you ask us questions, too. The best way is generally (also when writing to the help mailing list) to provide a little piece of code that gives a generalized example of your problem.
3.6 Using an Editor
R code does not usually consist of only one line. It is therefore a good choice to write your code in a text editor and then run the whole file, or parts of it, in the R interpreter. We do not want to impose any editor on you. If you have experience with the Emacs family, then you are most welcome to use it. The editor which has been installed for the course is WinEdt with a special plugin for R. WinEdt is the editor many people use here at the institute to write their R code. It has several features which make it easy to learn and easy to use, for example syntax highlighting and submitting code directly from the editor to R. The Crimson Editor also offers syntax highlighting, as do several other editors. Although they also work, we do not recommend the editors WordPad and Notepad which are part of every Windows distribution.[1]

One possibility to run the code you have written is by doing copy & paste. The other possibility is to "source" the file. An important command for that approach is setwd(), which sets your working directory. Please note that on Windows platforms you need the double backslash (\\), as in setwd("u:\\mypathto\\myworkingdirectory\\"). You can then load the file with the corresponding R code by typing source("mycodefile.r").[2] This lets R run the given file line by line until the end of the file. If it encounters errors, it gives you warning messages.
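A minimal sketch of this workflow (the file name and path are the placeholders used above; note that inside a sourced file, results are only shown if you print() them explicitly):

The file mycodefile.r could contain:

x <- runif(100)
print(mean(x))

and would be run with:

> setwd("u:\\mypathto\\myworkingdirectory\\")
> source("mycodefile.r")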
[1] No, Microsoft Word is not an editor.
[2] It is not necessary to give the code file the extension *.r. Nevertheless, several editors detect which kind of file (R, SAS, Stata, C, C++, ...) is used for syntax highlighting and other useful tools by checking the extension.

3.7 Exercises
Chapter 4
The Representation of Data in R

4.1 Basic Units: Vectors
After these appetizers, we will now dive a bit deeper into the R language. It has already been mentioned that the basic units of manipulating data are functions. The basic units of those data are vectors. Don’t worry: it does not hurt. Actually, you will soon find out that using vectors facilitates working with data a lot.
4.1.1 Generating Data & Types of Data
The first thing you will learn in this section is how to generate such data as a vector. R has very powerful built-in functions to generate (regular) sequences of data. The easiest approach is to concatenate the various elements one after another using the c command.
> c(1, 2, 3, 4, 5, 6, 7, 8, 9)
[1] 1 2 3 4 5 6 7 8 9
> myfirstdata = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> mynextdata = c(234, 456.56, 435, 435, 56, 547)
> options(digits = 3)
> moredata = c(myfirstdata, mynextdata, runif(6))
> moredata
 [1]   1.0000   2.0000   3.0000   4.0000   5.0000
 [6]   6.0000   7.0000   8.0000   9.0000 234.0000
[11] 456.5600 435.0000 435.0000  56.0000 547.0000
[16]   0.0733   0.0403   0.7021   0.0127   0.9119
[21]   0.8008

Numbers are internally stored as floating point numbers (numbers with a decimal point). If you want n digits in your output (this does not affect the number of digits stored), use the command options(digits=n).[1] Numbers can be stored on three different scale levels, just as you know it from your stats lessons. By default, numbers are represented on a metric scale. Categorical data, like religion, sex, ..., are represented as factors in R.

[1] For the technically interested folks: the core part of R is coded in C. The default numeric type of any stored number is double.

> sex <- factor(c(1, 2, 2, 1, 2, 1))
> sex
[1] 1 2 2 1 2 1
Levels: 1 2

If you enter factors into a regression model (which will be shown later), R knows that they cannot be treated like typical numerical values. Data measured on an ordinal scale can be entered into R in two different ways:

> ratings <- factor(c(1, 2, 2, 5, 3, 1, 5, 6), ordered = TRUE)
> ratings
[1] 1 2 2 5 3 1 5 6
Levels: 1 < 2 < 3 < 5 < 6
> ratings <- ordered(c(1, 2, 2, 5, 3, 1, 5, 6))
> ratings
[1] 1 2 2 5 3 1 5 6
Levels: 1 < 2 < 3 < 5 < 6

Besides numeric data, R also allows you to use vectors which consist of character strings. You simply have to enclose the character strings in quotation marks, as in the following example.
> simpsons <- c("Marge", "Homer", "Bart", "Lisa", + "Maggie") > simpsons [1] "Marge"
"Homer" "Bart"
"Lisa"
"Maggie"
The vector myfirstdata was very tedious to enter by hand. At least, if you want to have a longer regular sequence of integers. Then the function seq is quite handy. Some examples will show you its usage.
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1:10)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9
> seq(from = 10, to = 1, by = -3)
[1] 10  7  4  1
> seq(from = 1, to = 20, length = 14)
 [1]  1.00  2.46  3.92  5.38  6.85  8.31  9.77 11.23
 [9] 12.69 14.15 15.62 17.08 18.54 20.00
> rep(1, 5) [1] 1 1 1 1 1
34
CHAPTER 4. THE REPRESENTATION OF DATA IN R
> rep(c(1, 2), 3) [1] 1 2 1 2 1 2 > rep(c(runif(4), c(235.32, 325.432), seq(from = 1, + to = 9999, length = 3)), 3) [1] [6] [11] [16] [21] [26]
3.56e-01 3.25e+02 3.84e-01 1.00e+00 8.51e-02 5.00e+03
3.84e-01 1.00e+00 8.51e-02 5.00e+03 5.46e-01 1.00e+04
8.51e-02 5.00e+03 5.46e-01 1.00e+04 2.35e+02
5.46e-01 1.00e+04 2.35e+02 3.56e-01 3.25e+02
2.35e+02 3.56e-01 3.25e+02 3.84e-01 1.00e+00
> rep(1:3, 1:3) [1] 1 2 2 3 3 3
4.1.2
Referencing Elements of A Vector
A very useful feature of R is the possibility to reference elements of a vector. You are maybe familiar with this concept if you have used the commands vector in SPSS or array in SAS. Given we have a vector generated as in the section like:
> mydata <- rep(1:10, 2) > mydata [1] 1 2 3 4 [18] 8 9 10
5 6
7 8
9 10
1 2
3 4
5 6
7
By using braces like [] R allows us to reference parts of this vector. The 15th element of this vector is returned when you enter
> mydata[15] [1] 5 You can also do more sophisticated selection. Imagine you want to select all elements which are larger than 0 from a vector consisting of 20 random numbers drawn from a standard normal distribution.
> mynewdata <- rnorm(20) > mynewdata
4.1. BASIC UNITS: VECTORS [1] [7] [13] [19]
35
-1.0636 -0.5084 -0.2901 -0.3252 1.0546 1.0711 1.4737 0.6861 -0.1877 0.5322 0.6785 -0.8432 0.0514 1.3047 0.5731 -0.2192 1.4456 -0.5853 1.6863 -1.2310
An intuitive way would probably to write mynewdata[> 0]. But you see that this produces a Syntax Error. The solution is what is actually evaluated in those brackets: the expression given there has to be TRUE in a logical sense. Numbers like 12, 35, or 5 are always TRUE. We have, however, written a comparison by saying greater than zero. But what should be greater than zero? Yes, correct, our given data should be greater than zero. Thus, the solution is:
> mynewdata[mynewdata > 0] [1] 1.0546 1.0711 1.4737 0.6861 0.5322 0.6785 0.0514 [8] 1.3047 0.5731 1.4456 1.6863 The trick is that R evaluates in this case all values of mynewdata and returns the values TRUE and FALSE for each element if this vector.
> mynewdata > 0 [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE [9] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE [17] TRUE FALSE TRUE FALSE Each TRUE element is then selected. This seems like it requires more work than actually necessary. That is true in such a simple example. But it also gives you great flexibility. Given we have two vectors. One is sex of a an individual (1=male, 2=female) and the other is age at death. Now we want to know the mean age at death for women and for men separately. As you will see in the following codeexample, this is easy, because you can give the selection criterion also of another variable.
> sex <- c(1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1) > age <- seq(from = 75, to = 95, length = length(sex)) > mean(age[sex == 1]) [1] 87.3 > mean(age[sex == 2]) [1] 81
36
CHAPTER 4. THE REPRESENTATION OF DATA IN R
4.2
Matrices, Arrays, Dataframes, Lists
Vectors are the basic building blocks for data in R. Vectors can be combined into a matrix, an array, dataframes and/or lists. The following sections will outline their respective characteristics.
4.2.1
Matrices
A matrix can be regarded as a collection of vectors of the same length (i.e. the same number of elements) and of the same type (numeric values vs. character values). Several ways exist to construct a matrix. Here, I would like to address only two possibilities using the previously defined vectors sex and age.2
> myfirstmatrix <- as.matrix(cbind(sex, age)) > myfirstmatrix [1,] [2,] [3,] [4,] [5,] [6,] [7,] [8,] [9,] [10,] [11,]
sex 1 2 2 2 1 1 2 1 1 1 1
age 75 77 79 81 83 85 87 89 91 93 95
> mysecondmatrix <- as.matrix(rbind(sex, age)) > mysecondmatrix sex age sex age
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] 1 2 2 2 1 1 2 1 1 1 75 77 79 81 83 85 87 89 91 93 [,11] 1 95
Yes, there are three examples. But the first two differ only by the functions cbind and rbind. The meaning of them is to bind vectors by row or by column. 2
4.2. MATRICES, ARRAYS, DATAFRAMES, LISTS
37
> mythirdmatrix <- matrix(c(sex, age), nrow = length(sex), + ncol = 2, byrow = FALSE) > mythirdmatrix [1,] [2,] [3,] [4,] [5,] [6,] [7,] [8,] [9,] [10,] [11,]
[,1] [,2] 1 75 2 77 2 79 2 81 1 83 1 85 2 87 1 89 1 91 1 93 1 95
Having data defined as a matrix facilitates the analysis in several ways. For example in a regression setting (y = Xβ +²) you can enter a matrix of covariates at once without referring to each single vector/variable. In addition, R has also built-in functions which work on a matrices. These features will be briefly introduction in Chapter 9 starting on page 95. Referencing elements in matrices works similar to referencing elements of vectors. The only difference is that we have data not only in one but in two dimensions. Consequently, two indices have to be given. The following rule applies if you want to access a single element of a matrix: mymatrix[row-number, column-number]. For example, if we want to have the element in the second column and the fourth row we have to tell R:
> mythirdmatrix[4, 2] [1] 81 If you only want to have the second row or first column you have to write:
> mythirdmatrix[2, ] [1] 2 77 > mythirdmatrix[, 1] [1] 1 2 2 2 1 1 2 1 1 1 1
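Beyond indexing, many built-in functions operate on whole matrices. A small sketch using the matrix defined above (dim, t and %*% are standard R functions; only the first result is shown):

> dim(mythirdmatrix)
[1] 11  2
> t(mythirdmatrix)                                  # transpose: 2 rows, 11 columns
> mythirdmatrix %*% matrix(1, nrow = 2, ncol = 1)   # matrix product: row sums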
38
CHAPTER 4. THE REPRESENTATION OF DATA IN R
4.2.2
Arrays
The aforementioned matrices can be considered as a special case of an array. While matrices represent data in a two-dimensional way, arrays allow to have data in ndimensions.3 An example for a three-dimensional array could be a collection of data, where death rates are gathered by period and age (i.e. a Lexis diagram) for several countries. Arrays can be constructed with the array(data, dimensionvector) function. An example should clarify this approach. 32 uniform random numbers are put into a three-dimensional array with 2 rows, 8 columns and 2 “layers”.
> myarray <- array(runif(32), dim = c(2, 8, + 2)) > myarray , , 1 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 0.425 0.238 0.3571 0.559 0.099 0.223 0.788 0.391 [2,] 0.231 0.961 0.0784 0.333 0.362 0.296 0.761 0.565 , , 2 [1,] [2,] [1,] [2,]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] 0.6538 0.824 0.134 0.0759 0.475 0.0763 0.943 0.0871 0.939 0.312 0.3185 0.815 0.0768 0.553 [,8] 0.247 0.308
You can already see how you can reference the elements of an array by looking at the output from the previous entered code lines. Figure 4.1 on page 39 should make referencing and understanding arrays a bit simpler. While matrices need two indices to reference elements, arrays need as many indices as there are dimensions. In Figure 4.1 we have three dimensions. As previously shown, the first and second index refer to the rows and columns, respectively. The third index indicates which “layer” is meant.
3
As far as I know is the maximum number of dimensions in R n = 7.
4.2. MATRICES, ARRAYS, DATAFRAMES, LISTS
39
Figure 4.1: Understanding and Referencing an Array
NULL
[1,1,4]
[1,4,4]
[1,1,3]
[1,4,3]
[1,1,2] [1,1,1]
[1,4,2] [1,4,1]
[4,4,4] [4,4,3] [4,4,2]
[4,1,1]
[4,4,1]
40
CHAPTER 4. THE REPRESENTATION OF DATA IN R
The “upper, right” element on the second “layer” in such a scenario could therefore be referenced as:
> myarray <- array(runif(64), c(4, 4, 4)) > myarray[1, 4, 2] [1] 0.83
4.2.3
Dataframes
Dataframes represent data in a two-dimensional way like matrices. They are the closest representation of data in R as you may know it from packages like SPSS or STATA. That means that one row typically represents the values of several variables for one individual/unit. In addition, it is also possible to assign names to each row. See the following example how to construct a dataframe:
> + > > > + >
simpsons <- c("Marge", "Homer", "Bart", "Lisa", "Maggie") ages <- c(34, 38, 10, 8, 0) iq <- c(100, 70, 80, 140, NA) mydataframe <- data.frame(cbind(ages, iq), row.names = simpsons) mydataframe
Marge Homer Bart Lisa Maggie
ages 34 38 10 8 0
iq 100 70 80 140 NA
You can reference vectors and elements in dataframes in the same way as you do for matrices.
> mydataframe[3, 2] [1] 80 > mydataframe[, 1] [1] 34 38 10 8
0
4.2. MATRICES, ARRAYS, DATAFRAMES, LISTS
41
There exists also two other ways to reference columns. With the $ sign or with [[]] you can specify which variable you select.
> mydataframe$ages [1] 34 38 10 8
0
> mydataframe$iq [1] 100
70 80 140
NA
> mydataframe[[2]] [1] 100
70 80 140
NA
If you are not sure what the variable names of your dataframe are you can simply ask for them:
> names(mydataframe) [1] "ages" "iq"
4.2.4
Lists
The most flexible way to store data in R is a list. It does not put any restrictions on the length or type of data. The elements of lists can be vectors, matrices, dataframes and even lists. This kind of construction is sometimes useful if you want to collect all relevant information of one entity in one object. Look at the following example how you construct, how you reference elements of a list and how you can ask for the names of list elements.
> + > > > > > > + + >
simpsons <- c("Marge", "Homer", "Bart", "Lisa", "Maggie") ages <- c(34, 38, 10, 8, 0) sex <- c(2, 1, 1, 2, 2) address <- "743 Evergreen Terrace" pets <- c("Snowball II", "Santa s Little Helper") sisters <- c("Thelma", "Betty") mylist <- list(familynames = simpsons, demographics = data.frame(cbind(ages, sex)), location = address, pets = pets, relatives = sisters) mylist
42
CHAPTER 4. THE REPRESENTATION OF DATA IN R
$familynames [1] "Marge" "Homer" "Bart"
"Lisa"
"Maggie"
$demographics ages sex 1 34 2 2 38 1 3 10 1 4 8 2 5 0 2 $location [1] "743 Evergreen Terrace" $pets [1] "Snowball II"
"Santa s Little Helper"
$relatives [1] "Thelma" "Betty" > mylist$demographics 1 2 3 4 5
ages sex 34 2 38 1 10 1 8 2 0 2
> mylist[[4]] [1] "Snowball II"
"Santa s Little Helper"
> names(mylist) [1] "familynames" [4] "pets"
"demographics" "location" "relatives"
Maybe you think right now that lists are some kind of odd way to store information. But you will see later on that a lot of data in R is represented as a list and that it is very convenient to use that construction. For example, regression results are stored as a list with one element the actual function call, another containing the estimated coefficients, a third element consists of the fitted values, . . .
4.3. MISSING VALUES AND OTHER ODD NUMERICAL VALUES
4.3
43
Missing values and other odd numerical values
Missing values in R are labelled NA. It has already been briefly introduced in the vector iq when there was the value Not Available. The typical rule is that any operation involving NAs results in NA:
> mydataframe[5, 2] [1] NA > mydataframe[5, 2] + 4 [1] NA > mean(iq) [1] NA Many built-in functions give you, though, the possibility to operate on data even if they contain missing values. One example is the mean where you can pass the argument that na.rm=TRUE which stands for Not Available ReMove = TRUE. There is also a somewhat trickier way using the negation operator ! and the function is.na which is a function returning a logical value of TRUE or FALSE for each element.
> mean(iq, na.rm = TRUE) [1] 97.5 > mean(iq[!is.na(iq) == TRUE]) [1] 97.5 Besides missing values indicated as NA, R knows also three other “odd numerical values”: NaN, Inf and -Inf. NaN means Not a Number. You get this results when you take, for example, the square root of a negative number or by dividing
> 0/0 [1] NaN > sqrt(-1)
44
CHAPTER 4. THE REPRESENTATION OF DATA IN R
[1] NaN As you may guess from its abbreviation, Inf stands for Infinity. As there is no actual limit to numbers, you may wonder when you actually reach infinity on a finite machine like a computer. You can ask R actually for infinity by using the command .Machine which returns various values. What we are interested in are
> .Machine$double.xmax [1] 1.80e+308 > .Machine$double.xmin [1] 2.23e-308 If you cross this upper and lower limit, you reach infinity using R.4
> .Machine$double.xmax * 1.00000000000001 [1] Inf
4.4
Requesting Information About Data Structures
With the following functions you can retrieve information about R data objects which can be very useful. Besides the already names (which is actually an extraction function as you see below) introduced, you can use the following commands:
> dimnames(mydataframe) 4
Just in case you wonder why the value is a power of 308: it follows the IEEEStandard 754 (1985) for binary floating point operations. This standard is (should be) supported by all C-Compilers which is the language the core part of R was written. If you want to compile R yourself at home and you know that your CCompiler supports higher values than the one specified by IEEE, you simpliy change the constant DBL MAX in file (I forgot where it is). You can obtain more information about this interesting subject at http://grouper.ieee.org/groups/754/ or at http://www.validlab.com/goldberg/paper.ps An interesting reading material about those limits is also “The Ariane 5 explosion as seen by a software engineer” located at http://www.cas.mcmaster.ca/baber/TechnicalReports/Ariane5/Ariane5.htm.
4.5. EXERCISES [[1]] [1] "Marge"
45
"Homer" "Bart"
"Lisa"
"Maggie"
[[2]] [1] "ages" "iq" > attributes(mylist) $names [1] "familynames" [4] "pets"
"demographics" "location" "relatives"
> mode(iq) [1] "numeric" > mode(simpsons) [1] "character" > mode(mylist) [1] "list" > mode(mydataframe) [1] "list" > typeof(mylist) [1] "list" > typeof(ages) [1] "double" > typeof(simpsons) [1] "character" > length(simpsons) [1] 5 > length(mylist) [1] 5
4.5
Exercises
46
CHAPTER 4. THE REPRESENTATION OF DATA IN R
Chapter 5
Distributions in R

5.1 Introduction
R has many built-in distributions which are useful for many purposes: for example, if you want to simulate an experiment and require random numbers from a certain distribution, or if the test statistic t of a test yields a certain value and you would like to know how probable this value is if t follows a certain distribution under the null hypothesis. For each distribution, R offers four specific functions: the density, the distribution function, the quantile function and the generation of random numbers. You call one of these functions for a certain distribution by giving the name.in.R of the distribution preceded by one of the following letters: d (for the density), p (for the distribution function), q (for the quantile function) or r (for the generation of random numbers). How these functions can be used will be shown by examples in the following sections, based on the normal distribution.
5.2
(Cumulative) Distribution Function p
The cumulative distribution function (often abbreviated as cdf ) of a random variate X, denoted by FX (x), is defined by (Casella and Berger, 1990): FX (x) = PX (X ≤ x),
for all x
(5.1)
FX (x) denotes, thus, the probability for a specified distribution that any a value is realized which is smaller or equal to “your” X. In R, you get this probability
47
48
CHAPTER 5. DISTRIBUTIONS IN R
by preceding the name of your function with a p. We demonstrate this using the standard normal distribution with mean µ = 0 and standard deviation σ = 1. If X = 0, pnorm(0) should return the value 0.5 as this distribution is symmetric around µ — and it actually does.
> pnorm(0) [1] 0.5 You probably remember from your statistics lessons that you always took plus/minus two times the standard deviation to construct a confidence interval on the 95% level for normally distributed data. Let’s check whether this results in the values 0.025 and 0.0975.
> pnorm(-2) [1] 0.0228 > pnorm(2) [1] 0.977 Well, not exactly. But it should be close enough. If you want to take “better values” use 1.96 as a factor in the future.
> options(digits = 6) > pnorm(-1.96) [1] 0.0249979 > pnorm(1.96) [1] 0.975002
5.3
Density Function d
For continuous distributions, the density function is defined as (Vogel, 1995): fX (x) =
dFX (x) , dx
for all x.
(5.2)
For discrete distributions, this function is often also called probability distribution function and is defined simply as (Casella and Berger, 1990): fX (x) = P (X = x),
for all x.
(5.3)
5.3. DENSITY FUNCTION D
49
fX (x) denotes, thus, the probability (the density) to obtain your realization X for a given and specified distribution. In the case for the result from tossing a (fair) dice, the result is easy: as every number is equally probably fX (x) = P (X = x) = 16 for all x. But how can a value from a continuous distribution such as:
> dnorm(0.275) [1] 0.384139 be interpreted? Plotting the density of the corresponding (standard) normal distribution with reference lines at x = 0.275 and fX (0.275) should facilitate the understanding of that concept.
50
CHAPTER 5. DISTRIBUTIONS IN R
Figure 5.1: Density of A Standard Normal Distribution
NULL NULL
0.5
NULL
x:
0.275
0.0
0.1
0.2
fX(x)
0.3
0.4
fX(0.275) 0.384138915305705
−4
−2
0 x
2
4
5.4. QUANTILE FUNCTION Q
5.4
51
Quantile Function q
The quantile function which is requested by preceding the distribution in R with q can be seen as the inverse of the cdf . For a given probability FX (x) (bounded, of course, between [0; 1]) what is the corresponding value of x? Please see the following examples:
> qnorm(0.5) [1] 0 > qnorm(0.975) [1] 1.96 > qnorm(0.999) [1] 3.09
5.5
Generation of Random Numbers r
To obtain random numbers from a specified distribution, you typically have to precede the name of the distribution in R with an r. To generate 10 random numbers from a standard normal distribution, you simply enter:
> rnorm(10) [1] 1.6937 1.0485 -0.9687 -0.0609 1.1327 -0.8732 [7] 1.5037 0.1354 0.1451 -1.1404
5.5.1
Digression
It is actually not required to use this idea, but I want to briefly show you how you can obtain random numbers without using the function rdistributionname. It has been shown (e.g. Dagpunar, 1988): if you want to generate a random variate X of a distribution with cdf FX (·) you simply need to use the inverse of the cdf X = FX−1 (R)
(5.4)
where R is a uniformly distributed random number between 0 and 1 (R ∼ U (0, 1)). We use the inverse of the cdf which is equivalent to the quantile function described before. Now let’s try it:
52
CHAPTER 5. DISTRIBUTIONS IN R
> qnorm(runif(1)) [1] -0.0533 Seems to work. Let’s make a check:
> mean(rnorm(10^4)) [1] 0.00963 > sd(rnorm(10^4)) [1] 1.01 > mean(qnorm(runif(10^4))) [1] -0.0029 > sd(qnorm(runif(10^4))) [1] 1
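The same inversion idea works for any distribution whose quantile function is available in R. For instance, with the exponential distribution (a minimal sketch; qexp and rexp are the built-in quantile function and random-number generator):

> qexp(runif(5), rate = 2)   # five Exp(rate = 2) random numbers via the inverse cdf
> rexp(5, rate = 2)          # the built-in generator, for comparison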
5.6
Overview of Functions
Table 5.1 gives an overview of the built-in distributions (in package stats) and their respective names in R. Please check yourself by using ?name.in.r (e.g. ?rbinom or ?dpois) for the required parameters of the specific distribution you are interested in.
5.7 The Gompertz Distribution

5.7.1 Introduction
Unfortunately, the Gompertz distribution is not built-in. However, we will need it quite often in survival analysis. Therefore, I have defined several useful functions for the Gompertz distribution. Please check Section 7.4, starting on page 68, to learn how you can define functions yourself. The Gompertz distribution always requires two parameters:
• the scale parameter α and
• the shape parameter β.
5.7. THE GOMPERTZ DISTRIBUTION
Table 5.1: Overview of Built-In Distributions (package stats)

Real Name                        R Name (†)
Beta                             beta
Binomial                         binom
Cauchy                           cauchy
Chi-Square (χ2)                  chisq
Exponential                      exp
F                                f
Gamma (Γ)                        gamma
Geometric                        geom
Hypergeometric                   hyper
Log-Normal                       lnorm
Logistic                         logis
Multinomial                      multinom
Negative Binomial (‡)            nbinom
Normal                           norm
Poisson                          pois
Student's t                      t
Uniform                          unif
Weibull                          weibull
Wilcoxon Signed Rank Statistic   signrank

(†) The R name has to be preceded by r, d, p or q to obtain random numbers, the density, the distribution function, or quantiles from the respective distribution.
(‡) Another version exists in package MASS; use it via: library(MASS); rnegbin(...); qnegbin(...); ...
53
54
CHAPTER 5. DISTRIBUTIONS IN R
5.7.2
Hazard Function
The hazard function h(x) of the Gompertz distribution is defined as:

h(x) = α exp(βx)    (5.5)
And in R:
> hgompertz <- function(x, alpha, beta) {
+     return(alpha * exp(beta * x))
+ }
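A quick way to see what such a function returns is to evaluate the hazard at a given age. The parameter values below (α = 0.0001, β = 0.1) are the "typical" human-mortality values used later in this chapter; with options(digits = 3) the output would look as follows:

> hgompertz(x = 80, alpha = 1e-04, beta = 0.1)
[1] 0.298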
5.7.3
Survivor Function
The survivor function S(x) of the Gompertz distribution is defined as:

S(x) = exp( (α/β) (1 − exp(βx)) )    (5.6)

And in R:

> Sgompertz <- function(x, alpha, beta) {
+     return(exp((alpha/beta) * (1 - exp(beta * x))))
+ }
5.7.4
Cumulative Hazard Function
The cumulative hazard H(x) is defined as: H(x) = − ln S(x)
(5.7)
And in R:
> Hgompertz <- function(x, alpha, beta) {
+     return(-log(Sgompertz(x, alpha, beta)))
+ }
5.7.5
Density Function
The density function of the Gompertz distribution is defined as:

fX(x) = α exp(βx) exp( (α/β) (1 − exp(βx)) )    (5.8)

And in R:

> dgompertz <- function(x, alpha, beta) {
+     return(alpha * exp(beta * x) * exp((alpha/beta) * (1 - exp(beta * x))))
+ }
5.7.6
Cumulative Distribution Function
The cumulative distribution function FX (x) of the Gompertz distribution is defined as: FX (x) = 1 − S(x) (5.9)
> pgompertz <- function(x, alpha, beta) {
+     return(1 - Sgompertz(x, alpha, beta))
+ }
5.7.7
Quantile Function
To obtain the quantile function of the Gompertz distribution we simply have to invert the cdf:

FX^(-1)(p) = β^(-1) [ ln( −ln(1 − p) + α/β ) − ln(α/β) ]    (5.10)

> qgompertz <- function(p, alpha, beta) {
+     return(beta^(-1) * (log(-log(1 - p) + (alpha/beta)) - log(alpha/beta)))
+ }
5.7.8
Generation of Random Numbers
To obtain random numbers from the Gompertz distribution, we simply take the quantile function FX^(-1) and plug in random numbers from a uniform distribution (R ~ U(0, 1)):

FX^(-1)(R) = β^(-1) [ ln( −ln(1 − R) + α/β ) − ln(α/β) ]    (5.11)

In R, you use:

> rgompertz <- function(n, alpha, beta) {
+     return(beta^(-1) * (log(-log(1 - runif(n)) + (alpha/beta)) - log(alpha/beta)))
+ }
56
CHAPTER 5. DISTRIBUTIONS IN R
Typical parameters for a human population whose mortality follows a Gompertz distribution are:
> alpha = 1e-04
> beta = 0.1

To test whether our formulas are working correctly, we generate 100,000 random numbers and overlay the resulting histogram with the density (see Fig. 5.2 on page 57).
> mygompis <- rgompertz(10^5, alpha, beta)
> hist(mygompis, breaks = 50, xlim = c(0, 100),
+     freq = FALSE, ylim = c(0, 0.05))
> lines(x = 0:100, y = dgompertz(0:100, alpha, beta),
+     lwd = 3, col = "red")
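As a quick plausibility check of these parameter values, the quantile function defined above can be used to compute the median age at death they imply (roughly 65 years; the exact printed value assumes options(digits = 3)):

> qgompertz(0.5, alpha, beta)
[1] 65.4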
5.8
Generating Random Numbers from Given Data — sample
Given you have some data and you want to pick one or more of those data at random, you use the function sample. In its simple form, you have to feed this function only with two arguments. The data from which you want to sample x and how many times you want to “pick” from this urn (size).
> sample(x = c("blue", "red", "green"), size = 2) [1] "green" "blue" Automatically, R assigns equal probabilities to each element of x and does not replace an element which has been picked. You can change these settings using the additional arguments prob and replace.
> sample(x = c("blue", "red", "green"), size = 10, + replace = TRUE, prob = c(0.5, 0.3, 0.2)) [1] "red" [7] "red"
"green" "blue" "blue" "blue" "red" "red"
"blue" "blue"
The function table will be shown later on. It gives you simple frequency tables. Let’s check whether we approach the given probabilities when we increase the sample size.
5.8. GENERATING RANDOM NUMBERS FROM GIVEN DATA — SAMPLE57
Figure 5.2: Overlapping Gompertz Random Numbers with the Corresponding Density
0.03 0.02 0.01 0.00
Density
0.04
0.05
Histogram of mygompis
0
20
40
60
mygompis
80
100
58
CHAPTER 5. DISTRIBUTIONS IN R
> (table(sample(x = c("blue", "red", "green"), + size = 10, replace = TRUE, prob = c(0.5, + 0.3, 0.2))))/10 blue green 0.3 0.1
red 0.6
> (table(sample(x = c("blue", "red", "green"), + size = 100, replace = TRUE, prob = c(0.5, + 0.3, 0.2))))/100 blue green red 0.59 0.13 0.28 > (table(sample(x = c("blue", "red", "green"), + size = 1000, replace = TRUE, prob = c(0.5, + 0.3, 0.2))))/1000 blue green red 0.505 0.172 0.323 > (table(sample(x = c("blue", "red", "green"), + size = 10000, replace = TRUE, prob = c(0.5, + 0.3, 0.2))))/10000 blue green red 0.500 0.203 0.297
5.9
Further Reading
• Dagpunar (1988) • Casella and Berger (1990) • Abramowitz and Stegun (1972)
5.10
Exercises
Chapter 6
Data Input/Output

6.1 Reading Data

6.1.1 Introduction
So far, we have used only data which we generated ourselves. One of the major tasks of any statistical package is, though, to be able to analyze given data. How to read in data will be explained in the following subsections.
6.1.2
Setting the Working Directory
Here I want to briefly repeat the function setwd which sets the current working directory to the path and directory you want to read data from.
> setwd("n:\\Survive\\Datasets\\") Please note that the argument given (n:\\Survive\\Datasets\\) requires to be quoted ("...") and the various directories need to be separated by double backslashs (\\) if you are using a Windows platform. On Unix/Linux platform, you simply use one ordinary slash (/). You can let R tell you what the current working directoy is by using the getwd() function. The content of the current working directory is returned by using the function dir(). Please see the following code-example:
> getwd() [1] "n:/Survive/Datasets" > dir()
59
60 [1] [2] [3] [4] [5]
CHAPTER 6. DATA INPUT/OUTPUT "asthm98.sav" "Italy.cohort1890.txt" "Rintro.tex" "Sweden.cohort1890.txt" "Switzerland.cohort1890.txt"
6.1.3
Reading Text Data
Any statistical package is able to read pure text data (ASCII format). This format therefore allows data to be exchanged regardless of which software is used. In R this operation is performed using the function read.table(). It reads in the specified file and transforms it into a dataframe. Several arguments are often used for this function and are explained now (if you are missing some explanation, please use ?read.table). The only required argument in any case is the name of the file. The argument header is a logical variable indicating whether variable names are included at the top of each column or not. The default setting is FALSE; it therefore assumes no variable names if header is not set to TRUE. The argument sep specifies which delimiter has been used to separate the columns/variables. If you are using a *.csv file you need to tell R sep=",", if it is a space you write sep=" ", and in case of a TAB-separated file you specify the delimiter in C style: sep="\t". With these arguments, you should be able to read almost any text data. Sometimes the arguments skip and nrows are useful to specify whether you want to skip several lines in the beginning and/or only want to read a certain number of lines (= rows). Let's show this via an example:
> dir() [1] [2] [3] [4] [5]
"asthm98.sav" "Italy.cohort1890.txt" "Rintro.tex" "Sweden.cohort1890.txt" "Switzerland.cohort1890.txt"
> sweden1890 <- as.data.frame(read.table("Sweden.cohort1890.txt", + sep = " ", header = TRUE)) > names(sweden1890) [1] "ageatdeath" "no.females" "no.males" > sum(sweden1890$no.females) [1] 60786
"no.total"
6.2. DATA EDITOR
6.1.4
61
Reading Binary Data
R is also able to read binary data from other software packages. The package foreign can read data from the following programs: • Epi Info • Minitab • SAS XPORT • SPSS • Stata: My own experience has shown me that R (version 1.8.0) had problems reading new stata data. Thus it is recommended to save your data in Stata using the flag old. E.g. save filename, old replace • S3: The ”old” binary format used in S-Plus for “S3-Classes”. Modern versions of S-Plus use now the S4-classes. Please use the command data.dump(..., oldstyle=TRUE) to enable R to read S-Plus binary data sets. • Excel Worksheets (provided by the package gregmisc). As usual, let’s show how to do it via an example. The specified data-set contains all deaths due to asthma in the United States in the year 1998 by sex, age, and month of death. Please note that read.spss does not read the data in automatically as a dataframe. Thus, we used the as.data.frame-function to enforce this conversion.
> library(foreign) > asthma <- as.data.frame(read.spss("asthm98.sav")) > names(asthma) [1] "SEX" "DY" [5] "AGECLASS" "COUNT_1"
"DM"
"AGE"
For other please read the help file for¨the foreign ¥ ¨ data formats, ¥ ¥ package ¨ “foreign” (Click on Help , Html Help and then choose the package foreign . For the § ¦§ ¦ § ¦ Excel-Format: library(gregmisc), ?read.xls.
6.2
Data Editor
Fiddling around manually in your data — as you may know it from SPSS or Excel — is also possible in R via the Data Editor. You can access the data editor using either fix(dataobject) or edit(dataobject) or via data.entry(dataobject), e.g. data.entry(asthma) or fix(asthma) or edit(asthma).
62
6.3
CHAPTER 6. DATA INPUT/OUTPUT
Writing Data
Sometimes, you do not only want to read data, but sometimes you also want to write to disk. The standard method to write data is the write.table-function. Not only its name but also its usage is very close to the reading-command read.table. Given you want to write the dataframe mydataframe to disk including the “variable” names and you want to separate the columns using a comma, you have to do the following (please note how the col.names=TRUE in write.table corresponds to header=TRUE in read.table):
> mydataframe ages Marge 34 Homer 38 Bart 10 Lisa 8 Maggie 0
iq 100 70 80 140 NA
> write.table(x = mydataframe, file = "simpsonsdata.txt", + sep = ",", row.names = TRUE, col.names = TRUE) > mynewdataframe <- read.table("simpsonsdata.txt", + header = TRUE, sep = ",") > mynewdataframe Marge Homer Bart Lisa Maggie
6.4
ages 34 38 10 8 0
iq 100 70 80 140 NA
Further Reading
Besides the aforementioned help-files on ?read.table, ?read.spss, . . . , the included manual “R Data Import/Export”, written by the R development core team, provides the most¨comprehensive overview how ¥ to read and write data in R. You ¥¨ can access it via Help , Manuals (in PDF) . §
6.5
¦§
Exercises
¦
Chapter 7 Programming with R 7.1
Introduction
So far, you might have the impression that R is nothing but just another statistical package just like SPSS or STATA. R is, however, much more: it is a high-level programming language (with many pre-built possibilities to manipulate, analyse and display data) which is capable to do anything other programming languages such as C, Fortran, or Pascal can do. The trade-off for its high-level of abstraction is the relative complexity of the language which seems to be overwhelming for the novice user.1 The following few sections should introduce you to the main concepts of programming in R.
7.2  Grouping Expressions
If you have some experience in programming, you know that any language provides a special construct for grouping expressions. If you have never heard of that, don't worry: you will grasp its usefulness within the next ten minutes. In R,2 you can use curly braces to group your expressions. These expressions can either be separated by a semicolon or by a new line.
{ expression1; expression2; ...; expressionN }

{
    expression1
    expression2
    ...
    expressionN
}

1 Especially compared to C, which is built upon 32 keywords (Kernighan and Ritchie, 2000).
2 Just like in C or in LaTeX.
7.3  Flow-Control

7.3.1  Conditional Execution
Branching via if

Conditional execution is typically performed in any programming language using the command/function if. This is also the case for R. The main syntax is:
if ( condition is TRUE ) do_this

> simpsons
[1] "Marge"  "Homer"  "Bart"   "Lisa"   "Maggie"
> if (ages[2] == max(ages)) print("Homer is the oldest")
[1] "Homer is the oldest"

Using the curly braces, which serve as grouping symbols, we can include more than one expression.
> if (ages[2] == max(ages)) {
+     another.expression <- 3
+     print("Homer is the oldest")
+     34 * 435
+ }
[1] "Homer is the oldest"
[1] 14790

As in virtually any programming language, R also allows you to tell the interpreter what happens if the condition is not TRUE but FALSE, via the else statement.
> myrandom <- rnorm(1)
> myrandom
[1] 1.42
> if (myrandom < 0) {
+     print("My Random Number is Smaller than 0.")
+ } else {
+     print("My Random Number is 0 or Greater.")
+ }
[1] "My Random Number is 0 or Greater."

When working with if constructions, one sometimes wants to combine several conditions. For this purpose R provides the operators &, &&, |, ||. & and && are needed for AND constructions; | and || are their OR counterparts. The main difference is that the single operators apply the conditions element by element to vectors, whereas the double operators only apply to vectors of length one. See the following example for clarification:
> myrandom <- rnorm(10)
> myrandom
 [1] -0.847 -1.237  0.238 -0.521 -1.245 -1.241 -0.775
 [8]  0.175  0.797  0.735
> (myrandom < 0) && (myrandom < -1)
[1] FALSE
> (myrandom < 0) & (myrandom < -1)
[1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
[9] FALSE FALSE
If you are working with vectors — which is usually the case — the function ifelse comes in handy. The syntax of ifelse is: ifelse(condition,true,false).
> iq
[1] 100  70  80 140  NA

> ifelse(iq < 100, "Stupid", "Clever")
[1] "Clever" "Stupid" "Stupid" "Clever" NA
Branching via switch

Less often used than if but present in many programming languages is the switch function. Meaningful applications are hard to imagine outside of new, user-defined functions. This branching option is therefore introduced in Section 7.4 on page 68, where it is explained how to develop your own functions.
7.3.2  Repetitive Execution

Besides the branching command if, there are several commands available in R for repetitive execution.
for-loops

The most common command for repetitive execution is the for loop. The syntax is for (parameter in parameterspace) dosomething. The parameter parameter takes on the first value of parameterspace in the first loop and executes dosomething. In the second run the second value is taken from parameterspace, and so on, until the last value is taken from parameterspace and dosomething is executed for the last time. R accepts numeric values as well as character vectors. See the following examples for clarification (they also show, for the first time in this booklet, the possibility of nesting these control-flow statements):
> for (i in simpsons) {
+     print(i)
+ }
[1] "Marge"
[1] "Homer"
[1] "Bart"
[1] "Lisa"
[1] "Maggie"

> for (i in 1:5) {
+     print(simpsons[i])
+ }
[1] "Marge"
[1] "Homer"
[1] "Bart"
[1] "Lisa"
[1] "Maggie"
> for (i in (1:(length(simpsons)))) {
+     if (i < 2) {
+         print("Who is clever and who is stupid among the Simpsons?")
+     }
+     if (!is.na(iq[i])) {
+         if (iq[i] >= 100) {
+             print(c("Clever: ", simpsons[i]))
+         }
+         else {
+             print(c("Stupid: ", simpsons[i]))
+         }
+     }
+ }
[1] "Who is clever and who is stupid among the Simpsons?"
[1] "Clever: " "Marge"
[1] "Stupid: " "Homer"
[1] "Stupid: " "Bart"
[1] "Clever: " "Lisa"
while-loops

Less often used is the construction of a while-loop. The syntax is a bit simpler than in the case of the for-loop: while (condition is true) dosomething, as you can see in this code piece:
> i <- 10
> while (i > 0) {
+     print(i * i)
+     i <- i - 1
+ }
[1] 100
[1] 81
[1] 64
[1] 49
[1] 36
[1] 25
[1] 16
[1] 9
[1] 4
[1] 1
As you can see, the while-loop necessarily needs an increment/decrement in the dosomething part (the body); otherwise the program would end up in an infinite loop. The while construction is often used in numerical optimization to repeat a step until a certain lower boundary of error tolerance has been reached.
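As an illustrative sketch of this idiom (the function mysqrt below is a made-up example, not part of R): a while-loop that repeats Newton steps for the square root of a until the approximation error falls below a tolerance.

> mysqrt <- function(a, tol = 1e-08) {
+     x <- a
+     while (abs(x * x - a) > tol) {
+         x <- (x + a/x)/2    # one Newton step for f(x) = x^2 - a
+     }
+     return(x)
+ }
> mysqrt(2)    # should be very close to sqrt(2)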
repeat-loops

The syntax for the third loop construction, the repeat-loop, is simply: repeat dosomething. In comparison to the while-loop, there is no condition which has to be fulfilled. Therefore one has to include a statement telling R when to stop iterating. In R, this stop signal is the command break. Another difference is that while first tests the condition, which has to be true, whereas repeat immediately starts with the execution of the dosomething part. An example is shown here:
> i <- 5
> repeat {
+     print(simpsons[i])
+     i = i - 1
+     if (i < 1) {
+         break
+     }
+ }
[1] "Maggie"
[1] "Lisa"
[1] "Bart"
[1] "Homer"
[1] "Marge"
7.4  Writing Functions
One of the biggest advantages of R is its immense flexibility: if you are unhappy with the way something is implemented, or if you think something is missing, you can simply define your own function. Many functions in R have themselves been programmed that way (e.g. the standard deviation sd or the median median). The syntax is:
nameoffunction <- function(argument1, argument2, ...) {
    function-body
    return(myresult)
}
You assign a new function to a name (object) just like you assign values to variables via the <- operator. You need to tell the new function what kind of arguments it requires (argument1, argument2, ... for the function-body).3 Inside the function-body your calculations are done. If you want to have a certain value returned, you need to give the return-statement. This command is optional. If you omit it, the last calculated value will be returned. See the following simple example which squares the entered data(-vector).
> mysquare <- function(x) {
+     myresult <- x * x
+     return(myresult)
+ }
> mysquare(34)
[1] 1156
> mysquare(1:10)
 [1]   1   4   9  16  25  36  49  64  81 100
You can also give some default settings for the arguments. If you want to re-write the mean function to automatically exclude missing values (NAs), you simply do:
> mymean <- function(x, missings = TRUE) {
+     return(mean(x, na.rm = missings))
+ }
> iq
[1] 100  70  80 140  NA
> mean(iq)
[1] NA
> mymean(iq)
[1] 97.5
> mymean(iq, FALSE)
[1] NA

3 There is no need, as in many programming languages, to tell R of what data type these arguments are.

R is not restricted to returning just one value. The following code shows an extended example for calculating descriptive statistics of a given data-set.
> mydesc <- function(x) {
+     rsltmin <- min(x)
+     rsltmax <- max(x)
+     rsltmed <- median(x)
+     rsltmean <- mean(x)
+     rsltsd <- sd(x)
+     rsltvar <- var(x)
+     myresult <- list(min = rsltmin, max = rsltmax,
+         median = rsltmed, mean = rsltmean,
+         stdev = rsltsd, var = rsltvar)
+     return(myresult)
+ }
> exampledata <- rnorm(1e+05)
> mydesc(exampledata)
$min
[1] -4.23

$max
[1] 4.31

$median
[1] 0.00419

$mean
[1] -5.38e-05

$stdev
[1] 1.00

$var
[1] 1.01

> (mydesc(exampledata))[[6]]
[1] 1.01
> (mydesc(exampledata))$var
[1] 1.01

R also allows you to define functions within other functions. Please see the following code for a hypothetical example.
> average <- function(x) {
+     sumofelements <- function(data) {
+         return(sum(data))
+     }
+     numberofelements <- function(data) {
+         return(length(data))
+     }
+     core.procedure <- function(mydata) {
+         the.result <- sumofelements(mydata)/numberofelements(mydata)
+         return(the.result)
+     }
+     my.final.result <- core.procedure(x)
+     return(my.final.result)
+ }
> average(exampledata)
[1] -5.38e-05
> mydata
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7
[18]  8  9 10
Please note that the original data-set mydata has not been changed, although we used an argument called mydata. The reason is that R can distinguish whether an object is only needed within a function or whether it is globally defined.4 As promised in the section on conditional execution, the switch statement mainly makes sense within a new, user-defined function. Please look at the following example to see how R can switch between several branches. We construct, in a similar way as before, a function which calculates various descriptive statistics. First, we give an example without the switch statement. Afterwards, we show that switch helps to write more elegant code.

4 If you want to learn more about it: this feature is called the scope of functions.
> descnoswitch <- function(x, type = "mean") {
+     if (type == "mean") {
+         myresult <- mean(x)
+     }
+     if (type == "median") {
+         myresult <- median(x)
+     }
+     if (type == "var") {
+         myresult <- var(x)
+     }
+     return(myresult)
+ }
> descswitch <- function(x, type = "mean") {
+     myresult <- switch(type, mean = mean(x),
+         var = var(x), median = median(x))
+     return(myresult)
+ }
> descswitch(mydata, "median")
[1] 5.5
> descnoswitch(mydata, "median")
[1] 5.5
7.5  Problems with loops
Compared to compiled languages, R is slow at executing loop structures. It is therefore recommended to avoid loops as much as possible. Thanks to the vectorized design of R, this is easier (in most cases) than it may seem. It should be noted, however, that it is not always possible to vectorize a loop; some procedures are intrinsically repetitive. These vectorizations are usually performed by using one of the following functions: apply, sapply, tapply, or lapply. We will only show how to use the functions lapply, sapply, and apply. Consider 5 populations which are combined into one list:
> pop1 <- rnorm(10^3, mean = 12, sd = 32)
> pop2 <- rnorm(10^3, mean = 34, sd = 12)
> pop3 <- rnorm(10^3, mean = 76, sd = 65)
> pop4 <- rnorm(10^3, mean = 32, sd = 2)
> pop5 <- rnorm(10^3, mean = 324, sd = 32)
> populations <- list(p1 = pop1, p2 = pop2,
+     p3 = pop3, p4 = pop4, p5 = pop5)

Now we would like to calculate the mean and the standard deviation of each of these populations. One could use a for-loop like this:
> no.of.pops <- length(populations)
> mymeans1 <- as.list(numeric(no.of.pops))
> for (i in 1:no.of.pops) {
+     mymeans1[[i]] = mean(populations[[i]])
+ }
> mymeans1
[[1]]
[1] 14.175

[[2]]
[1] 34.123

[[3]]
[1] 73.173

[[4]]
[1] 31.99

[[5]]
[1] 326.17

Easier, more elegant, and much faster for huge datasets is the function lapply (and so is sapply — just in a more user-friendly form), which applies a given function to all elements of a list:
> mymeans2 <- lapply(populations, mean)
> mymeans3 <- sapply(populations, mean)
> mymeans2
$p1
[1] 14.175

$p2
[1] 34.123
$p3
[1] 73.173

$p4
[1] 31.99

$p5
[1] 326.17

> mymeans3
     p1      p2      p3      p4      p5
 14.175  34.123  73.173  31.990 326.174
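To get a feeling for the speed difference between an explicit loop and a vectorized call, you can time both with system.time. This is only a rough sketch; the exact figures depend on your machine and your version of R.

> x <- rnorm(10^6)
> system.time({mysum <- 0; for (v in x) mysum <- mysum + v})   # explicit loop
> system.time(sum(x))                                          # vectorized, much faster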
While those two previous commands (lapply and sapply) work element-wise on lists, apply — as ?apply tells you — returns a vector, array, or list of values obtained by applying a function to the margins of an array. What is meant by this becomes clearer with an example. Consider a matrix or a dataframe of values:
> mydataframe
       ages  iq
Marge    34 100
Homer    38  70
Bart     10  80
Lisa      8 140
Maggie    0  NA
Now you want to know the mean of the ages and of the iq - i.e. the means of each column. This is done simply by using apply which has the syntax apply(data, dimension, function):
> apply(mydataframe, 2, mean, na.rm = TRUE)
ages   iq
18.0 97.5

The value 2 has been chosen for the dimension as this indicates the columns. Remember: the first dimension in R refers to the rows, the columns come second. Individual "layers" of three-dimensional arrays are accessed via the third index. This is also reflected in apply if you, for example, want to calculate the median in each "layer" of a given data-array:
> apply(myarray, 3, median)
[1] 0.518 0.784 0.505 0.428
7.6  Further Reading

• Venables and Ripley (2000)

7.7  Exercises
Chapter 8

Graphing Data

8.1  Introduction
As pointed out in the introduction, R has graphical capabilities which are hard to match by any statistical package/language. The following few sections will introduce you to some of the most important graphing commands. For that purpose, we use the data-set which contains the ages at death of the Swedish birth cohort from 1890.
> sweden1890 <- as.data.frame(read.table("Sweden.cohort1890.txt",
+     sep = " ", header = TRUE))
8.2  Basic plotting function: plot

The simplest plotting command has already been introduced. For a point-plot, just write:
> plot(sweden1890$no.females)

The following code and its representation in Figure 8.1 show you how you can develop a plot piece by piece. For that purpose, the new function par is introduced, which sets graphical parameters. In our case we inform R that we want to construct a multi-panel graph consisting of 2 rows and 2 columns. If you want more information, check the respective help pages like ?par, ?plot, . . .
> par(mfrow = c(2, 2))
> plot(sweden1890$no.females)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "l", col = "red")
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "l", col = "red", lty = 1, lwd = 2)
> lines(x = sweden1890$ageatdeath, y = sweden1890$no.males,
+     col = "blue", lty = 2, lwd = 1)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "l", col = "red", lty = 1, lwd = 2,
+     xlim = c(0, 110), ylim = c(0, 8000),
+     xlab = "Age at Death", ylab = "No. of Deaths", axes = FALSE)
> lines(x = sweden1890$ageatdeath, y = sweden1890$no.males,
+     col = "blue", lty = 2, lwd = 1)
> axis(side = 1, at = seq(from = 0, to = 110, by = 10),
+     labels = TRUE, tick = TRUE)
> axis(side = 2, at = seq(from = 0, to = 8000, by = 2000),
+     labels = TRUE, tick = TRUE)
> legend(x = 30, y = 6000, legend = c("Women", "Men"),
+     col = c("red", "blue"), lty = c(1, 2), lwd = c(2, 1))
Figure 8.1: Constructing a Plot Piece By Piece (four panels; the final panel has x-axis "Age at Death", y-axis "No. of Deaths", and a legend for Women and Men)
8.3  Histograms — hist
Histograms are a useful tool to display the distribution of univariate data. The easiest way to produce one is the hist function. The following code and its result in Figure 8.2 on page 81 show its usage.
> par(mfrow = c(1, 2))
> swedish.women <- rep(sweden1890$ageatdeath,
+     sweden1890$no.females)
> hist(swedish.women, breaks = 110, xlim = c(0, 110),
+     ylim = c(0, 8000))
> swedish.men <- rep(sweden1890$ageatdeath,
+     sweden1890$no.males)
> hist(swedish.men, breaks = 110, xlim = c(0, 110),
+     ylim = c(0, 8000))
Figure 8.2: How to plot a histogram (left panel: Histogram of swedish.women; right panel: Histogram of swedish.men; y-axis: Frequency)
You can also make histograms using the argument type="h" for the command plot. Try it for yourself!
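A minimal sketch of what this could look like with the Swedish cohort data used above (the axis labels are just a suggestion):

> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "h", xlab = "Age at Death", ylab = "No. of Deaths")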
8.4  Barplot - barplot
A useful tool for the visual display of data is the barplot. Look at Figure 8.3 to see a stacked barplot as you may know it from Excel. The plot has been produced using the following code:1
> data(UKLungDeaths)
> uklung <- aggregate(ts.union(mdeaths, fdeaths), 1)
> barplot(t(uklung), names = 1974:1979, col = c("blue", "red"))
> legend(x = 3, y = 15000, legend = c("Women", "Men"),
+     bg = "white", fill = c("red", "blue"))
> title(main = "Death from Lung Disease in the UK")

1 Example taken from Venables and Ripley (1999), slightly modified.
Figure 8.3: Barplot for Women and Men Dying From Lung Diseases in the UK (stacked bars for the years 1974-1979; title: "Death from Lung Disease in the UK"; legend: Women, Men)
8.5  Making boxplots — boxplot
Univariate data can be plotted by using a histogram. It does not give, however, much information about the descriptive statistics of the underlying data. One way to obtain these data is to use the command summary.
> swedish.men <- rep(sweden1890$ageatdeath,
+     sweden1890$no.males)
> summary(swedish.men)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    16.0    65.0    50.8    78.0   109.0

> summary(swedish.women)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    25.0    70.0    55.6    82.0   107.0
Using boxplots facilitates this approach considerably.
> par(mfrow = c(1, 2))
> data(ChickWeight)
> boxplot(ChickWeight$weight, xlab = "Chicken Weights",
+     ylab = "Weight in Grams")
> boxplot(list(Women = swedish.women, Men = swedish.men),
+     ylab = "Age at Death")
Figure 8.4 on page 85 shows the standard boxplots as R plots them. The boxplot of the chicken weights has been included because the Swedish cohort data have no outliers. The line in the middle of each box shows the median of the respective dataset. The box around this median is bounded by the 25% quantile at the bottom and by the 75% quantile at the top. The distance between the 25% quantile and the 75% quantile is called the interquartile range (IQR). The ends of the whiskers indicate the 75% quantile + 1.5 × IQR and the 25% quantile − 1.5 × IQR. If values fall outside the range of the whiskers, each such value is plotted as a small circle, as we can see in the left panel of Figure 8.4.
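A quick way to see where those whisker limits would lie for the Swedish men, for instance, is to compute them from the quantiles directly (just a small sketch to connect the numbers with the plot):

> q <- quantile(swedish.men)
> iqr <- q[[4]] - q[[2]]                      # interquartile range (75% minus 25% quantile)
> c(lower = q[[2]] - 1.5 * iqr, upper = q[[4]] + 1.5 * iqr)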
Figure 8.4: Constructing a Boxplot for Weights of Chickens (left, in grams) and Ages at Death of the Swedish Birth Cohort from 1890 (right, Women and Men)
8.6  QQ-Plots — qqplot / qqnorm

The last kind of plot which will be introduced here is the so-called QQ-plot. It is a useful tool for detecting irregularities in your data. QQ-plots are usually employed in two different situations:

• you want to check whether two empirical distributions are identical;
• you want to check whether one empirical distribution follows a certain theoretical distribution (e.g. the normal distribution).

A QQ-plot sorts the given data and produces a plot in which the quantiles from one data-set are plotted either against those of the other data-set or against the "theoretical" distribution. This helps a great deal in detecting irregularities in your data, especially if we include an extra line via the abline command with intercept 0 and slope 1. If one distribution matched the other, both would lie exactly on the added line. Figure 8.5 will help to understand the usefulness of these plots.
> par(mfrow = c(2, 1))
> qqplot(swedish.women, swedish.men)
> abline(0, 1, col = "red", lwd = 3)
> womnew <- quantile(swedish.women, probs = seq(from = 0,
+     to = 1, length = 500))
> mennew = quantile(swedish.men, probs = seq(from = 0,
+     to = 1, length = 500))
> plot(womnew, mennew)
> abline(0, 1, col = "red", lwd = 3)
As you can see from the second part of the code, it is not so difficult to construct such plots also by yourself — simply using some data transformations and then the basic plot-command.
8.7  Further Plotting Commands

The following code-pieces and graphs may repeat several things you already know by now. The best approach is to look at the graphs, and if you want to find out how such a plot can be constructed, just check the corresponding code. Besides the already known par command, which sets the graphical parameters in general, you can also use the less frequently used split.screen/screen functions to partition your screen.
Figure 8.5: An Example of a QQ-Plot (upper panel: swedish.women vs. swedish.men; lower panel: womnew vs. mennew)
> split.screen(c(1, 2))
[1] 1 2
> split.screen(c(2, 1), screen = 2)
[1] 3 4
> screen(1)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "h", xlim = c(0, 110), ylim = c(0, 6000))
> text(x = 20, y = 5000, labels = "a)")
> screen(3)
> barplot(sweden1890$no.females, names.arg = sweden1890$ageatdeath,
+     xlim = c(0, 110), ylim = c(0, 6000))
> title(main = "Women")
> text(x = 20, y = 5000, labels = "b)")
> screen(4)
> barplot(sweden1890$no.males, names.arg = sweden1890$ageatdeath,
+     xlim = c(0, 110), ylim = c(0, 6000))
> title(main = "Men")
> text(x = 20, y = 5000, labels = "c)")
> close.screen(all = TRUE)
Figure 8.6: Displaying Data in a Histogram-like Plot (panel a: plot of sweden1890$no.females against age at death; panels b and c: barplots titled "Women" and "Men")
While histograms might be appropriate to detect oddities like extreme outliers, they are less useful in other respects. Panels b) and c) of Figure 8.6 may suggest that the distribution of age at death among Swedish women and men born in 1890 is very similar. Another plot-type, the so-called boxplot, can usually shed further light on such an assessment.
> women <- rep(sweden1890$ageatdeath, sweden1890$no.females)
> men <- rep(sweden1890$ageatdeath, sweden1890$no.males)
> boxplot(list(Women = women, Men = men), range = 0)
Figure 8.7: Boxplots (Women and Men; y-axis from 0 to 100)
Figure 8.8 shows how you can relate histograms and boxplots.
> split.screen(c(2, 1))
[1] 1 2
> split.screen(c(1, 2), screen = 1)
[1] 3 4
> split.screen(c(1, 2), screen = 2)
[1] 5 6
> screen(3)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.females,
+     type = "h", xlim = c(0, 110), ylim = c(0, 6000),
+     xlab = "Age at Death", ylab = "No. of Deaths")
> title("Women, Histogram")
> abline(v = (quantile(women))[[2]], col = "green")
> text(x = 35, y = 2000, labels = "25% Quantile",
+     adj = 0, col = "green")
> abline(v = (quantile(women))[[4]], col = "blue")
> text(x = 35, y = 5000, labels = "75% Quantile",
+     adj = 0, col = "blue")
> abline(v = (quantile(women))[[3]], col = "red")
> text(x = 35, y = 3500, labels = "Median",
+     adj = 0, col = "red")
> screen(5)
> plot(x = sweden1890$ageatdeath, y = sweden1890$no.males,
+     type = "h", xlim = c(0, 110), ylim = c(0, 6000),
+     xlab = "Age at Death", ylab = "No. of Deaths")
> abline(v = (quantile(men))[[2]], col = "green")
> text(x = 35, y = 2000, labels = "25% Quantile",
+     adj = 0, col = "green")
> abline(v = (quantile(men))[[4]], col = "blue")
> text(x = 35, y = 5000, labels = "75% Quantile",
+     adj = 0, col = "blue")
> abline(v = (quantile(men))[[3]], col = "red")
> text(x = 35, y = 3500, labels = "Median",
+     adj = 0, col = "red")
> title("Men, Histogram")
> screen(4)
> women <- rep(sweden1890$ageatdeath, sweden1890$no.females)
> boxplot(women, range = 0, ylim = c(0, 110),
+     ylab = "Age at Death")
> title("Women, Boxplot")
> screen(6)
> men <- rep(sweden1890$ageatdeath, sweden1890$no.males)
> boxplot(men, range = 0, ylim = c(0, 110),
+     ylab = "Age at Death")
> title("Men, Boxplot")
> screen(1)
> plot(1:10, ylim = c(0, 110), xlab = "", ylab = "",
+     axes = FALSE, type = "n")
> abline(h = (quantile(women))[[3]], col = "red")
> abline(h = (quantile(women))[[2]], col = "green")
> abline(h = (quantile(women))[[4]], col = "blue")
> screen(2)
> plot(1:10, ylim = c(0, 110), xlab = "", ylab = "",
+     axes = FALSE, type = "n")
> abline(h = (quantile(men))[[3]], col = "red")
> abline(h = (quantile(men))[[2]], col = "green")
> abline(h = (quantile(men))[[4]], col = "blue")
> close.screen(all = TRUE)
This code looks fairly complicated. But if you have a closer look, you will see that the building blocks are mostly already known to you.
Figure 8.8: Relating Histograms with Boxplots (top: "Women, Histogram" and "Women, Boxplot"; bottom: "Men, Histogram" and "Men, Boxplot"; the 25% quantile, median, and 75% quantile are marked in each histogram)

8.8  Further Reading

8.9  Exercises
Chapter 9

Simple Statistical Models

9.1  Introduction
This chapter is going to introduce the basics of how to use R to estimate statistical models. Our course focusses on Survival Analysis. Nevertheless, I want to introduce the basic concepts of regression modeling using simple linear models. For that purpose we use the data-set Davis in the package car, which contains information on the weight and height of 200 persons, broken down by sex.1
> library(car)
> data(Davis)
> names(Davis)
[1] "sex"    "weight" "height" "repwt"  "repht"

> summary(Davis)
 sex         weight          height         repwt           repht
 F:112   Min.   : 39.0   Min.   : 57   Min.   : 41.0   Min.   :148
 M: 88   1st Qu.: 55.0   1st Qu.:164   1st Qu.: 55.0   1st Qu.:161
         Median : 63.0   Median :170   Median : 63.0   Median :168
         Mean   : 65.8   Mean   :170   Mean   : 65.6   Mean   :168
         3rd Qu.: 74.0   3rd Qu.:177   3rd Qu.: 73.5   3rd Qu.:175
         Max.   :166.0   Max.   :197   Max.   :124.0   Max.   :200
                                       NA's   : 17     NA's   : 17

> Davis2 <- subset(Davis, Davis$height/Davis$weight > 0.4)
> Daviswomen <- subset(Davis2, Davis$sex == "F")
> Davismen <- subset(Davis2, Davis$sex == "M")

1 The variables repwt and repht on self-reported weight and self-reported height are not used here.

In the line Davis2 <- ..., we removed only one case: a person recorded with a height of 57 cm and a weight of 166 kg, most probably an error during data entry.
9.2  Plotting the Data
We want to analyze whether there is a linear relationship which allows us to predict a person's weight from his/her height. Even though we are planning to do some regression analysis, it is always recommended to plot your data first. We use different colors for women and men, as it is possible that the relationship differs between the sexes.
> plot(x = Davismen$height, y = Davismen$weight,
+     col = "blue", xlim = c(150, 200), ylim = c(40, 120),
+     xlab = "Height", ylab = "Weight")
> points(x = Daviswomen$height, y = Daviswomen$weight,
+     col = "red")
> legend(x = 150, y = 120, legend = c("Women", "Men"),
+     pch = 1, col = c("red", "blue"))

The point-plot in Figure 9.1 on page 97 suggests two things: (1) overall, the linearity assumption seems to be okay; (2) this linear relationship is probably the same for women and for men.
Figure 9.1: Is there a linear relationship between height and weight? A first glance at the data (x-axis: Height, 150-200; y-axis: Weight, 40-120; legend: Women, Men)
9.3  Estimating the Linear Model and Accessing the Components of the Result
R allows a relatively intuitive way to write down the formula of the equation to be estimated. In our case, we would like to predict the weight of a person by his/her height. We express this in R as follows (lm stands for linear model):
> mylinmod <- lm(formula = Davis2$weight ~ Davis2$height,
+     data = Davis2)

The results from the estimation can be accessed via the well-known function names. Not surprisingly, summary gives a summary of the most important values, and coef allows you to extract the regression coefficients.
> names(mylinmod)
 [1] "coefficients"  "residuals"     "effects"
 [4] "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"
[10] "call"          "terms"         "model"

> summary(mylinmod)

Call:
lm(formula = Davis2$weight ~ Davis2$height, data = Davis2)

Residuals:
    Min      1Q  Median      3Q     Max
-19.650  -5.419  -0.576   4.857  42.887

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -130.7470    11.5627   -11.3   <2e-16 ***
Davis2$height    1.1492     0.0677    17.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.52 on 197 degrees of freedom
Multiple R-Squared: 0.594,     Adjusted R-squared: 0.592
F-statistic:  288 on 1 and 197 DF,  p-value: <2e-16
> coef(mylinmod)
  (Intercept) Davis2$height
      -130.75          1.15

This informs us that the adjusted R-squared is 0.592 and that, on average, weight increases by roughly 1.15 kg for each additional centimeter of height. The height covariate is highly significant (p < 2 × 10^-16).
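As a quick sanity check of what these coefficients imply, you could compute the predicted weight for a hypothetical height of 170 cm directly from the extracted coefficients (a sketch; for real predictions, a model fitted with a proper data argument and the predict function is the more general route):

> coef(mylinmod)[[1]] + coef(mylinmod)[[2]] * 170    # roughly -130.75 + 1.15 * 170, i.e. about 64.6 kg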
9.4  Regression Diagnostics

Now, let's run several (graphical) diagnostics to check whether our model also meets its requirements. We do this in a 2 × 2 panel as shown in Figure 9.2 on page 101. The upper left graph has been produced by the following code:
> par(mfrow = c(2, 2))
> plot(x = Davismen$height, y = Davismen$weight,
+     col = "blue", xlim = c(150, 200), ylim = c(40, 120),
+     xlab = "Height", ylab = "Weight")
> points(x = Daviswomen$height, y = Daviswomen$weight,
+     col = "red")
> legend(x = 150, y = 120, legend = c("Women", "Men"),
+     pch = 1, col = c("red", "blue"))
> abline(mylinmod)
The function abline knows which components have to be extracted from the fitted model mylinmod to plot the fitted line correctly (coef(mylinmod)[[1]] and coef(mylinmod)[[2]]). Now we also add fitted lines for women and men separately to check whether we would have obtained more or less the same parameters if we had estimated two separate models.
> mylinmod.women <- lm(formula = Daviswomen$weight ~
+     Daviswomen$height)
> mylinmod.men <- lm(formula = Davismen$weight ~
+     Davismen$height)
> abline(mylinmod.women, col = "red")
> abline(mylinmod.men, col = "blue")
Our plot suggests that it was okay to estimate the model for women and men at the same time. The plot in the upper right corner checks whether the mean of the residuals is zero, as required by the linear model.
> plot(fitted(mylinmod), resid(mylinmod))
> lines(lowess(fitted(mylinmod), resid(mylinmod)),
+     col = "red", lwd = 3)
> abline(h = 0, col = "blue", lwd = 3)

For that purpose, we plot the fitted values against the obtained residuals. We impose a scatterplot smoother via the lowess function and compare it with a reference line drawn at zero. Despite small deviations, this plot gives no indication that our linear model is inappropriate. The remaining two panels check graphically whether our assumption that the data are drawn from a normal distribution is met. The lower left panel displays a histogram of the residuals.
> hist(resid(mylinmod), xlim = c(-40, 40), breaks = 25)

This diagnostic, too, gives a graphical result which resembles a normal distribution relatively well. The last plot we show to check the fit of the model is the QQ-plot. Remember that you plot quantiles of your (empirical) data against theoretical quantiles.
> qqnorm(resid(mylinmod))
> qqline(resid(mylinmod))

First, we plot the residuals of our model against the theoretical quantiles of the normal distribution. To facilitate the checking, we add a qqline. Looking at this plot in the lower right corner, we can state that the residuals follow a normal distribution remarkably well and, thus, our assumptions in the linear model have not been violated.
9.5  Technical Digression: using matrix language to estimate the coefficients

This is just a short digression which should show how you can estimate a linear model "yourself" using the built-in matrix language. Assume a linear model which can be denoted as y = Xβ + ε. In our case, the design matrix X consists of only one vector, the height of a person. The response vector y is the weight of the person. As is well known from many textbooks, you can solve for β using the equation β = (X'X)^(-1) X'y.
Figure 9.2: Regression Diagnostics (upper left: Height vs. Weight with fitted lines and a Women/Men legend; upper right: fitted(mylinmod) vs. resid(mylinmod); lower left: Histogram of resid(mylinmod); lower right: Normal Q-Q Plot of the residuals)
How can we do this using the matrix language in R? It is relatively simple, as you can see here (we only need to add a column of ones to the design matrix, since lm also estimated a model with an intercept).
> options(digits = 12)
> x = Davis2$height
> xmatrix <- as.matrix(cbind(rep(1, length(x)), x))
> y = Davis2$weight
> coef(lm(y ~ x))
    (Intercept)               x
-130.74698436261   1.14922231385

> coefficients(lsfit(x = x, y = y))
       Intercept               X
-130.74698436261   1.14922231385

> solve(t(xmatrix) %*% xmatrix) %*% t(xmatrix) %*% y
              [,1]
  -130.74698436262
x    1.14922231385
9.6  Further Reading

9.7  Exercises
Chapter 10

More Useful Functions

10.1  Crosstabulations using table
If you would like to do cross-tabulations like crosstabs in SPSS or PROC MEANS ... in SAS, you simply use the command table in R. A relatively stupid but illuminating example would be a table of mydataframe.
> mydataframe
       ages  iq
Marge    34 100
Homer    38  70
Bart     10  80
Lisa      8 140
Maggie    0  NA

> table(mydataframe)
     iq
ages  70 80 100 140
  0    0  0   0   0
  8    0  0   0   1
  10   0  1   0   0
  34   0  0   1   0
  38   1  0   0   0
If you want to have the relative frequencies of those entries, simply do:
> table(mydataframe)/(sum(table(mydataframe)))
     iq
ages    70   80  100  140
  0   0.00 0.00 0.00 0.00
  8   0.00 0.00 0.00 0.25
  10  0.00 0.25 0.00 0.00
  34  0.00 0.00 0.25 0.00
  38  0.25 0.00 0.00 0.00
This command is also relatively useful to convert raw data into aggregated form. You remember how we expanded the data-set of Swedish women into individuals? We did it like this:
> swedish.women <- rep(sweden1890$ageatdeath,
+     sweden1890$no.females)

To reverse this process we do:
> new.swedish.women <- table(swedish.women)
10.2  How many different values exist? — unique

If you want to know which values a certain variable takes on, you can use the function unique, which returns the unique values of its argument. Duplicates are automatically removed.
> unique(swedish.women)
  [1]   0   1   2   3   4   5   6   7   8   9  10  11
 [13]  12  13  14  15  16  17  18  19  20  21  22  23
 [25]  24  25  26  27  28  29  30  31  32  33  34  35
 [37]  36  37  38  39  40  41  42  43  44  45  46  47
 [49]  48  49  50  51  52  53  54  55  56  57  58  59
 [61]  60  61  62  63  64  65  66  67  68  69  70  71
 [73]  72  73  74  75  76  77  78  79  80  81  82  83
 [85]  84  85  86  87  88  89  90  91  92  93  94  95
 [97]  96  97  98  99 100 101 102 103 104 105 106 107
This is much faster than the equivalent:
> as.numeric(as.character(levels(as.factor(swedish.women))))
  [1]   0   1   2   3   4   5   6   7   8   9  10  11
 [13]  12  13  14  15  16  17  18  19  20  21  22  23
 [25]  24  25  26  27  28  29  30  31  32  33  34  35
 [37]  36  37  38  39  40  41  42  43  44  45  46  47
 [49]  48  49  50  51  52  53  54  55  56  57  58  59
 [61]  60  61  62  63  64  65  66  67  68  69  70  71
 [73]  72  73  74  75  76  77  78  79  80  81  82  83
 [85]  84  85  86  87  88  89  90  91  92  93  94  95
 [97]  96  97  98  99 100 101 102 103 104 105 106 107

10.3  Splitting Data — split and cut
Sometimes you want to split your data into distinct groups, for example women and men. If you would like to obtain the weight from the Davis data-set separately for women and men, you simply do:
> sex.separated.weight <- (split(Davis$weight,
+     Davis$sex))

A similar function is cut(x, breaks). It assigns each value in x a factor level specified by breaks. To put the values into bins, you then have to apply the table command. Let's show this by putting 10^5 uniformly distributed random numbers into 10 bins of the same size. More or less, we should obtain the same number of elements in each bin.
> bindata <- runif(10^5)
> bins1 <- table(cut(x = bindata, breaks = seq(from = 0,
+     to = 1, length = 11)))
> bins1
  (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5]
    10033      9962     10058     10051     10047
(0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1]
    10027      9949     10061      9904      9908

For exactly this operation, it is recommended to use the hist function instead, since it is less memory-hungry (as written in the help page of ?cut). As you can see, the result is exactly the same.
> bins2 <- (hist(x = bindata, breaks = seq(from = 0,
+     to = 1, length = 11), plot = FALSE))$counts
> bins2
 [1] 10033  9962 10058 10051 10047 10027  9949 10061
 [9]  9904  9908
10.4  Sorting Data — sort and order
Sorting is somewhat tricky in R, so I want to show how you can sort data, separately for single vectors and for dataframes.
10.4.1  Sorting by One Variable

Sorting Vectors

If you want to sort a vector, you simply have to give the command sort. The default setting is ascending. For descending sorting, you have to give the argument decreasing=TRUE.
> ages
[1] 34 38 10  8  0

> sort(ages)
[1]  0  8 10 34 38

> sort(ages, decreasing = TRUE)
[1] 38 34 10  8  0
Sorting Dataframes

Sorting dataframes according to one variable is still relatively easy.
> mydataframe[order(mydataframe$ages), ]
       ages  iq
Maggie    0  NA
Lisa      8 140
Bart     10  80
Marge    34 100
Homer    38  70
10.4.2  Sorting by More Than One Variable
Sometimes it is possible to calculate a single index variable made up of several other variables; then sorting is still not too difficult. See the following examples.
> years <- sample(1980:1989, size = 10, replace = FALSE)
> months <- sample(1:12, size = 10, replace = TRUE)
> gdp <- sample(1:1000, size = 10)
> mysortindex <- years + ((months - 1)/12)
> gdpframe <- as.data.frame(cbind(years, months,
+     gdp, mysortindex))
> gdpframe
   years months gdp   mysortindex
1   1982     12 697 1982.91666667
2   1980      5 498 1980.33333333
3   1987      4 327 1987.25000000
4   1983     12 802 1983.91666667
5   1984     12 781 1984.91666667
6   1985      8 896 1985.58333333
7   1989      4 451 1989.25000000
8   1986     10 719 1986.75000000
9   1988      9 576 1988.66666667
10  1981     10 151 1981.75000000

> gdpframe[order(gdpframe$mysortindex), ]
   years months gdp   mysortindex
2   1980      5 498 1980.33333333
10  1981     10 151 1981.75000000
1   1982     12 697 1982.91666667
4   1983     12 802 1983.91666667
5   1984     12 781 1984.91666667
6   1985      8 896 1985.58333333
8   1986     10 719 1986.75000000
3   1987      4 327 1987.25000000
9   1988      9 576 1988.66666667
7   1989      4 451 1989.25000000
If you have more than one variable after which the data (vector or dataframe) should be ordered, you have to proceed as follows:
> newweight <- sample(x = c(80, 100, 120), size = 10,
+     replace = TRUE)
> newheight <- sample(x = 150:200, size = 10,
+     replace = TRUE)
> newsex <- sample(x = c(1, 2), size = 10, replace = TRUE)
> newdataframe <- as.data.frame(cbind(newheight,
+     newweight, newsex))
> newdataframe
   newheight newweight newsex
1        154        80      2
2        198       100      2
3        151       120      2
4        154       120      2
5        197       120      2
6        188       120      1
7        191       120      1
8        199       120      1
9        154       100      1
10       174       120      2

> newdataframe[order(newdataframe$newsex, newdataframe$newweight), ]
   newheight newweight newsex
9        154       100      1
6        188       120      1
7        191       120      1
8        199       120      1
1        154        80      2
2        198       100      2
3        151       120      2
4        154       120      2
5        197       120      2
10       174       120      2
Chapter 11

Further Reading

11.1  Background on R

• Ihaka and Gentleman (1996)
• R Development Core Team (2003)

11.2  Introductions

• Venables and Ripley (1999)
• Dalgaard (2002)
• Krause and Olson (2000)

11.3  Graphics

11.4  Programming in R/S

11.5  ...
Part II

Using R for Survival Analysis
Bibliography

Abramowitz, M. and I. Stegun (1972). Handbook of Mathematical Functions (10th printing ed.). Applied Mathematics Series 55. Washington, D.C.: National Bureau of Standards.

Casella, G. and R. L. Berger (1990). Statistical Inference. Belmont, CA: Duxbury Press.

Dagpunar, J. (1988). Principles of Random Variate Generation. Oxford, UK: Clarendon Press.

Dalgaard, P. (2002). Introductory Statistics with R. Statistics and Computing. New York, NY: Springer.

Ihaka, R. and R. Gentleman (1996). R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5(3), 299-314.

Kernighan, B. W. and D. M. Ritchie (2000). The C Programming Language (2nd ed.). Prentice Hall Software Series. Englewood Cliffs, NJ: Prentice Hall.

Krause, A. and M. Olson (2000). The Basics of S and S-PLUS. Statistics and Computing. New York, NY: Springer.

R Development Core Team (2003). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-00-3.

Venables, W. and B. Ripley (1999). Modern Applied Statistics with S-PLUS (3rd ed.). New York, NY: Springer.

Venables, W. and B. Ripley (2000). S Programming. Statistics and Computing. New York, NY: Springer.
Vogel, F. (1995). Beschreibende und schließende Statistik. Formeln, Definitionen, Erläuterungen, Stichwörter und Tabellen (8th ed.). München, D: Oldenbourg.