1 Data and R code
This handout shows and discusses several pieces of R code. The script R_intro.R includes all the code shown in these pages. You can access and run the code by opening R_intro.R in your R GUI (e.g. RStudio).

This code uses real-world data collected in 2012 with a personal network survey among 107 Sri Lankan immigrants in Milan, Italy. Out of the 107 respondents, 102 reported their personal network. The data files include ego-level data (sex, age, educational level, etc. for each Sri Lankan respondent), alter attributes (alter’s nationality, country of residence, emotional closeness to ego, etc.), and information on alter-alter ties in the form of adjacency matrices. Each personal network has a fixed size of 45 alters.

The relevant data files are all located in the Data/ folder that you downloaded. The data files used in this document (and in R_intro.R) are the following:
• ego_data.csv: A single csv file with ego attributes for all the egos.
• adj_28.csv: The adjacency matrix for ego ID 28’s personal network.
• elist_28.csv: The edge list for ego ID 28’s personal network.
• alter.data_28.csv: The alter attributes in ego ID 28’s personal network.

Information about data variables and categories is available in Data/Codebook.xlsx.

NOTE: Before running the R code in R_intro.R, you should make sure that the Data/ folder with all the data files listed above is in your current working directory (use getwd() and setwd() to check/set your working directory; see examples below). Also make sure to run the code in R_intro.R line by line: if you skip one or more lines, the following lines may return errors.
2 Starting R
Upon opening R, you normally want to do two things:
• Check and/or set your current working directory. By default, R will look for files, and save new files, in this directory.
  – To set your working directory to “/Users/John/Documents/Rworkshop”, just run setwd("/Users/John/Documents/Rworkshop").
  – Windows users: Note that you have to input the directory path with forward slashes (or double backslashes), not with single backslashes as in a typical Windows path. I.e. setwd("C:/Users/John/Documents/Rworkshop") or setwd("C:\\Users\\John\\Documents\\Rworkshop") will work; setwd("C:\Users\Mario\Documents\Rworkshop") won’t work.
• Load all the packages you need to use in the current session. There are two steps to using a package in R:
  1. Installing the package. You do this just once. Use install.packages("package_name") or the appropriate menu in your R GUI (e.g. in RStudio: Tools > Install Packages). Once you install a package, the package files are in your system R folder and R will always be able to find the package there.
  2. Loading the package in your current session. You do this in each R session in which you need the package, i.e., every time you start R and you need the package. Use library(package_name) (no quotation marks are needed around the package name, although library("package_name") also works).
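The path and working-directory rules above can be tried without touching any real folder. The following is only a sketch: it uses R's temporary directory (tempdir()) as a stand-in for the workshop folder, and the folder name Rworkshop_demo is hypothetical.

```r
# In an R string, "\\" stands for ONE backslash character.
win_path <- "C:\\Users\\John"
n_chars <- nchar(win_path)   # each "\\" is counted as a single character

# file.path() builds paths with forward slashes, which work on all platforms.
demo_dir <- file.path(tempdir(), "Rworkshop_demo")   # hypothetical folder name
dir.create(demo_dir, showWarnings = FALSE)

# Save the current working directory, switch to the demo folder, then go back.
old_wd <- getwd()
setwd(demo_dir)
current <- basename(getwd())   # name of the directory we are in now
setwd(old_wd)

# Loading an installed package: "stats" ships with every R installation,
# so no install.packages() step is needed for it.
library(stats)
```

Saving the old working directory before calling setwd() makes it easy to return to where you started.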
2.1 Console vs scripts
• When you open your R GUI, you typically see two separate windows: the script editor and the console. You can write R code in either of them.
3.2 Vector and matrix objects
• Vectors are the most basic objects you use in R. Vectors can be numeric (numerical data), logical (TRUE/FALSE data), or character (string data).
• The basic function to create a vector is c() (concatenate).
• Other useful functions to create vectors: rep() and seq(). Also keep in mind the : shortcut: c(1, 2, 3, 4) is the same as 1:4.
• The length (number of elements) is a basic property of vectors (length()).
• When we print() vectors, the numbers in square brackets indicate the positions of vector elements.
• To create a matrix: matrix(). Its main arguments are: the cell values (within c()), number of rows (nrow) and number of columns (ncol). Values are arranged in a nrow x ncol matrix by column. See ?matrix.
• When we print() matrices, the numbers in square brackets indicate the row and column numbers.
# Let's create a simple vector.
x <- c(1, 2, 3, 4)

# Display it.
x
## [1] 1 2 3 4

# Shortcut for the same thing.
y <- 1:4
y
## [1] 1 2 3 4

# What's the length of x?
length(x)
## [1] 4

# The function rep() replicates values into a vector.
rep(1, times= 10)
##  [1] 1 1 1 1 1 1 1 1 1 1

# (NOTE that we didn't assign the vector above, it was just printed and lost).

# Also vectors themselves can be repeated.
x
## [1] 1 2 3 4
rep(x, times= 10)
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
## [36] 4 1 2 3 4
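The code above covers c(), the : shortcut, length() and rep(). As a complement, here is a small sketch of seq() and matrix(), the other functions mentioned in this section. The object names (s, m_col, m_row) are made up for illustration.

```r
# seq() builds regular sequences.
s <- seq(from = 0, to = 10, by = 2)
s
length(s)

# rep() can also repeat each element in turn, rather than the whole vector.
rep(c("a", "b"), each = 2)

# matrix() fills the cells column by column, unless byrow = TRUE is set.
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_col
m_row

# dim() gives the number of rows and columns.
dim(m_col)
```

Printing m_col and m_row side by side shows the difference between filling by column (the default) and filling by row.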
– [1 2 3 4] + [1 2] = [1+1 2+2 3+1 4+2] ([1 2] is recycled once.)
– [1 2 3 4] + [1 2 3] = [1+1 2+2 3+3 4+1] ([1 2 3] is recycled one third of a time: R will warn that the length of the longer vector is not a multiple of the length of the shorter vector.)
4.2 Comparisons and logical operations
• Comparisons: >, <, <=, >=.
• Equal is == (NOT =). Not equal is !=.
  – Note: equal is ==, whereas = has a different meaning. = is used to assign function arguments (e.g. matrix(x, nrow = 3, ncol = 4)), or to assign objects (x <- 2 is the same as x = 2).
• Comparisons result in logical vectors: TRUE/FALSE.
• Like arithmetic operations, comparisons are performed element-wise on vectors, and recycling applies.
• Logical operators: & for AND, | for OR.
• Negation (i.e. opposite) of a logical vector: !.
• Is value x in vector y? x %in% y.
# Just a few arithmetic operations between vectors to demonstrate element-wise
# calculations and the recycling rule.

# [1 2 3 4] + [1 2 3 4]
v1 <- 1:4
v2 <- 1:4
v1 + v2
## [1] 2 4 6 8

# [1 2 3 4] + 1
v1 <- 1:4
v2 <- 1
v1 + v2
## [1] 2 3 4 5

# [1 2 3 4] + [1 2]
v1 <- 1:4
v2 <- 1:2
v1 + v2
## [1] 2 4 4 6

# [1 2 3 4] + [1 2 3]
v1 <- 1:4
v2 <- 1:3
v1 + v2
## Warning in v1 + v2: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5
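The same element-wise and recycling behavior applies to comparisons and logical operations. The sketch below uses a made-up vector x (and made-up names big, even, recycled) to illustrate the operators listed in this section.

```r
# A made-up numeric vector.
x <- c(1, 5, 3, 8)

# Comparisons are element-wise and return logical vectors.
big  <- x > 3
even <- x %% 2 == 0      # == tests equality; = would be an assignment

# Logical operators, also element-wise.
both   <- big & even     # AND
either <- big | even     # OR
small  <- !big           # negation

# Is a value in a vector?
has_three <- 3 %in% x
found     <- c(2, 8) %in% x

# Recycling applies to comparisons too: c(2, 6) is reused as 2, 6, 2, 6.
recycled <- x > c(2, 6)
```

Each result is a logical vector with one TRUE/FALSE per element compared.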
# What kind of variables are those?
str(ego.df)
## 'data.frame':    107 obs. of  6 variables:
##  $ ego_ID: int  28 29 32 33 35 36 39 40 45 46 ...
##  $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age   : int  61 38 43 30 25 63 29 56 52 35 ...
##  $ arr_it: int  2008 2000 2002 2010 2009 1990 2007 2008 1975 2002 ...
##  $ educ  : int  2 1 1 2 3 2 2 3 1 3 ...
##  $ income: int  350 900 800 200 1000 1100 0 950 1600 1200 ...
Indexing
• Indexing is crucial in R. Indexing means appending an index to an object to extract one (or more) of its elements. Indexing is also called subscripting.
• Indexing can be used to extract (view, query) the element of an object, or to replace it (assign a different value to that element).
• The basic notation for indexing in R is [ ]: x[i] gives you the i-th element of object x.
• Numeric indexing uses integers in square brackets [ ]: e.g. x[3]. Note that you can use negative integers to index (select) everything but that element: e.g. x[-3], x[c(-2,-4)].
• Logical indexing uses logical vectors in square brackets [ ]. It’s used to index objects according to a condition, e.g. to index all values in x that are greater than 3 (see example code below).
• Name indexing uses element names. Elements in a vector, and rows or columns in a matrix can have names.
  – Names can be displayed and assigned using the names() function.
• When indexing you must take into account the number of dimensions of an object. For example, vectors have 1 dimension, matrices have 2. Arrays can be defined with 3 dimensions or more (e.g. three-way tables).
  – Square brackets typically contain a slot for each dimension of the object, separated by a comma:
    ∗ x[i] indexes the i-th element of the one-dimensional object x;
    ∗ x[i,j] indexes the i,j-th element of the two-dimensional object x (e.g. x is a matrix, i refers to a row and j refers to a column);
    ∗ x[i,j,k] indexes the i,j,k-th element of the three-dimensional object x, etc.
  – Notice that a dimension’s slot may be empty: if x is a matrix, x[3,] will index the whole 3rd row of the matrix (i.e. [row 3, all columns]).
  – If x has more than one dimension (e.g. it’s a matrix), then x[3] (no comma, just one slot) is still valid, but it might give you unexpected results.
• Matrices have special functions that can be used for indexing, e.g. diag(), upper.tri(), lower.tri(). These can be useful for manipulating adjacency matrices.
• Particular indexing rules may apply to particular classes of objects (e.g. see special indexing rules for lists and data frames in next section).
# Numeric indexing
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# Vector of 10 elements.
x <- 31:40
x
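To make the indexing rules described in this section concrete, here is a short sketch on toy objects. The vector x has the same values as above; the named vector ages and the matrix m contain made-up values and are not part of the workshop data.

```r
# Toy vector (same values as x above).
x <- 31:40

# Numeric indexing.
third    <- x[3]          # the 3rd element
no_third <- x[-3]         # everything but the 3rd element

# Logical indexing: keep the elements that meet a condition.
large <- x[x > 37]

# Indexing can also replace elements (done here on a copy, y).
y <- x
y[y > 37] <- 0L

# Name indexing on a vector with made-up values.
ages <- c(mark = 25, anna = 31, theo = 40)
anna_age <- ages["anna"]

# A small matrix with row and column names.
m <- matrix(1:9, nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
cell   <- m[2, 3]           # row 2, column 3
row3   <- m[3, ]            # whole 3rd row (empty column slot)
col_b  <- m[, "b"]          # whole column named "b"
single <- m[3]              # one slot only: cells are counted column by column
upper  <- m[upper.tri(m)]   # cells above the diagonal
main_d <- diag(m)           # cells on the diagonal
```

Note how m[3] returns the third cell counting down the first column, not the third row: this is the kind of unexpected result that a single index on a matrix can produce.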
# ***** EXERCISE 2:
# Using name indexing on matrix adj, and the sum() function, calculate Mario's
# outdegree and indegree. HINT: In the sum() function, you need to set
# na.rm=TRUE. See ?sum for more details.
# *****

# ***** EXERCISE 3:
# Use is.na() and logical indexing to recode all NA values in adj to 99.
# *****
7.1 Indexing and subsetting data frames
• List notations. Data frames are a special class of lists. Just like any list, data frames can be indexed in the following 3 ways:
  1. [ ] notation, e.g. df[3] or df["variable.name"]. This returns another data frame that only includes the indexed element(s), e.g. only the 3rd element. Note:
     – This notation preserves the data.frame class: the result is still a data frame.
     – This notation can be used to index multiple elements of a data frame into a new data frame, e.g. df[c(1,3,5)] or df[c("sex", "age")].
  2. [[ ]] notation, e.g. df[[3]] or df[["variable.name"]]. This returns the specific element (column) selected, not as a data frame but as a vector with its own type and class, e.g. the numeric vector within the 3rd element of df. Note two differences from the [ ] notation:
     – [[ ]] does not preserve the data.frame class. The result is not a data frame.
     – Consistently with this, [[ ]] can only be used to index a single element (column) of the data frame, not multiple elements.
  3. The $ notation. If variable.name is the name of a specific variable (column) in df, then df$variable.name indexes that variable. This is the same as the [[ ]] notation: df$variable.name is the same as df[["variable.name"]], and it’s also the same as df[[i]] (where i is the position of the variable called variable.name in the data frame).
• Matrix notation. Data frames can also be indexed like a matrix, with the [ , ] notation:
  – df[2,3], df[2, ], df[ ,3].
  – df[,"age"], df[,c("sex", "age")], df[5,"age"]
• Keep in mind the difference between the following:
  – Extracting a data frame’s variable (column) in itself, as a vector (numeric, character, etc.): the single pepper packet by itself in the figure below (panel C). This is given by df[[i]], df[["variable.name"]], df$variable.name, df[,i], df[,"variable.name"] (with the comma).
  – Extracting another data frame of just one variable (column): the single pepper packet within the pepper shaker in the figure below (panel B). This is given by df[i], df["variable.name"].
• You can use logical indexing and the [ , ] notation to extract all the rows (cases) of a data frame that meet a given condition (e.g. all women, all individuals older than 40 years old, all individuals with an income higher than 1000, etc.). This is called subsetting a data frame.
  – You can obtain the same results with the subset() function, which is sometimes more intuitive to use.
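The notations above can be compared on a small toy data frame. This is only an illustrative sketch: the data frame df and all its values are made up, and are not taken from the workshop data.

```r
# A made-up data frame with 4 cases.
df <- data.frame(
  id     = 1:4,
  sex    = c("F", "M", "F", "M"),
  age    = c(52, 38, 45, 29),
  income = c(1200, 900, 1500, 700)
)

# [ ] returns a data frame; [[ ]] and $ return the column itself (a vector).
class_single <- class(df["age"])
class_double <- class(df[["age"]])
same_vector  <- identical(df$age, df[["age"]])

# Matrix notation with the comma also returns the column as a vector.
same_matrix <- identical(df[, "age"], df[[3]])

# [ ] can index several columns at once into a new data frame.
two_cols <- df[c("sex", "age")]

# Subsetting rows with logical indexing, and the subset() equivalent.
older  <- df[df$age > 40, ]
older2 <- subset(df, age > 40)

# Conditions can be combined with & and |.
women_high_income <- df[df$sex == "F" & df$income > 1000, ]
```

Checking class() on the result is a quick way to see whether you extracted a data frame or a plain vector.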
# Load igraph
library(igraph)

# Manually type a simple edge list as a matrix.
elist <- matrix(c("mark","paul", "mark","anna", "theo","kelsie", "mario","anna",
                  "kelsie","mario", "kelsie","anna"), nrow= 6, byrow=TRUE)
colnames(elist) <- c("FROM", "TO")
elist
##      FROM     TO
## [1,] "mark"   "paul"
## [2,] "mark"   "anna"
## [3,] "theo"   "kelsie"
## [4,] "mario"  "anna"
## [5,] "kelsie" "mario"
## [6,] "kelsie" "anna"

# Use that edge list to create a network as an igraph object.
gr <- graph_from_edgelist(elist)

# In igraph, networks are objects of class "igraph".
class(gr)
## [1] "igraph"

# Show the graph.
## Set seed to always get the same vertex layout.
set.seed(609)
## Plot
plot(gr)

[Figure: plot of the graph gr, with vertices paul, theo, mark, kelsie, anna and mario]
[Figure: network plot with vertices labelled 2801 to 2845]
# Print the graph.
gr
## IGRAPH UNW- 45 259 --
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
##  [1] 2801--2802 2801--2803 2801--2804 2801--2805 2801--2806 2801--2807
##  [7] 2801--2808 2801--2809 2801--2810 2801--2811 2801--2812 2801--2813
## [13] 2801--2814 2801--2815 2801--2818 2801--2820 2801--2823 2801--2825
## [19] 2801--2827 2801--2828 2801--2829 2801--2831 2801--2840 2801--2841
## [25] 2802--2803 2802--2804 2802--2805 2802--2806 2802--2807 2802--2808
## [31] 2802--2809 2802--2810 2802--2811 2802--2812 2802--2813 2802--2815
## [37] 2802--2823 2802--2831 2802--2840 2802--2841 2803--2804 2803--2805
## [43] 2803--2806 2803--2807 2803--2808 2803--2809 2803--2810 2803--2811
## + ... omitted several edges
# The graph is Undirected, Named, Weighted. It has 45 vertices and 259 edges. It
# has a vertex attribute called "name", and an edge attribute called "weight".

# Get the graph from an external edge list, plus a data set with vertex
# attributes.
## Read in the edge list. This is a personal network edge list.
elist <- read.csv("./Data/elist_28.csv")
head(elist)
##   from   to weight
## 1 2801 2802      1
## 2 2801 2803      1
## 3 2801 2804      1
## 4 2801 2805      1
## 5 2801 2806      1
## 6 2801 2807      1

## Read in the vertex attribute data set. This is an alter attribute data set.
vert.attr <- read.csv("./Data/Alter_attributes/alter.data_28.csv")
head(vert.attr)
# Clear the workspace from existing objects.
rm(list=ls())

# What's the current working directory?
# getwd()
# (Delete the leading "#" to actually check your current directory.)

# Change the working directory.
# setwd("my_directory")
# (Delete the leading "#" and type in your actual working directory instead of
# "my_directory": this should be the directory where you saved the "Data" folder
# you downloaded for this workshop. E.g.
# setwd("/Users/John/Documents/Rworkshop")).
# In a for loop, the index i is assigned the 1st, 2nd, ..., nth value in a
# vector, then after each assignment the loop code is run.
for (i in 1:5) {
  print(i + 1)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

# If we wanted to see what a single iteration does before running the whole
# loop.
i <- 1
print(i + 1)
## [1] 2

# Note that the index can have any name, and the vector can be any kind of
# vector.
for (letter in c("a", "b", "c")) {
  print(letter)
}
## [1] "a"
## [1] "b"
## [1] "c"
# Importing alter attributes
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# Let's use a for loop to import all the alter attribute data.

# To clarify what we are going to do, we first import alter attributes for just
# one ego, ego ID 28.
alter.attr <- read.csv("./Data/Alter_attributes/alter.data_28.csv")
3 Writing functions in R
• One of the most powerful tools in R is the ability to write your own functions.
• A function is a piece of code that operates on one or multiple arguments (the input), and returns an output (the function value in R terminology). Everything that happens in R is done by a function.
• Many R functions have default values for their arguments: if you don’t specify the argument’s value, the function will use the default.
• Once you write a function and define its arguments, you can run that function on any argument values you want, provided that the function actually works on those argument values. For example, if a function takes an igraph object as an argument, you’ll be able to run that function on any network you like, provided that the network is an igraph object. If your network is a network object (created by a statnet function), the function will likely return an error.
• Functions, combined with loops or with other methods (more on this in the following sections), are the best way to run exactly the same code on many different objects (e.g. many different ego networks).
• Functions are crucial for code reproducibility in R. If you write functions, you won’t need to re-write (copy and paste) the same code over and over again: you just write it once in the function, then run the function any time and on any arguments you need. This yields clearer, shorter and more readable code.
• New functions are also commonly used to redefine existing functions by pre-setting the value of specific arguments. For example, if you want all your plots to have red as color, you can take R’s existing plotting function plot, and wrap it in a new function that always executes plot with the argument col="red". Your function would be something like my.plot <- function(...) plot(..., col="red"). (Examples in the following sections.)
• Tips and tricks with functions:
  – stopifnot() is useful to check that function arguments are of the type that was intended by the function author. It stops the function if a certain condition is not met by a function argument (e.g. argument is not an igraph object, if the function was written for igraph objects).
  – return() allows you to explicitly set the output that the function will return (clearer code). It is also used to stop function execution earlier under certain conditions. Note: If you don’t use return(), the function value (output) is the value of the last expression evaluated in the function code.
  – if is a flow control tool that is frequently used within functions: it specifies what the function should do if a certain condition is met at one point.
  – First think particular, then generalize. When you want to write a function, it’s a good idea to first try the code on a “real”, specific existing object in your workspace. If the code does what you want on that object, you can then wrap it into a general function to be run on any similar object (see examples in the code below).
• What the following code does.
  – We write a function that performs specific calculations on the alter-attribute data frame of a given ego. We then run that function on the alter-attribute data frames of different egos.

# Any piece of code you can write and run in R, you can also put in a function.
# Let's write a trivial function that takes its argument and multiplies it by 2.
times2 <- function(x) {
  x*2
}

# Now we can run the function on any argument.
times2(x= 3)
## [1] 6
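The tips listed above (default argument values, stopifnot(), if, return(), and wrapping an existing function with pre-set arguments) can be sketched with two toy functions. Both rescale() and paste.dash() are hypothetical names written for this illustration; they are not part of R_intro.R.

```r
# Rescale a numeric vector to the interval [lower, upper].
# lower and upper have default values, so they can be omitted.
rescale <- function(x, lower = 0, upper = 1) {
  # Stop with an error if the arguments are not what the function expects.
  stopifnot(is.numeric(x), lower < upper)
  rng <- range(x, na.rm = TRUE)
  # If all values are equal, return early to avoid dividing by zero.
  if (rng[1] == rng[2]) {
    return(rep(lower, length(x)))
  }
  scaled <- (x - rng[1]) / (rng[2] - rng[1])
  # No return() here: the value of this last expression is the output.
  lower + scaled * (upper - lower)
}

r_default <- rescale(c(2, 4, 6))               # uses lower = 0, upper = 1
r_custom  <- rescale(c(2, 4, 6), upper = 10)   # overrides one default
r_flat    <- rescale(c(3, 3, 3))               # triggers the early return()

# Wrapping an existing function with a pre-set argument, in the same spirit as
# my.plot <- function(...) plot(..., col="red").
paste.dash <- function(...) paste(..., sep = "-")
label <- paste.dash("ego", 28)
```

Calling rescale("a") stops with an error from stopifnot(), which is usually easier to understand than an error raised deeper in the function code.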
  # Return the data frame
  return(freq)
}

# Now we can run exactly that code on the alter attribute data frame of any ego.
# Ego 60
table.nat(alter.attr.list[["60"]])
##   Nationality Freq
## 1  Sri Lankan   33
## 2     Italian    1
## 3       Other   11

# Ego 85
table.nat(alter.attr.list[["85"]])
##   Nationality Freq
## 1  Sri Lankan   44
## 2     Italian    1

# Ego 102
table.nat(alter.attr.list[["102"]])
##   Nationality Freq
## 1  Sri Lankan   30
## 2     Italian   15

# ***** EXERCISE:
# Write a function that takes an adjacency matrix as argument. The function
# returns the number of "probable" edges in the corresponding personal network.
# Run this function on the adjacency matrices of ego ID 47, 53 and 162. HINT:
# You should first try the code on one adjacency matrix from the list
# adjacency.list. An edge is "probable" if the corresponding cell value in the
# adjacency matrix is 2. These personal networks are undirected, so you should
# only consider the upper triangle of the matrix, see ?upper.tri. Remember that
# the ego IDs are in names(adjacency.list).
# *****
4 The apply family of functions
• The apply family of functions is a collection of R functions designed to execute the same function on multiple elements of a vector, matrix, or list.
• This is the same idea as a for loop. In fact, the apply-like functions are the “R way” to do loops. In R, apply functions are more efficient and need shorter code (which means higher readability and reproducibility). So, although they do essentially the same thing as a for loop, in R you should always prefer apply-like functions over for loops whenever that’s possible.
• We will consider 3 of these functions:
  – lapply: Takes an object x (vector or list) and a function FUN as arguments. Applies FUN to every element of x. Returns the results in a new list (lapply stands for list-apply). FUN can be an already existing function, or a new function that you created.
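As a rough illustration of the idea, the sketch below runs the same calculation on every element of a toy list, first with lapply() and sapply(), then with an equivalent for loop. The list scores and its values are made up for this example.

```r
# A made-up list of numeric vectors.
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() applies a function to every element and returns a list.
means_list <- lapply(scores, mean)

# sapply() does the same, but simplifies the result to a vector when it can.
means_vec <- sapply(scores, mean)

# FUN can also be a new function written on the spot.
spread <- sapply(scores, function(v) max(v) - min(v))

# The equivalent for loop needs more code: create the output, then fill it.
means_loop <- numeric(length(scores))
names(means_loop) <- names(scores)
for (nm in names(scores)) {
  means_loop[nm] <- mean(scores[[nm]])
}
```

Both approaches give the same numbers here; the apply-like version simply takes fewer lines and keeps the element names automatically.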
# In fact, whatever we can do on an element of a list, we can put in a function
# and do it simultaneously on all the elements of that list via lapply/sapply.

# Put the code above into a function.
family.deg <- function(gr) {
  deg <- degree(gr, V(gr)[relation==1])
  mean(deg)
}

## Run the function on all the 102 personal networks.
family.degrees <- sapply(graph.list, family.deg)

# This generated a vector with average degree of close family members for all
# egos. Vector names are ego IDs.
head(family.degrees)
##       28       29       33       35       39       40
## 14.80000 14.80000 19.33333 24.25000 14.50000 21.00000

# ***** EXERCISE 1:
# Write a function to calculate the max betweenness of an alter in a personal
# network. Use a graph in graph.list to try the function. Then lapply the
# function to graph.list. Finally, sapply the function to graph.list. Which is
# better in this case, lapply or sapply? HINTS: ?betweenness
# *****

# ***** EXERCISE 2:
# Write a function that returns the nationality of the alter with maximum
# betweenness in a personal network. Use a graph in graph.list to try the
# function. Sapply the function to graph.list. HINT: If multiple alters have
# betweenness equal to the maximum betweenness, just pick the first alter
# using indexing, i.e. [1].
# *****
5 Advanced tools for split-apply-combining
• In many different scenarios we use the so-called “Split-Apply-Combine” strategy. Split-apply-combining is what we do whenever we have a single file or object with all our data and:
  1. We split the object into pieces according to one or multiple (combinations of) categorical variables (split).
  2. We apply exactly the same kind of calculation on each piece, identically and independently (apply).
  3. We pick the results and re-combine them together, e.g. into a new dataset (combine).
• The split-apply-combine strategy is essential in ego-network analysis. With ego-networks, we are repeatedly (1) splitting the data into pieces, each piece typically corresponding to one ego; (2) performing identical and independent analyses on each piece (each ego-network); (3) recombining the results together, typically into a single dataset, to then associate them with other ego-level variables.
• for loops (see previous section) and apply-like functions (see previous section) are one way to perform split-apply-combining in R. In this section we’ll look at more advanced ways, which are often quicker and more efficient:
  – The aggregate function.
  – The plyr package.
  – The dplyr and purrr packages.
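The three steps can be sketched in base R on a toy alter-level data set. The data frame alters, its variable names and its values are all made up for this illustration; the second part uses the aggregate function mentioned above to do the same thing in one call.

```r
# Made-up alter-level data: one row per alter, with the ID of the alter's ego.
alters <- data.frame(
  ego_ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  closeness = c(5, 3, 4, 2, 4, 1, 5, 3, 3)
)

# 1. Split: one piece (vector of closeness values) per ego.
pieces <- split(alters$closeness, alters$ego_ID)

# 2. Apply: the same calculation on each piece.
means <- sapply(pieces, mean)

# 3. Combine: put the results back together in an ego-level data frame.
result <- data.frame(ego_ID = names(means), mean.clo = unname(means))

# aggregate() performs the three steps in a single call.
agg <- aggregate(closeness ~ ego_ID, data = alters, FUN = mean)
```

The result has one row per ego, so it can then be merged with other ego-level variables.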
  # Set "white" as background color for the plot.
  par(bg="white")
  # Plot the graph
  plot(gr)
}

# Let's print an example of my.plot() to the R GUI by plotting the personal
# network of ego ID 45.
## Set seed for reproducibility (so we always get the same network layout).
set.seed(613)
## Plot
my.plot(graph.list[["45"]])

# Let's create a subfolder called "Figures" in our working directory, where all
# the figures created in this section will be saved. (Only if the directory
# does not already exist).
if (!dir.exists("Figures")) dir.create("Figures")

# Let's now plot the first 10 personal networks. We won't print the plots in the
# GUI, but we'll export each of them to a separate png file.
for (i in 1:10) {

  # Get the graph
  gr <- graph.list[[i]]
  # Get the graph's ego ID
  ego_ID <- names(graph.list)[i]

  # Set seed for reproducibility (so we always get the same network layout).
  set.seed(613)

  # Open png device to print the plot to an external png file (note that the ego
  # ID is written to the file name).
  png(file= paste("./Figures/plot.", ego_ID, ".png", sep=""), width= 800, height= 800)
[Figure: scatterplot of Avg degree of Italians (y axis) against N Italians (x axis)]
## Print plot to external pdf file
pdf("./Figures/N.avg.deg.ita.2.pdf", width= 8, height= 6)
print(p)
dev.off()
## pdf
##   2
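The open device, draw, close device sequence used above works with any kind of plot. The sketch below repeats it with a base R plot written to R's temporary directory; the file name demo_plot.pdf and the plotted values are made up for this illustration.

```r
# Path of the output file, in R's temporary directory (hypothetical file name).
out_file <- file.path(tempdir(), "demo_plot.pdf")

# 1. Open the pdf device: from now on, plots are drawn into the file.
pdf(out_file, width = 8, height = 6)

# 2. Draw the plot.
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")

# 3. Close the device: this is what actually finalizes the file.
#    dev.off() returns the device that becomes current after closing.
now_current <- dev.off()

# Check that the file was written.
file.exists(out_file)
```

If dev.off() is forgotten, the file stays open and later plots keep going into it instead of appearing in the GUI.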
# Same as above, with color representing ego's educational level.

## Set seed for reproducibility (jittering is random.)
set.seed(613)
## Get and save plot
p <- ggplot(data= data, aes(x= N.ita, y= avg.deg.ita)) +
  geom_point(size=1.5, aes(color= as.factor(educ)),
             position= position_jitter(height=0.5, width=0.5)) +
  theme(legend.position="bottom") +
  labs(x= "N Italians", y= "Avg degree of Italians", color= "Education")
## Note that we slightly jitter the points to avoid overplotting.
## Print plot in R GUI
print(p)
[Figure: scatterplot of Prop Italians in personal network (y axis) against Year of arrival in Italy (x axis)]

## Print plot to external pdf file
pdf("./Figures/prop.ita.arr.pdf", width= 8, height= 6)
print(p)
dev.off()
## pdf
##   2

# Do Sri Lankans with more Italians in their personal network have a higher
# income?

## To avoid warnings from ggplot(), let's remove cases with NA values on the
## relevant variables.
index <- complete.cases(ego.data[,c("prop.ita", "income")])
data <- ego.data[index,]
## Set seed for reproducibility (jittering is random.)
set.seed(613)
## Get and save plot
p <- ggplot(data= data, aes(x= prop.ita, y= income)) +
  geom_jitter(width= 0.01, height= 50, shape= 1, s
## Print plot in R GUI
print(p)
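The complete.cases() step used above can be checked on a small example. In this sketch the data frame ego and its values are made up; only the column names prop.ita and income follow the code above.

```r
# Made-up ego-level data with some missing values.
ego <- data.frame(
  prop.ita = c(0.1, NA, 0.3, 0.2),
  income   = c(900, 1200, NA, 700),
  age      = c(30, 40, 50, 60)
)

# TRUE for rows with no NA on the two relevant variables.
index <- complete.cases(ego[, c("prop.ita", "income")])
index

# Keep only those rows (all columns).
clean <- ego[index, ]
nrow(clean)
```

Only the variables passed to complete.cases() are checked, so rows with missing values on other variables would still be kept.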