100 Data Science in R Interview Questions and Answers for 2016 04 Dec 2015
Latest update made on November 22, 2016.

R programming is one of the languages that data scientists have to be familiar with. In most data science job interviews, questions about coding in R will be asked, and applicants are expected to be well versed in the nitty-gritty of R. We got together with our data science faculty, who are experts in the field and have worked as senior data scientists themselves, to bring together a list of questions that might be asked in data science interviews. These interview questions relate only to R programming, and though the list is not exhaustive, it will be useful to go through it while preparing for data science jobs.

Data science is a vast field, that is true, but when it comes to cracking interviews for data science jobs, knowledge of either R or Python is important to get started with your data science career. Being a data scientist means that you are usually applying for a top-level position. A fresh college graduate will not be hired for a data scientist position, as this is one position that demands experience, maturity and in-depth knowledge of data science concepts and of the industry that data scientists get hired in. The interview questions below are specific to the R programming language, which has remained the preferred language for data scientists through the years.

In our previous post, 100 Data Science Interview Questions, we listed all the general statistics, data, mathematics and conceptual questions that are asked in interviews. These articles have been divided into 3 parts, each focusing on a topic-wise distribution of interview questions. Below are some of the questions that may be asked during a data science interview, specifically related to R programming.
Data Science Interview Questions and Answers in R Programming

1) How can you merge two data frames in R language?
Data frames in R language can be merged manually using the cbind() function or by using the merge() function on common rows or columns.

2) Explain about data import in R language
R Commander can be used to import data in R language. To start the R Commander GUI, the user must load the Rcmdr package by typing library(Rcmdr) into the console. There are 3 different ways in which data can be imported in R language:
• Users can select the data set in the dialog box or enter the name of the data set (if they know it).
• Data can also be entered directly using the editor of R Commander via Data -> New Data Set. However, this works well only when the data set is not too large.
• Data can also be imported from a URL, from a plain text file (ASCII), from any other statistical package or from the clipboard.

3) Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2). What will be the output of vector Z that is defined as Z <- X*Y?
In R language, when the vectors have different lengths, the elements of the shorter vector are reused from the start until every element of the longer vector has been multiplied. The output of the above code will be Z = (3, 4, 4), together with a warning that the longer object length is not a multiple of the shorter object length.

4) How are missing values and impossible values represented in R language?
NaN (Not a Number) is used to represent impossible values whereas NA (Not Available) is used to represent missing values. The best way to answer this question would be to mention that deleting missing values is not a good idea, because the probable cause for a missing value could be some problem with data collection, programming or the query. It is good to find the root cause of the missing values and then take the necessary steps to handle them.

5) R language has several packages for solving a particular problem. How do you make a decision on which one is the best to use?
The CRAN package ecosystem has more than 6000 packages. The best way for beginners to answer this question is to mention that they would look for a package that follows good software development principles. The next thing would be to look for user reviews and find out if other data scientists or analysts have been able to solve a similar problem.

6) Which function in R language is used to find out whether the means of 2 groups are equal to each other or not?
t.test()

7) What is the best way to communicate the results of data analysis using R language?
The best possible way to do this is to combine the data, code and analysis results in a single document using knitr for reproducible research. This helps others to verify the findings, add to them and engage in discussions. Reproducible research makes it easy to redo the experiments by inserting new data and applying it to a different problem.

8) How many data structures does R language have?
R language has homogeneous and heterogeneous data structures. Homogeneous data structures hold objects of the same type: vector, matrix and array. Heterogeneous data structures hold objects of different types: data frames and lists.

9) What is the value of f(2) for the following R code?
b <- 4
f <- function(a)
{
  b <- 3
  b^3 + g(a)
}
g <- function(a)
{
  a*b
}
The answer to the above code snippet is 35. The value of "a" passed to the function is 2 and the value for "b" defined in the function f(a) is 3. So the output would be 3^3 + g(2). The function g is defined in the global environment and it takes the value of b as 4 (due to lexical scoping in R), not 3, returning a value 2*4 = 8 to the function f. The result will be 3^3 + 8 = 35.

10) What is the process to create a table in R language without using external files?
MyTable <- data.frame()
MyTable <- edit(MyTable)
The above code will open a spreadsheet-like data editor for entering data into MyTable.

11) Explain about the significance of transpose in R language
Transpose t() is the easiest method for reshaping the data before analysis.
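As a minimal sketch of what t() does, the following uses a small made-up matrix and data frame (the names m and scores are illustrative only, not from the article). Note that transposing a data frame yields a matrix.

```r
# Hypothetical 2 x 3 matrix used only for illustration
m <- matrix(1:6, nrow = 2, ncol = 3)
tm <- t(m)                 # rows become columns, so tm is 3 x 2
dim(tm)

# Transposing a data frame returns a matrix, not a data frame
scores <- data.frame(math = c(90, 80), art = c(70, 60),
                     row.names = c("ann", "bob"))
t_scores <- t(scores)      # subjects are now rows, people are columns
is.matrix(t_scores)
```

Because the result of t() on a data frame is a matrix, all values are coerced to a single type, which matters when the columns mix numbers and text.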
12) What are with() and by() functions used for?
with() function is used to evaluate an expression using the columns of a given dataset, and by() function is used for applying a function to each level of a factor.

13) dplyr package is used to speed up data frame management code. Which package can be integrated with dplyr for large fast tables?
data.table

14) In base graphics system, which function is used to add elements to a plot?
text() (and similarly points() or lines()); boxplot() can also draw onto an existing plot when called with add = TRUE.

15) What are the different types of sorting algorithms available in R language?
Bucket Sort
Selection Sort
Quick Sort
Bubble Sort
Merge Sort

15) What is the command used to store R objects in a file?
save(x, file = "x.Rdata")

16) What is the best way to use Hadoop and R together for analysis?
HDFS can be used for storing the data for the long term. MapReduce jobs submitted from either Oozie, Pig or Hive can be used to encode, improve and sample the data sets from HDFS into R. This helps to leverage complex analysis tasks on the subset of data prepared in R.

17) What will be the output of log(-5.8) when executed on R console?
Executing the above on the R console returns NaN (Not a Number) with a warning that NaNs were produced, because it is not possible to take the log of a negative number.

18) How is a Date object represented internally in R language?
A Date is stored as the number of days since 1970-01-01, which can be seen with:
unclass(as.Date("2016-10-05"))

19) What will be the output of the below code?
printmessage <- function(a) {
  if (is.na(a))
    print("a is a missing value!")
  else if (a < 0)
    print("a is less than zero")
  else
    print("a is greater than or equal to zero")
  invisible(a)
}
printmessage(NA)
The output for the above R programming code will be "a is a missing value!" The function is.na() is used to check if the input passed is a missing value.

20) Which package in R supports the exploratory analysis of genomic data?
adegenet

21) What is the difference between a data frame and a matrix in R?
A data frame can contain heterogeneous inputs while a matrix cannot. In a matrix only similar data types can be stored, whereas in a data frame there can be different data types like characters, integers or other data frames.

22) How can you add datasets in R?
rbind() function can be used to add datasets in R language, provided the columns in the datasets are the same.

23) How do you split a continuous variable into different groups/ranks in R?
The cut() function divides a continuous variable into intervals and returns a factor.

24) What are factor variables in R language?
Factor variables are categorical variables that hold either string or numeric values. Factor variables are used in various types of graphics and particularly for statistical modelling, where the correct number of degrees of freedom is assigned to them.

25) What is the memory limit in R?
8 TB is the memory limit for 64-bit system memory and 3 GB is the limit for 32-bit system memory.

26) What are the data types in R on which binary operators can be applied?
Scalars, matrices and vectors.

27) How do you create log linear models in R language?
Using the loglm() function.

28) What will be the class of the resulting vector if you concatenate a number and NA?
numeric

29) What is meant by K-nearest neighbour?
K-Nearest Neighbour is one of the simplest machine learning classification algorithms. It is a supervised learning method based on lazy learning: the function is approximated locally and any computations are deferred until classification.

30) What will be the class of the resulting vector if you concatenate a number and a character?
character

31) Write code to build an R function powered by C?
Compiled C code can be called from R through the .C() or .Call() interfaces; the Rcpp package offers a similar bridge for C++.

32) If you want to know all the values in c(1, 3, 5, 7, 10) that are not in c(1, 5, 10, 12, 14), which in-built function in R can be used to do this? Also, how can this be achieved without using the in-built function?
Using the in-built function: setdiff(c(1, 3, 5, 7, 10), c(1, 5, 10, 12, 14))
Without using the in-built function: c(1, 3, 5, 7, 10)[!c(1, 3, 5, 7, 10) %in% c(1, 5, 10, 12, 14)]

33) How can you debug and test R programming code?
R code can be tested using Hadley's testthat package.

34) What will be the class of the resulting vector if you concatenate a number and a logical?
numeric

35) Write a function in R language to replace the missing value in a vector with the mean of that vector.
mean_impute <- function(x) {x[is.na(x)] <- mean(x, na.rm = TRUE); x}

36) What happens if the application object is not able to handle an event?
The event is dispatched to the delegate for processing.

37) Differentiate between lapply and sapply
If the programmer wants the output to be a vector or matrix, then the sapply function is used, whereas if the programmer wants the output to be a list then lapply is used. There is one more function known as vapply, which is preferred over sapply because vapply allows the programmer to specify the output type. The disadvantage of using vapply is that it is more verbose and slightly harder to write.

38) Differentiate between seq(6) and seq_along(6)
seq_along(6) will produce a vector of length 1, namely 1, because its argument has a single element, whereas seq(6) will produce a sequential vector from 1 to 6, c(1, 2, 3, 4, 5, 6).

39) How will you read a .csv file in R language?
read.csv() function is used to read a .csv file in R language.

41) How will you verify if a given object "X" is a matrix data object?
If the function call is.matrix(X) returns TRUE then X can be termed as a matrix data object.

42) What do you understand by element recycling in R?
If two vectors with different lengths perform an operation, the elements of the shorter vector will be re-used to complete the operation. This is referred to as element recycling.

43) How will you verify if a given object "X" is a matrix data object?
If the function call is.matrix(X) returns TRUE then X can be considered a matrix data object, otherwise not.
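The behaviour described in questions 37 and 38 can be checked with a short sketch. The list x below is a made-up input, not taken from the article; the calls themselves are base R.

```r
x <- list(a = 1:3, b = 4:6)   # hypothetical input list

res_list <- lapply(x, sum)    # always returns a list
res_vec  <- sapply(x, sum)    # simplified to a named integer vector here
# vapply() makes the expected type and length of each result explicit
res_safe <- vapply(x, sum, FUN.VALUE = integer(1))

s1 <- seq(6)                  # 1 2 3 4 5 6
s2 <- seq_along(6)            # 1, because the argument has one element
s3 <- seq_along(c("p", "q", "r"))   # 1 2 3, one index per element
```

seq_along() is usually the safer choice for loop indices, since it yields an empty sequence for an empty vector instead of counting down from 1 to 0.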
44) How will you measure the probability of a binary response variable in R language?
Logistic regression can be used for this, and the function glm() in R language provides this functionality.

45) What is the use of sample and subset functions in R programming language?
sample() function can be used to select a random sample of size n from a huge dataset. subset() function is used to select variables and observations from a given dataset.

46) There is a function fn(a, b, c, d, e) = a + b * c - d / e. Write the code to call fn on the vector c(1, 2, 3, 4, 5) such that the output is the same as fn(1, 2, 3, 4, 5).
do.call(fn, as.list(c(1, 2, 3, 4, 5)))

47) How can you resample statistical tests in R language?
The coin package in R provides various options for re-randomization and permutation-based statistical tests. When test assumptions cannot be met, this package serves as the best alternative to classical methods, as it does not assume random sampling from well-defined populations.

48) What is the purpose of using the next statement in R language?
If a developer wants to skip the current iteration of a loop in the code without terminating it, then they can use the next statement. Whenever R comes across the next statement in the code, it skips the rest of the current iteration and jumps to the next iteration of the loop.

49) How will you create scatterplot matrices in R language?
A matrix of scatterplots can be produced using pairs(). The pairs() function takes various parameters like formula, data, subset, labels, etc. The two key parameters required to build a scatterplot matrix are:
• formula: A formula like ~a+b+c. Each term gives a separate variable in the pairs plot, and the terms should be numerical vectors. It represents the series of variables used in pairs.
• data: The dataset from which the variables have to be taken for building a scatterplot.

50) How will you check if an element 25 is present in a vector?
There are various ways to do this:
i. It can be done using the match() function, which returns the position of the first appearance of a particular element.
ii. The other is to use %in%, which returns a Boolean value, either TRUE or FALSE.
iii. is.element() function also returns a Boolean value, either TRUE or FALSE, based on whether the element is present in the vector or not.

51) What is the difference between library() and require() functions in R language?
There is no real difference between the two if the packages are not being loaded inside a function. require() is usually used inside functions and throws a warning (returning FALSE) whenever a particular package is not found. On the flip side, library() gives an error message if the desired package cannot be loaded.

52) What are the rules to define a variable name in R programming language?
A variable name in R programming language can contain digits and letters along with special characters like dot (.) and underscore (_). Variable names in R language can begin with a letter or the dot symbol. However, if the variable name begins with a dot symbol, it should not be followed by a numeric digit.

53) What do you understand by a workspace in R programming language?
The current R working environment of a user, holding user-defined objects like lists, vectors, etc., is referred to as the workspace in R language.

54) Which function helps you perform sorting in R language?
order() (it returns the ordering permutation; sort() returns the sorted values directly).

55) How will you list all the data sets available in all R packages?
Using the below line of code:
data(package = .packages(all.available = TRUE))

56) Which function is used to create a histogram visualisation in R programming language?
hist()

57) Write the syntax to set the path for current working directory in R environment
setwd("dir_path")

58) How will you drop variables using indices in a data frame?
Let's take a dataframe:
df <- data.frame(v1 = c(1:5), v2 = c(2:6), v3 = c(3:7), v4 = c(4:8))
df
##   v1 v2 v3 v4
## 1  1  2  3  4
## 2  2  3  4  5
## 3  3  4  5  6
## 4  4  5  6  7
## 5  5  6  7  8
Suppose we want to drop variables v2 and v3. They can be dropped using negative indices as follows:
df1 <- df[-c(2, 3)]
df1
##   v1 v4
## 1  1  4
## 2  2  5
## 3  3  6
## 4  4  7
## 5  5  8
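The column-dropping step in question 58 can be sketched as runnable code. The name-based variant (df2) is an addition for comparison, not part of the original answer; it avoids depending on column positions.

```r
df <- data.frame(v1 = 1:5, v2 = 2:6, v3 = 3:7, v4 = 4:8)

# Drop the 2nd and 3rd columns by negative index
df1 <- df[-c(2, 3)]

# Alternative (an assumption for illustration): drop the same columns by name
df2 <- df[, !(names(df) %in% c("v2", "v3")), drop = FALSE]

names(df1)
```

Dropping by name tends to be more robust when the column order of the input can change.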
59) What will be the output of runif(7)?
It will generate 7 random numbers between 0 and 1.

60) What is the difference between rnorm and runif functions?
rnorm function generates "n" normal random numbers based on the mean and standard deviation arguments passed to the function.
Syntax of rnorm function:
rnorm(n, mean = , sd = )
runif function generates "n" uniform random numbers in the interval between the minimum and maximum values passed to the function.
Syntax of runif function:
runif(n, min = , max = )
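A brief sketch of both generators follows. The seed, sample sizes and parameter values are arbitrary choices for illustration.

```r
set.seed(42)                           # arbitrary seed, for reproducibility

u <- runif(7)                          # 7 draws from Uniform(0, 1), the defaults
v <- runif(5, min = 2, max = 4)        # 5 draws from Uniform(2, 4)
n <- rnorm(1000, mean = 10, sd = 2)    # 1000 draws from Normal(10, 2)

range(u)   # stays inside [0, 1]
mean(n)    # close to 10
sd(n)      # close to 2
```

With the defaults, runif() samples from [0, 1] and rnorm() from the standard normal distribution (mean 0, sd 1).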
61) What will be the output on executing the following R programming code?
mat
sum(mat)

62) How will you combine multiple different strings like "Data", "Science", "in", "R", "Programming" into a single string "Data_Science_in_R_Programming"?
paste("Data", "Science", "in", "R", "Programming", sep = "_")

63) Write a function to extract the first name from the string "Mr. Tom White"
substr("Mr. Tom White", start = 5, stop = 7)

64) Can you tell if the equation given below is linear or not?
Emp_sal = 2000 + 2.5(emp_age)^2
Yes, it is a linear equation, as it is linear in its coefficients even though emp_age enters as a squared term.

65) What will be the output of the following R programming code?
var2 <- c("I", "Love, "DeZyre")
var2
It will give an error, because the quote after Love is never closed.

66) What will be the output of the following R programming code?
x <- 5
if(x%%2==0)
print("X is an even number")
else
print("X is an odd number")
Executing the above code will result in an error as shown below:
## Error: <text>:4:1: unexpected 'else'
## 3: print("X is an even number")
## 4: else
##    ^
R programming language does not know whether the else is related to the first if or not, as the first if() is a complete command on its own.

67) I have a string "contact@dezyre.com". Which string function can be used to split the string into two different strings "contact@dezyre" and "com"?
This can be accomplished using the strsplit function, which splits a string based on the identifier given in the function call. The output of the strsplit() function is a list. Because split is treated as a regular expression by default, the dot must be matched literally:
strsplit("contact@dezyre.com", split = ".", fixed = TRUE)
Output of the strsplit function is:
## [[1]]
## [1] "contact@dezyre" "com"

68) What is R Base package?
The R base package provides the basic functionality in the R environment, like arithmetic calculations and input/output.

69) How will you merge two dataframes in R programming language?
merge() function is used to combine two dataframes, and it identifies common rows or columns between the 2 dataframes. merge() function basically finds the intersection between two different sets of data. merge() function in R language takes a long list of arguments, as follows.
Syntax for using the merge function in R language:
merge(x, y, by.x, by.y, all.x or all.y or all)
• x represents the first dataframe.
• y represents the second dataframe.
• by.x: Variable name in dataframe x that is common in y.
• by.y: Variable name in dataframe y that is common in x.
• all.x: A logical value that specifies the type of merge. all.x should be set to TRUE if we want all the observations from dataframe x. This results in a left join.
• all.y: A logical value that specifies the type of merge. all.y should be set to TRUE if we want all the observations from dataframe y. This results in a right join.
• all: The default value for this is FALSE, which means that only matching rows are returned, resulting in an inner join. This should be set to TRUE if you want all the observations from dataframes x and y, resulting in an outer join.

70) Write the R programming code for an array of words so that the output is displayed in decreasing frequency order
R programming code to display output in decreasing frequency order:
tt
Output:
1) a a1 b
2) 3 3 2
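One possible way to get a decreasing frequency order, sketched here with a made-up word vector (words is hypothetical and this is not necessarily the article's original code), is to sort the result of table():

```r
# Hypothetical input words, for illustration only
words <- c("a", "b", "a", "c", "b", "a")

# Count each word, then sort the counts from most to least frequent
freq <- sort(table(words), decreasing = TRUE)

names(freq)        # words, most frequent first
as.vector(freq)    # the matching counts
```

Indexing the sorted table, for example freq[1:2], keeps only the most frequent entries.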
71) How to check the frequency distribution of a categorical variable?
The frequency distribution of a categorical variable can be checked using the table function in R language. table() function calculates the count of each category of a categorical variable.
gender = factor(c("M", "F", "M", "F", "F", "F"))
table(gender)
Output of the above R code:
gender
F M
4 2
Programmers can also calculate the % of values for each categorical group by storing the output in a dataframe and computing the column percentage, as shown below:
t = data.frame(table(gender))
t$percent = round(t$Freq / sum(t$Freq) * 100, 2)

Gender  Frequency  Percent
F       4          66.67
M       2          33.33

72) What is the procedure to check the cumulative frequency distribution of any categorical variable?
The cumulative frequency distribution of a categorical variable can be checked using the cumsum() function in R language.
Example:
gender = factor(c("f", "m", "m", "f", "m", "f"))
y = table(gender)
cumsum(y)
Output of the above R code:
f m
3 6

73) What will be the result of multiplying two vectors in R having different lengths?
The multiplication of the two vectors will be performed and the output will be displayed with a warning message like "longer object length is not a multiple of shorter object length". Suppose a vector a is one element longer than a vector b: a*b is still computed, along with the warning message. The multiplication is performed in a sequential manner, but since the lengths are not the same, the first element of the smaller vector b will be multiplied with the last element of the larger vector a.
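The recycling behaviour described in question 73 can be reproduced with a short sketch. The vectors a and b hold made-up values chosen only to trigger the warning.

```r
a <- c(1, 2, 3)   # hypothetical longer vector
b <- c(2, 3)      # hypothetical shorter vector

# b is recycled as 2, 3, 2, so the products are 1*2, 2*3, 3*2
res <- suppressWarnings(a * b)

# Capture the warning text instead of printing it
msg <- tryCatch(a * b, warning = function(w) conditionMessage(w))

# No warning when the longer length is a multiple of the shorter one
ok <- c(1, 2, 3, 4) * c(10, 100)
```

Here the last element of a is paired with the first element of b, which is the case the answer above describes.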
1. What is R?
R is a programming language which is used for developing statistical software and for data analysis.

2. How are R commands written?
R commands are typed at the console prompt or saved in a script. Text following # on a line is a comment, for example # division.

3. What is t.test() in R?
The t.test() function is used to determine whether the means of two groups are equal or not.

4. What are the disadvantages of R programming?
The disadvantages are:
• Lack of standard GUI
• Not good for big data
• Does not provide a spreadsheet view of data

5. What is the use of With () and

11. How can you produce correlations and covariances?
Correlations are produced by the cor() function and covariances by the cov() function.

12. What is the difference between matrix and dataframes?
A dataframe can contain different types of data but a matrix can contain only one type of data.

13. What is the difference between lapply and sapply?
lapply returns its output in the form of a list, whereas sapply simplifies the output to a vector or matrix where possible.

14. What is the difference between seq(4) and seq_along(4)?
seq(4) means a vector from 1 to 4 (c(1, 2, 3, 4)), whereas seq_along(4) means a vector of the length of its argument, which is 1 (c(1)).

15. Explain how you can start the R Commander GUI?
Loading the package with library(Rcmdr) starts the R Commander GUI.

16. What is the memory limit of R?
On a 32-bit system the memory limit is 3 GB, but most versions are limited to 2 GB, and on a 64-bit system the memory limit is 8 TB.

17. How many data structures does R have?
There are 5 data structures in R: vector, matrix and array, which are of homogeneous type, and list and data frame, which are heterogeneous.

18. Explain how data is aggregated in R?
There are two steps: collapsing data by using one or more BY variables, and applying the aggregate() function, in which the BY variables should be given in a list.

19. How many sorting algorithms are available?
There are 5 types of sorting algorithms in use:
• Bubble Sort
• Selection Sort
• Merge Sort
• Quick Sort
• Bucket Sort

20. How to create a new variable in R programming?
For creating a new variable, the assignment operator '<-' is used.
It is used for experimental design. It is used to determine the effect of a given sample size.

26. Which package is used for power analysis in R?
The pwr package is used for power analysis in R.

27. Which method is used for exporting the data in R?
There are many ways to export the data into other formats like SPSS, SAS, Stata and Excel spreadsheets.

28. Which packages are used for exporting of data?
For Excel the xlsReadWrite package is used, and for SAS, SPSS and Stata the foreign package is used.

29. How are impossible values represented in R?
In R, NaN is used to represent impossible values.

30. Which command is used for storing an R object into a file?
The save command is used for storing R objects into a file.
Syntax: > save(z, file = "z.Rdata")

31. Which command is used for restoring an R object from a file?
The load command is used for restoring R objects from a file.
Syntax: > load("z.Rdata")

32. What is the use of the coin package in R?
The coin package is used for re-randomization or permutation-based statistical tests.

33. Which function is used for sorting in R?
The order() function is used to perform the sorting.

34. What is the use of tapply?
tapply() applies a function to each group of values of a vector, where the groups are defined by a factor.

35. What happens when the application object does not handle an event?
The event will be dispatched to your delegate for processing.

36. Explain app specific objects which store the app contents?
Data model objects are app specific objects and store the app's content. Apps can also use document objects.

37. Explain the purpose of using the UIWindow object?
The UIWindow object coordinates the one or more views presented on the screen.

38. Tell me the super class of all view controller objects?
The UIViewController class.

39. How to create axes in the graph?
Custom axes are created using the axis() function.

40. What is the use of the abline() function?
The abline() function adds a reference line to a graph.
Syntax: abline(h = yvalues, v = xvalues)

41. Why is the vcd package used?
The vcd package provides different methods for visualizing multivariate categorical data.

42. What is GGobi?
GGobi is an open source visualization program for exploring high-dimensional typed data.

43. What is iPlots?
It is a package which provides bar plots, mosaic plots, box plots, parallel plots, scatter plots and histograms.

44. What is the use of the lattice package?
The lattice package improves on base R graphics by giving better defaults, and it has the ability to easily display multivariate relationships.

45. What is the fitdistr() function?
It provides maximum likelihood fitting of univariate distributions. It is defined in the MASS package.

46. Which data structures are used to perform statistical analysis and create graphs?
The data structures are vectors, arrays, data frames and matrices.

47. What is the use of the sink() function?
It defines the direction of output.

48. Why is the library() function used?
Called without arguments, this function shows the packages which are installed.

49. Why is the search() function used?
With this function we see which packages are currently loaded.

50. On which type of data do binary operators work?
Binary operators work on matrices, vectors and scalars.

51. What is the use of the doBy package?
It is used to define the desired table using a function and a model formula.

52. Which function is used to create a frequency table?
A frequency table is created by the table() function.

53. Define the loglm() function
The loglm() function is used to create log-linear models.

54. What is the use of the corrgram() function?
The corrgram() function is used to plot correlograms.

55. How to create scatterplot matrices?
The pairs() or splom() function is used to create scatterplot matrices.

56. What is npmc?
It is a package which gives nonparametric multiple comparisons.

57. What is the use of diagnostic plots?
They are used to check normality, heteroscedasticity and influential observations.

58. Define the anova() function
anova() is used to compare nested models.

59. What is the cv.lm() function?
It is defined in the DAAG package and is used for k-fold validation.

60. Define the stepAIC() function
It is defined in the MASS package and performs stepwise model selection by exact AIC.

61. Define leaps()
It is used to perform all-subsets regression and it is defined in the leaps package.

62. Define the relaimpo package
It is used to measure the relative importance of each of the predictors in the model.

63. Why is the car package used?
It provides a variety of regression tools, including scatter plots and variable plots, and it also offers enhanced diagnostics.

64. Define the robust package
It provides a library of robust methods, including regression.

65. What is robustbase?
It is a package which provides basic robust statistics, including model selection methods.

66. Define plotmeans()
It is defined in the gplots package and produces a means plot for single factors, including confidence intervals.

67. What is the full form of MANOVA?
MANOVA stands for multivariate analysis of variance.

68. What is the use of MANOVA?
By using MANOVA we can test more than one dependent variable simultaneously.

69. Define mshapiro.test()
It is a function defined in the mvnormtest package. It performs the Shapiro-Wilk test for multivariate normality.

70. Define bartlett.test()
bartlett.test() provides a parametric k-sample test of the equality of variances.

71. What is fligner.test()?
It is a function which provides a non-parametric k-sample test of the equality of variances.

72. Define hovPlot()
It is defined in the HH package and provides a graphic test of homogeneity of variance based on Brown-Forsythe.

73. Which variables are represented by lower case letters?
Numerical variables are represented by lower case letters.

74. Which variables are represented by upper case letters?
Categorical factors are represented by upper case letters.

75. What is logistic regression?
Logistic regression is used to predict a binary outcome from a given set of continuous predictor variables.

76. Define Poisson regression
It is used to predict an outcome variable which represents counts from a given set of continuous predictor variables.

77. Define survival analysis
It includes a number of techniques which are used for modeling the time to an event.

78. What is the use of the survfit() function?
It estimates a survival distribution for one or more groups.

79. Define survdiff()
It tests for differences in survival distributions between two or more groups.

80. What is coxph()?
It is a function which is used to model the hazard function on a set of predictor variables.

81. In which package is survival analysis defined?
Survival analysis is defined in the survival package.
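As a rough sketch of how survfit(), survdiff() and coxph() from questions 78 to 80 fit together, the following assumes the survival package is installed and uses its bundled lung dataset. The choice of dataset and covariates is an assumption made for illustration only.

```r
library(survival)   # assumed to be installed (a recommended R package)

# Kaplan-Meier survival curves, one per sex group
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Log-rank test for a difference in survival between the groups
cmp <- survdiff(Surv(time, status) ~ sex, data = lung)

# Cox proportional hazards model with two predictors
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
coef(cox)
```

Surv() builds the response from the follow-up time and the event indicator, and all three functions take it on the left-hand side of the formula.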
82. What is the use of the MASS package?
MASS includes functions which perform linear and quadratic discriminant function analysis.

83. Define qda()
qda() prints a quadratic discriminant function.

84. Define lda()
lda() is used to print the discriminant functions, which are based on centered variables.

85. What is the use of the forecast package?
It provides functions which are used for automatic selection of ARIMA and exponential models.

86. Define auto.arima()
It is used to handle seasonal as well as non-seasonal ARIMA models.

87. What is the principal() function?
It is defined in the psych package and is used to extract and rotate the principal components.

88. What is FactoMineR?
It is a package which handles quantitative and qualitative variables. It also supports supplementary variables and observations.

89. What is the full form of CFA?
CFA stands for Confirmatory Factor Analysis.

90. What is the use of the boot.sem() function?
It is used to bootstrap the structural equation model.

91. What is the full form of SEM?
SEM stands for Structural Equation Modeling.

92. Which function performs classical multidimensional scaling?
The cmdscale() function is used to perform classical multidimensional scaling.

93. Define isoMDS()
This function is defined in the MASS package and performs nonmetric multidimensional scaling.

94. Which function performs individual difference scaling?
It is done by the indscal() function.

95. What is the pvclust() function?
It comes from the pvclust package and provides p-values for hierarchical clustering.

96. Define cluster.stats()
It is defined in the fpc package and provides a method for comparing the similarity of two cluster solutions using different validation criteria.

97. Why do we use the party package?
It is used to provide non-parametric regression for ordinal, nominal, censored and multivariate responses.

98. Which package provides bootstrapping?
The boot package provides bootstrapping.

99. Define the matlab package
The matlab package includes wrapper functions and variables which are used to replicate MATLAB function calls.

100. What is the use of the Matrix package?
The Matrix package includes functions which support sparse and dense matrices, building on libraries like LAPACK and BLAS.
R Programming: 35 Job Interview Questions and Answers
Posted by Laetitia Van Cauwenberge on December 6, 2015 at 9:00am
Read the questions. At the bottom, you will find a link to the answers.
The Questions First Set
1. Explain what is R?
2. List out some of the functions that R provides?
3. Explain how you can start the R commander GUI?
4. In R, how can you import data?
5. Mention what the R language does not do?
6. Explain how R commands are written?
7. How can you save your data in R?
8. Mention how you can produce correlations and covariances?
9. Explain what t-tests are in R?
10. Explain what the with() and by() functions in R are used for?
11. What are the data structures in R that are used to perform statistical analyses and create graphs?
12. Explain the general format of matrices in R?
13. In R, how are missing values represented?
14. Explain what transpose is?
15. Explain how data is aggregated in R?
16. What is the function used for adding datasets in R?
17. What is the use of the subset() function and the sample() function in R?
18. Explain how you can create a table in R without an external file?
Second Set
1. Data structure -- How many data structures does R have? How do you build a binary search tree in R?
2. Sorting -- How many sorting algorithms are available? Show me an example in R.
3. Low level -- How do you build an R function powered by C?
4. String -- How do you implement string operations in R?
5. Vectorization -- If you want to do a Monte Carlo simulation in R, how do you improve the efficiency?
6. Function -- How do you take a function as an argument of another function? What is the apply() function family?
7. Threading -- How do you do multi-threading in R?
8. Memory limit and database -- What is the memory limit of R? How do you avoid it? How do you use SQL in R?
9. Testing -- How do you do testing and debugging in R?
10. Software development -- How do you develop a package? How do you do version control?
Third Set
1. If I have a data.frame

    df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c(7, 8, 9))

How do I select the c(4, 5, 6)? How do I select the 1? How do I select the 5? What is df[, 3]? What is df[1, ]? What is df[2, 2]?
2. What is the difference between a matrix and a data frame?
3. If I concatenate a number and a character together, what will the class of the resulting vector be?
4. What if I concatenate a number and a logical?
5. What if I concatenate a number and NA?
6. What is the difference between sapply and lapply? When should you use one versus the other? Bonus: When should you use vapply?
7. What is the difference between seq(4) and seq_along(4)?
8. What is f(3) where:

    y <- 5
    f <- function(x) { y <- 2; y^2 + g(x) }
    g <- function(x) { x + y }

Why?
9. I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11, 13). How do I do that with one built-in function in R? How could I do it if that function didn't exist?
10. Can you write me a function in R that replaces all missing values of a vector with the mean of that vector?
11. How do you test R code? Can you write a test for the function you wrote in the previous question?
12. Say I have...

    fn <- function(a, b, c, d, e) a + b * c - d / e

How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1, 2, 3, 4, 5)? (No need to tell me the result, just how to do it.)
13.

    dplyr <- "ggplot2"
    library(dplyr)

Why does the dplyr package get loaded and not ggplot2?
14.

    mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }
    fn <- mystery_method(c(function(x) x + 1, function(x) x * x))
    fn(3)

What is the value of fn(3)? Can you explain what is happening at each step?
1.) If I have a data.frame df

Answer: seq(4) produces a vector from 1 to 4 (c(1, 2, 3, 4)), whereas seq_along(4) produces a vector of length(4), or 1 (c(1)).

6.) What is f(3) where:

    y <- 5
    f <- function(x) { y <- 2; y^2 + g(x) }
    g <- function(x) { x + y }

Why?
Answer: 12. In f(3), y is 2, so y^2 is 4. When evaluating g(3), y is the globally scoped y (5) instead of the y that is locally scoped to f, so g(3) evaluates to 3 + 5 or 8. The rest is just 4 + 8, or 12.

7.) I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11, 13). How do I do that with one built-in function in R? How could I do it if that function didn't exist?
Answer: setdiff(c(1, 4, 5, 9, 10), c(1, 5, 10, 11, 13)) and c(1, 4, 5, 9, 10)[!c(1, 4, 5, 9, 10) %in% c(1, 5, 10, 11, 13)].

8.) Can you write me a function in R that replaces all missing values of a vector with the mean of that vector?
Answer: mean_impute

9.) How do you test R code? Can you write a test for the function you wrote in the previous question?
Answer: You can use Hadley's testthat package. A test might look like this:

    test_that("It imputes the mean correctly", {
      expect_equal(mean_impute(c(1, 2, NA, 6)), c(1, 2, 3, 6))
    })
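The body of mean_impute did not survive in the answer above, so the following is only a minimal sketch of what such a function could look like. The implementation is an assumption (base R only), not the original author's code.

```r
# Hypothetical reconstruction: the original body of mean_impute is not shown
# in the text, so this is one plausible base R implementation.
mean_impute <- function(x) {
  # Replace every NA with the mean of the non-missing values
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

imputed <- mean_impute(c(1, 2, NA, 6))
print(imputed)
```

With this definition the NA is replaced by the mean of 1, 2 and 6, which is the behaviour the testthat example checks.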
10.) Say I have...

    fn <- function(a, b, c, d, e) a + b * c - d / e

How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1, 2, 3, 4, 5)? (No need to tell me the result, just how to do it.)
Answer: do.call(fn, as.list(c(1, 2, 3, 4, 5)))

11.)

    dplyr <- "ggplot2"
    library(dplyr)

Why does the dplyr package get loaded and not ggplot2?
Answer: deparse(substitute(dplyr)). library() does not evaluate its argument as a variable; it captures the name that was typed, so the package called dplyr is loaded regardless of the value stored in the variable dplyr.

12.)

    mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }
    fn <- mystery_method(c(function(x) x + 1, function(x) x * x))
    fn(3)

What is the value of fn(3)? Can you explain what is happening at each step?
Answer: Best seen in steps. fn(3) requires mystery_method to be evaluated first. mystery_method(c(function(x) x + 1, function(x) x * x)) evaluates to...

    function(z) Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), z)

Now, we can see the 3 in fn(3) is supposed to be z, giving us...

    Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), 3)

This Reduce call is wonky, taking three arguments. A three-argument Reduce call will initialize at the third argument, which is 3. The inner function, function(y, w) w(y), is meant to take an argument and a function and apply that function to the argument. Luckily for us, we have some functions to apply. That means we initialize at 3 and apply the first function, function(x) x + 1. 3 + 1 = 4. We then take the value 4 and apply the second function. 4 * 4 = 16.
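The step-by-step evaluation described above can be checked directly. The sketch below uses the accumulate argument of Reduce() to expose the intermediate values; the variable name steps is introduced here for illustration only.

```r
mystery_method <- function(x) {
  function(z) Reduce(function(y, w) w(y), x, z)
}
fn <- mystery_method(c(function(x) x + 1, function(x) x * x))

# Final value of the fold
result <- fn(3)

# accumulate = TRUE keeps the initial value and every intermediate result
steps <- Reduce(function(y, w) w(y),
                c(function(x) x + 1, function(x) x * x),
                3, accumulate = TRUE)
print(unlist(steps))
```

The accumulated values start at the initial value 3, then show the result of each function applied in turn.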
R INTERVIEW QUESTIONS AND ANSWERS
Deepanshu Bhalla

R is one of the most popular programming languages for performing statistical analysis and predictive modeling. Many recent surveys and studies claim that R holds a good percentage of market share in the analytics industry. A data scientist role generally requires a candidate to know R or Python. People who know R are generally paid more than Python and SAS programmers. R software has improved a lot in recent years. It supports parallel computing and integration with big data technologies.

R Interview Questions and Answers

The following is a list of the most frequently asked R programming interview questions with detailed answers. It includes some basic, advanced and trick questions related to R. It also covers interview questions related to data science with R.
1. How to determine data type of an object?
class() is used to determine the data type of an object. See the example below:

    x <- factor(1:5)
    class(x)

It returns factor.

To determine the structure of an object, use the str() function:

    str(x)

It returns "Factor w/ 5 levels".

Example 2:

    xx <- data.frame(var1 = c(1:5))
    class(xx)

It returns data.frame.

    str(xx)

It returns 'data.frame': 5 obs. of 1 variable: $ var1: int 1 2 3 4 5

2. What is the use of mode() function?
It returns the storage mode of an object.

    x <- factor(1:5)
    mode(x)

The above mode function returns numeric.

    x <- data.frame(var1 = c(1:5))
    mode(x)

It returns list.
3. Which data structure is used to store categorical variables?
R has a special data structure called factor to store categorical variables. Making a variable a factor tells R that it is nominal or ordinal.

    gender = c(1,2,1,2,1,2)
    gender = factor(gender)
    gender

4. How to check the frequency distribution of a categorical variable?
The table function is used to calculate the count of each category of a categorical variable.

    gender = factor(c("m","f","f","m","f","f"))
    table(gender)

If you want to include the % of values in each group, you can store the result in a data frame using the data.frame function and then calculate the column percent.

    t = data.frame(table(gender))
    t$percent = round(t$Freq / sum(t$Freq)*100,2)

5. How to check the cumulative frequency distribution of a categorical variable?
The cumsum function is used to calculate the cumulative sum of the counts of a categorical variable.

    gender = factor(c("m","f","f","m","f","f"))
    x = table(gender)
    cumsum(x)

If you want to see the cumulative percentage of values, see the code below:

    t = data.frame(table(gender))
    t$cumfreq = cumsum(t$Freq)
    t$cumpercent = round(t$cumfreq / sum(t$Freq)*100,2)
6. How to produce histogram?
The hist function is used to produce the histogram of a variable.

    df = sample(1:100, 25)
    hist(df, right=FALSE)

To improve the layout of the histogram, you can use the code below:

    colors = c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")
    hist(df, right=FALSE, col=colors, main="Main Title", xlab="X-Axis Title")

7. How to produce bar graph?
First calculate the frequency distribution with the table function and then apply the barplot function to produce the bar graph.

    mydata = sample(LETTERS[1:5],16,replace = TRUE)
    mydata.count = table(mydata)
    barplot(mydata.count)

To improve the layout of the bar graph, you can use the code below:

    colors = c("red", "yellow", "green", "violet", "orange", "blue", "pink", "cyan")
    barplot(mydata.count, col=colors, main="Main Title", xlab="X-Axis Title")

8. How to produce Pie Chart?
First calculate the frequency distribution with the table function and then apply the pie function to produce the pie chart.

    mydata = sample(LETTERS[1:5],16,replace = TRUE)
    mydata.count = table(mydata)
    pie(mydata.count, col=rainbow(12))
9. Multiplication of 2 vectors having different length
For example, you have two vectors as defined below:

    x <- c(4,5,6)
    y <- c(2,3)

If you run z <- x*y, what would be the output? What would be the length of z?
It returns 8 15 12 with the warning message "longer object length is not a multiple of shorter object length". The length of z is 3 as it has three elements.

First step: it multiplies the first element of vector x, i.e. 4, with the first element of vector y, i.e. 2, and the result is 8. In the second step, it multiplies the second element of vector x, i.e. 5, with the second element of vector y, i.e. 3, and the result is 15. In the next step, R multiplies the first element of the smaller vector (y) with the last element of the bigger vector x.

Suppose the vector x contained four elements as shown below:

    x <- c(4,5,6,7)
    y <- c(2,3)
    x*y

It returns 8 15 12 21. It works like this: (4*2) (5*3) (6*2) (7*3)
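The recycling rule described above can be made explicit by expanding the shorter vector by hand. This is a small sketch; the name y_recycled is introduced here for illustration, and the warning is suppressed only to keep the demonstration quiet.

```r
x <- c(4, 5, 6)
y <- c(2, 3)

# R recycles y to the length of x: 2 3 2
y_recycled <- rep(y, length.out = length(x))

# Same result as x * y, which would also emit a recycling warning
z <- suppressWarnings(x * y)
print(z)
print(x * y_recycled)

# With four elements, y is recycled exactly twice and no warning is raised
x4 <- c(4, 5, 6, 7)
print(x4 * y)
```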
10. What are the different data structures R contains?
R contains primarily the following data structures:
1. Vector
2. Matrix
3. Array
4. List
5. Data frame
6. Factor
The first three data types (vector, matrix, array) are homogeneous in behavior. It means all contents must be of the same type. The fourth and fifth data types (list, data frame) are heterogeneous in behavior. It implies they allow different types. And the factor data type is used to store categorical variables.

Explanation: Data Types (Structures) in R

11. How to combine data frames?
Let's prepare 2 vectors for demonstration:

    x = c(1:5)
    y = c("m","f","f","m","f")

The cbind() function is used to combine data by columns.

    z = cbind(x,y)

The rbind() function is used to combine data by rows.

    z = rbind(x,y)

While using the cbind() function, make sure the number of rows is equal in both datasets. While using the rbind() function, make sure both the number and the names of columns are the same. For data frames, rbind() matches columns by name and stops with an error if the names differ; for matrices, rows are appended by position, so wrong data could end up in a column.
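A short sketch of how cbind() and rbind() behave on the example above. Note that binding a numeric and a character vector yields a character matrix rather than a data frame; the data frames top, bottom and the other names below are made up for illustration.

```r
x <- c(1:5)
y <- c("m", "f", "f", "m", "f")

# cbind() on plain vectors gives a matrix, and a matrix holds one type only,
# so the numbers are coerced to character
m <- cbind(x, y)

# data.frame() keeps each column's own type
d <- data.frame(x, y)

# rbind() on data frames with matching column names stacks the rows
top    <- data.frame(id = 1:2, sex = c("m", "f"))
bottom <- data.frame(id = 3:4, sex = c("f", "m"))
stacked <- rbind(top, bottom)

# rbind() on data frames with different column names fails
mismatch <- try(rbind(data.frame(a = 1), data.frame(b = 2)), silent = TRUE)
print(inherits(mismatch, "try-error"))
```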
12. How to combine data by rows when different number of columns?
When the number of columns in the datasets is not equal, the rbind() function doesn't work to combine data by rows. For example, we have two data frames df and df2. The data frame df has 2 columns and df2 has only 1 variable. See the code below:

    df = data.frame(x = c(1:4), y = c("m","f","f","m"))
    df2 = data.frame(x = c(5:8))

The bind_rows() function from the dplyr package can be used to combine data frames when the number of columns does not match.

    library(dplyr)
    combdf = bind_rows(df,df2)

13. What are valid variable names in R?
A valid variable name consists of letters, numbers and the dot or underline characters. A variable name can start with either a letter or the dot followed by a character (not a number). A variable name such as .1var is not valid, but .var1 is valid. A variable name cannot be a reserved word. The reserved words are listed below:

    if else repeat while function for in next break
    TRUE FALSE NULL Inf NaN NA
    NA_integer_ NA_real_ NA_complex_ NA_character_

A variable name can have a maximum length of 10,000 bytes.
14. What is the use of with() and by() functions? What are its alternatives?
Suppose you have a data frame as shown below:

    df = data.frame(x=c(1:6), y=c(1,2,4,6,8,12))

You are asked to perform this calculation: (x+y) + (x-y). Most R programmers write code like the one below:

    (df$x + df$y) + (df$x - df$y)

Using the with() function, you can refer to your data frame once and make the above code compact and simpler:

    with(df, (x+y) + (x-y))

A similar result can be obtained with the pipe operator and mutate() in the dplyr package. See the code below:

    library(dplyr)
    df %>% mutate((x+y) + (x-y))

by() function in R
The by() function is similar to GROUP BY in SQL. It is used to perform a calculation by a factor or a categorical variable. In the example below, we are computing the mean of variable var2 by a factor var1.

    df = data.frame(var1=factor(c(1,2,1,2,1,2)), var2=c(10:15))
    with(df, by(df, var1, function(x) mean(x$var2)))

The group_by() function in the dplyr package can perform the same task.

    library(dplyr)
    df %>% group_by(var1) %>% summarise(mean(var2))
15. How to rename a variable?
In the example below, we are renaming variable var1 to variable1.

    df = data.frame(var1=c(1:5))
    colnames(df)[colnames(df) == 'var1'] <- 'variable1'

The rename() function in the dplyr package can also be used to rename a variable.

    library(dplyr)
    df = rename(df, variable1=var1)

16. What is the use of which() function in R?
The which() function returns the positions of the elements of a logical vector that are TRUE. In the example below, we are figuring out the row number in which the maximum value of a variable x is recorded.

    mydata = data.frame(x = c(1,3,10,5,7))
    which(mydata$x==max(mydata$x))

It returns 3 as 10 is the maximum value and it is at the 3rd row in the variable x.

17. How to calculate first non-missing value in variables?
Suppose you have three variables X, Y and Z and you need to extract the first non-missing value in each row of these variables.

    data = read.table(text="
    X Y Z
    NA 1 5
    3 NA 2
    ", header=TRUE)

The coalesce() function in the dplyr package can be used to accomplish this task.

    library(dplyr)
    data %>% mutate(var=coalesce(X,Y,Z))
18. How to calculate max value for rows?
Let's create a sample data frame:

    dt1 = read.table(text="
    X Y Z
    7 NA 5
    2 4 5
    ", header=TRUE)

With the apply() function, we can tell R to apply the max function rowwise. The argument na.rm = TRUE tells R to ignore missing values while calculating the max value. If it is not used, rows containing a missing value would return NA.

    dt1$var = apply(dt1,1, function(x) max(x,na.rm = TRUE))

19. Count number of zeros in a row

    dt2 = read.table(text="
    A B C
    8 0 0
    6 0 5
    ", header=TRUE)
    apply(dt2,1, function(x) sum(x==0))

20. Does the following code work?

    ifelse(df$var1==NA, 0,1)

It does not work. A logical comparison with NA returns NA, not TRUE or FALSE. This code works:

    ifelse(is.na(df$var1), 0,1)
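The difference between the two ifelse() calls can be seen on a small vector. This is a sketch; the vector v is made up for illustration.

```r
v <- c(2, NA, 5)

# Comparing with NA never yields TRUE or FALSE, only NA
cmp <- v == NA

# So ifelse() has nothing to branch on and returns NA everywhere
broken <- ifelse(v == NA, 0, 1)

# is.na() returns a proper logical vector, so ifelse() works as intended
fixed <- ifelse(is.na(v), 0, 1)

print(cmp)
print(broken)
print(fixed)
```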
21. What would be the final value of x after running the following program?

    x = 3
    mult <- function(j) {
      x = j * 2
      return(x)
    }
    mult(2)
    x

Answer: The value of x will remain 3.

It is because the x outside the function is a different variable from the x assigned inside it. If you want to change the value of x by running the function, you can use the following program:

    x = 3
    mult <- function(j) {
      x <<- j * 2
      return(x)
    }
    mult(2)
    x

The operator <<- assigns to a variable in the enclosing (parent) environment instead of creating a local one.
22. How to convert a factor variable to numeric?
Applied to a factor, the as.numeric() function returns the internal integer codes of the levels and not the original values. Hence, it is required to convert a factor variable to character before converting it to numeric.

    a <- factor(c(5, 6, 7, 7, 5))
    a1 = as.numeric(as.character(a))
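A brief sketch contrasting the level codes with the recovered values. The last line shows a commonly used alternative that indexes the converted levels by the factor; it is included as an illustration, not as part of the original answer.

```r
a <- factor(c(5, 6, 7, 7, 5))

# Integer codes of the levels "5", "6", "7", not the original numbers
codes <- as.numeric(a)

# Going through character recovers the original values
values <- as.numeric(as.character(a))

# Alternative: convert the levels once, then index them by the factor
values_alt <- as.numeric(levels(a))[a]

print(codes)
print(values)
```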
23. How to concatenate two strings?
The paste() function is used to join two strings. A single space is the default separator between two strings.

    a = "Deepanshu"
    b = "Bhalla"
    paste(a, b)

It returns "Deepanshu Bhalla".
If you want to change the default single space separator, you can add the sep="," keyword to use a comma as the separator.

    paste(a, b, sep=",")

It returns "Deepanshu,Bhalla".

24. How to extract first 3 characters from a word?
The substr() function is used to extract substrings from a character vector. The syntax of the substr function is substr(character_vector, starting_position, end_position).

    x = "AXZ2016"
    substr(x,1,3)

Character Functions Explained

25. How to extract last name from full name?
The last name is the end string of the name. For example, Jhonson is the last name of "Dave,Jon,Jhonson".

    dt2 = read.table(text="
    var
    Sandy,Jones
    Dave,Jon,Jhonson
    ", header=TRUE)

The word() function of the stringr package is used to extract a word from a string. -1 in the second parameter denotes the last word.

    library(stringr)
    dt2$var2 = word(dt2$var, -1, sep = ",")

26. How to remove leading and trailing spaces?
The trimws() function is used to remove leading and trailing spaces.

    a = " David Banes "
    trimws(a)

It returns "David Banes".

27. How to generate random numbers between 1 and 100?
The runif() function is used to generate random numbers.

    rand = runif(100, min = 1, max = 100)
28. How to apply LEFT JOIN in R?
LEFT JOIN implies keeping all rows from the left table (data frame) together with the matching rows from the right table. In the merge() function, all.x=TRUE denotes a left join.

    df1 = data.frame(ID=c(1:5), Score=runif(5,50,100))
    df2 = data.frame(ID=c(3,5,7:9), Score2=runif(5,1,100))
    comb = merge(df1, df2, by ="ID", all.x = TRUE)

Left Join (SQL Style)

    library(sqldf)
    comb = sqldf('select df1.*, df2.* from df1 left join df2 on df1.ID = df2.ID')

Left Join with dplyr package

    library(dplyr)
    comb = left_join(df1, df2, by = "ID")

Joining and Merging with R

29. How to calculate cartesian product of two datasets?
The cartesian product implies the cross product of two tables (data frames). For example, df1 has 5 rows and df2 has 5 rows. The combined table would contain 25 rows (5*5).

    comb = merge(df1,df2,by=NULL)

CROSS JOIN (SQL Style)

    library(sqldf)
    comb2 = sqldf('select * from df1 join df2')

30. Unique rows common to both the datasets
First, create two sample data frames:

    df1 = data.frame(ID=c(1:5), Score=c(50:54))
    df2 = data.frame(ID=c(3,5,7:9), Score=c(52,60:63))

    library(dplyr)
    comb = intersect(df1,df2)

    library(sqldf)
    comb2 = sqldf('select * from df1 intersect select * from df2')
31. How to measure execution time of a program in R?
There are multiple ways to measure the running time of code.
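The original list of timing methods is missing from the text, so the following sketch shows three base R options as an assumption about what such a list would typically contain. The helper slow_sum is hypothetical and exists only to give the timers something to measure.

```r
# Hypothetical workload used only for timing
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# 1. system.time() wraps an expression and reports user, system and elapsed time
timing <- system.time(result <- slow_sum(1e5))
print(timing["elapsed"])

# 2. Sys.time() before and after the code, then take the difference
start <- Sys.time()
invisible(slow_sum(1e5))
elapsed <- difftime(Sys.time(), start, units = "secs")
print(elapsed)

# 3. proc.time() works the same way with process-level timers
ptm <- proc.time()
invisible(slow_sum(1e5))
print(proc.time() - ptm)
```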
32. Which package is generally used for fast data manipulation on large datasets?
The package data.table performs fast data manipulation on large datasets. See the comparison between dplyr and data.table.

    # Load required packages
    library(tictoc)
    library(dplyr)
    library(data.table)

    # Load data
    library(nycflights13)
    data(flights)
    df = setDT(flights)

    # Using data.table package
    tic()
    df[arr_delay > 30 & dest == "IAH",
       .(avg = mean(arr_delay), size = .N), by = carrier]
    toc()

    # Using dplyr package
    tic()
    flights %>% filter(arr_delay > 30 & dest == "IAH") %>%
      group_by(carrier) %>% summarise(avg = mean(arr_delay), size = n())
    toc()

Result: the data.table package took 0.04 seconds, whereas the dplyr package took 0.07 seconds. So, data.table is approx. 40% faster than dplyr. Since the dataset used in the example is of medium size, there is no noticeable difference between the two. As the size of the data grows, the difference in execution time gets bigger.

33. How to read large CSV file in R?
We can use the fread() function of the data.table package.

    library(data.table)
    yyy = fread("C:\\Users\\Dave\\Example.csv", header = TRUE)

We can also use the read.big.matrix() function of the bigmemory package.

34. What is the difference between the following two programs?
1. temp = data.frame(v1

35. How to remove all the objects?

    rm(list=ls())
36. What are the various sorting algorithms in R?
Major five sorting algorithms:
1. Bubble Sort
2. Selection Sort
3. Merge Sort
4. Quick Sort
5. Bucket Sort
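As an illustration of the first algorithm in the list, here is a minimal bubble sort sketch in R. It is written for teaching purposes only; the function name bubble_sort is made up here, and in practice the built-in sort() would normally be used.

```r
# Illustrative bubble sort: repeatedly swap adjacent elements that are out of order
bubble_sort <- function(v) {
  n <- length(v)
  if (n < 2) return(v)
  for (pass in seq_len(n - 1)) {
    swapped <- FALSE
    # After each pass the largest remaining value has moved to the end
    for (i in seq_len(n - pass)) {
      if (v[i] > v[i + 1]) {
        tmp <- v[i]
        v[i] <- v[i + 1]
        v[i + 1] <- tmp
        swapped <- TRUE
      }
    }
    # Stop early when a full pass makes no swap
    if (!swapped) break
  }
  v
}

print(bubble_sort(c(5, 2, 9, 1)))
```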
37. Sort data by multiple variables
Create a sample data frame:

    mydata = data.frame(score = ifelse(sign(rnorm(25))==-1,1,2), experience = sample(1:25))

Task: You need to sort the score variable in ascending order and then sort the experience variable in descending order.
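The original solution code is missing here. One way to do it in base R, shown as a sketch under that caveat, is order() with a minus sign on the variable that should be descending. The seed and the name sorted are additions for reproducibility and illustration.

```r
set.seed(1)  # assumption: added only so the example is reproducible
mydata <- data.frame(score = ifelse(sign(rnorm(25)) == -1, 1, 2),
                     experience = sample(1:25))

# Ascending by score, then descending by experience within each score
sorted <- mydata[order(mydata$score, -mydata$experience), ]
print(head(sorted))
```

The minus sign works because experience is numeric; for character columns, order() with the decreasing argument or dplyr's arrange() with desc() would be alternatives.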
38. Drop Multiple Variables
Suppose you need to remove 3 variables, x, y and z, from the data frame "mydata".
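The original code for this answer is not present in the text. A base R sketch, assuming a small made-up data frame with an extra column w, could look like this:

```r
# Hypothetical sample data; only x, y and z are named in the question
mydata <- data.frame(x = 1:3, y = 4:6, z = 7:9, w = c("a", "b", "c"))

drop_cols <- c("x", "y", "z")

# Keep every column whose name is not in drop_cols;
# drop = FALSE keeps the result a data frame even if one column remains
kept <- mydata[, !(names(mydata) %in% drop_cols), drop = FALSE]
print(names(kept))
```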
40. How to save everything in R session?

    save.image(file="dt.RData")

41. How R handles missing values?
Missing values are represented by capital NA. To create a new dataset without any missing values, you can use the code below:

    df <- na.omit(mydata)

42. How to remove duplicate values by a column?
Suppose you have data consisting of 25 records. You are asked to remove duplicates based on a column. In the example, we are eliminating duplicates by variable y.

    data = data.frame(y=sample(1:25, replace = TRUE), x=rnorm(25))
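The base R version of this answer is missing from the text. A sketch of one common base R approach, using duplicated() to keep the first row for each value of y, is shown below; the seed is an addition for reproducibility.

```r
set.seed(1)  # assumption: added only so the example is reproducible
data <- data.frame(y = sample(1:25, replace = TRUE), x = rnorm(25))

# duplicated() flags the second and later occurrences of each y value
test <- data[!duplicated(data$y), ]
print(nrow(test))
```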
dplyr Method

    library(dplyr)
    test1 = distinct(data, y, .keep_all = TRUE)

43. Which packages are used for transposing data with R?
The reshape2 and tidyr packages are the most popular packages for reshaping data in R.

Explanation: Transpose Data

44. Calculate number of hours, days, weeks, months and years between 2 dates
Let's set 2 dates:

    dates <- as.Date(c("2015-09-02", "2016-09-05"))
    difftime(dates[2], dates[1], units = "hours")
    difftime(dates[2], dates[1], units = "days")
    floor(difftime(dates[2], dates[1], units = "weeks"))
    floor(difftime(dates[2], dates[1], units = "days")/365)

With lubridate package

    library(lubridate)
    interval(dates[1], dates[2]) %/% hours(1)
    interval(dates[1], dates[2]) %/% days(1)
    interval(dates[1], dates[2]) %/% weeks(1)
    interval(dates[1], dates[2]) %/% months(1)
    interval(dates[1], dates[2]) %/% years(1)

A months unit is not included in the base difftime() function, so we can use the interval() function of the lubridate package.

45. How to add 3 months to a date?

    library(lubridate)
    mydate <- as.Date("2015-09-02")
    mydate + months(3)

46. Extract date and time from timestamp

    mydate <- as.POSIXlt("2015-09-27 12:02:14")
    library(lubridate)
    date(mydate)                         # Extracting date part
    format(mydate, format="%H:%M:%S")    # Extracting time part

Extracting various time periods:

    day(mydate)
    month(mydate)
    year(mydate)
    hour(mydate)
    minute(mydate)
    second(mydate)
47. What are various ways to write loop in R?
There are primarily three ways to write a loop in R:
1. For Loop
2. While Loop
3. Apply family of functions such as apply, lapply, sapply etc.

48. Difference between lapply and sapply in R
lapply returns a list when we apply a function to each element of a data structure, whereas sapply returns a vector.
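A small sketch of the difference, using a made-up list v. It also shows that sapply() simplifies to a matrix when each result has the same length greater than one.

```r
v <- list(a = 1:3, b = 4:6)

# lapply() always returns a list, one element per input element
l <- lapply(v, sum)

# sapply() simplifies the same result to a named vector
s <- sapply(v, sum)

# When every result has length 2, sapply() simplifies to a matrix
r <- sapply(v, range)

print(l)
print(s)
print(r)
```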
49. Difference between sort(), rank() and order() functions?
The sort() function is used to sort a 1-dimensional vector or a single variable of data. The rank() function returns the ranking of each value. The order() function returns the indices that can be used to sort the data.

Example (the values shown are those of the original output; the exact sample drawn depends on the R version):

    set.seed(1234)
    x = sample(1:50, 10)
    x
    [1]  6 31 30 48 40 29  1 10 28 22

    sort(x)
    [1]  1  6 10 22 28 29 30 31 40 48

It sorts the data in ascending order.

    rank(x)
    [1]  2  8  7 10  9  6  1  3  5  4

2 implies the number in the first position is the second lowest and 8 implies the number in the second position is the eighth lowest.

    order(x)
    [1]  7  1  8 10  9  6  3  2  5  4

7 implies the 7th value of x is the smallest value, so 7 is the first element of order(x), and 1 means the first value of x is the second smallest.

If you run x[order(x)], it gives the same result as the sort() function. The difference between these two functions lies in two or more dimensions of data (two or more columns). In other words, the sort() function cannot be used for more than 1 dimension, whereas indexing with order() can be used.

50. Extracting Numeric Variables

    cols <- sapply(mydata, is.numeric)
    abc = mydata[,cols]
Data Science with R Interview Questions
The list below contains the most frequently asked interview questions for the role of data scientist. Most roles related to data science or predictive modeling require a candidate to be well conversant with R and to know how to develop and validate predictive models with R.

51. Which function is used for building linear regression model?
The lm() function is used for fitting a linear regression model.

52. How to add interaction in the linear regression model?
An interaction can be created using the colon sign (:). For example, x1 and x2 are two predictors (independent variables). The interaction between the variables can be formed like x1:x2. See the example below:

    linreg1 <- lm(y ~ x1 + x2 + x1:x2, data=mydata)

The above code is equivalent to the following code:

    linreg1 <- lm(y ~ x1*x2, data=mydata)

x1*x2 implies including both the main effects (x1 + x2) and the interaction (x1:x2).

53. How to check autocorrelation assumption for linear regression?
durbinWatsonTest() function

54. Which function is useful for developing a binary logistic regression model?
glm() function with family = "binomial"

55. How to perform stepwise variable selection in logistic regression model?
Run the step() function after building the logistic model with the glm() function.

56. How to do scoring in the logistic regression model?
Run predict(logit_model, validation_data, type = "response")
57. How to split data into training and validation?

    dt = sort(sample(nrow(mydata), nrow(mydata)*.7))
    train <- mydata[dt,]
    val <- mydata[-dt,]
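The split above can be tried on a made-up data frame to confirm that the two parts do not overlap. This is a sketch: the sample data, the seed and the use of round() on the training size are additions for illustration.

```r
set.seed(123)  # assumption: added only so the split is reproducible
mydata <- data.frame(id = 1:100, value = rnorm(100))

# Draw 70% of the row indices for training; the rest go to validation
n_train <- round(nrow(mydata) * 0.7)
dt <- sort(sample(nrow(mydata), n_train))
train <- mydata[dt, ]
val <- mydata[-dt, ]

print(c(train = nrow(train), validation = nrow(val)))
```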
58. How to standardize variables?

    data2 = scale(data)

59. How to validate cluster analysis?
Validate Cluster Analysis

60. Which are the popular R packages for decision tree?
rpart, party

61. What is the difference between rpart and party package for developing a decision tree model?
rpart is based on the Gini Index, which measures impurity in a node, whereas the ctree() function from the party package uses a significance test procedure in order to select variables.

62. How to check correlation with R?
cor() function

63. Have you heard of the 'relaimpo' package?
It is used to measure the relative importance of independent variables in a model.