Using LASSO from lars (or glmnet) package in R for variable selection

Sorry if this question comes across a little basic. I am looking to use LASSO variable selection for a multiple linear regression model in R. I have 15 predictors, one of which is categorical (will that cause a problem?). After setting my x and y I use the following commands:

model = lars(x, y)
coef(model)
My problem is when I use coef(model). This returns a matrix with 15 rows, with one extra predictor added each time. However there is no suggestion as to which model to choose. Have I missed something? Is there a way I can get the lars package to return just one "best" model?

There are other posts suggesting using glmnet instead, but this seems more complicated. An attempt is as follows, using the same x and y.
cv = cv.glmnet(x, y)
model = glmnet(x, y, type.gaussian = "covariance", lambda = cv$lambda.min)
predict(model, type = "coefficients")
The final command returns a list of my variables, the majority with a coefficient although some are = 0. Is this the correct choice of the "best" model selected by LASSO? If I then fit a linear model with all my variables which had coefficients not = 0, I get very similar, but slightly different, coefficient estimates. Is there a reason for this difference? Would it be acceptable to refit the linear model with these variables chosen by LASSO and take that as my final model? Otherwise I cannot see any p-values for significance. Have I missed anything?

Does type.gaussian = "covariance" ensure that glmnet uses multiple linear regression? Does the automatic normalisation of the variables affect the coefficients at all? Is there any way to include interaction terms in a LASSO procedure?

I am looking to use this procedure more as a demonstration of how LASSO can be used than for any model that will actually be used for any important inference/prediction, if that changes anything. Thank you for taking the time to read this. Any general comments on LASSO/lars/glmnet would also be greatly appreciated.

feature-selection lasso glmnet lars
edited May 9 ’13 at 10:53
asked May 8 ’13 at 23:57
James 61
As a side comment, if you want to interpret the result, be sure to demonstrate that the set of variables selected by lasso is stable. This can be done using Monte Carlo simulation or by bootstrapping your own dataset. – Frank Harrell Sep 15 ’13 at 8:43
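A minimal sketch of that bootstrap stability check (an addition, not part of the comment), assuming the x and y from the question and using glmnet; the column means at the end show how often each variable survives selection across resamples:

library(glmnet)
set.seed(1)
B <- 200                                   # number of bootstrap resamples
selected <- matrix(0, nrow = B, ncol = ncol(x),
                   dimnames = list(NULL, colnames(x)))
for (b in seq_len(B)) {
  idx <- sample(nrow(x), replace = TRUE)   # resample rows with replacement
  cvfit <- cv.glmnet(x[idx, ], y[idx])
  beta <- coef(cvfit, s = "lambda.1se")[-1, ]   # coefficients, intercept dropped
  selected[b, ] <- as.numeric(beta != 0)
}
colMeans(selected)                         # selection frequency per variable

Variables selected in nearly every resample can be interpreted with more confidence than ones that come and go.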
5 Answers
Using glmnet is really easy once you get the grasp of it, thanks to its excellent vignette at http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page). As for the best lambda for glmnet, the rule of thumb is to use

cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")

instead of lambda.min.
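As a quick illustration (an added example on the built-in mtcars data, not part of the original answer):

library(glmnet)
data(mtcars)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # drop the intercept column
y <- mtcars$mpg
set.seed(1)
cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")   # sparser, more conservative model
coef(cvfit, s = "lambda.min")   # model at the CV-error-minimizing lambda

The lambda.1se rule picks the largest lambda whose cross-validated error is within one standard error of the minimum, trading a little fit for a simpler model.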
To do the same for lars you have to do it by hand. Here is my solution:

cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))
coef(lars::lars(x, y))[idx, ]
Bear in mind that this is not exactly the same, because it stops at a lasso knot (when a variable enters) instead of at any point.

Please note that glmnet is the preferred package now; it is actively maintained, more so than lars, and there have been questions about glmnet vs lars answered before (the algorithms used differ).

As for your question of using lasso to choose variables and then fit OLS, it is an ongoing debate. Google for "OLS post Lasso" and there are some papers discussing the topic. Even the authors of Elements of Statistical Learning admit it is possible.

Edit: Here is the code to reproduce more accurately what glmnet does in lars:
cv <- lars::cv.lars(x, y, plot.it = FALSE)
ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]
obj <- lars::lars(x, y)
scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)
l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))
coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio), ]
edited Oct 28 ’14 at 1:31
answered Oct 27 ’14 at 5:34
Juancentro 213
Perhaps the comparison with forward stepwise regression will help (see the following link to a site by one of the authors: http://www-stat.stanford.edu/~tibs/lasso/simple.html). This is the approach used in Chapter 3.4.4 of The Elements of Statistical Learning (available online for free). I thought that Chapter 3.6 in that book helped to understand the relationship between least squares, best subset, and lasso (plus a couple of other procedures).

I also find it helpful to take the transpose of the coefficients, t(coef(model)), and write.csv, so that I can open it in Excel along with a copy of plot(model) on the side. You might want to sort by the last column, which contains the least squares estimate. Then you can see clearly how each variable gets added at each piecewise step and how the coefficients change as a result. Of course this is not the whole story, but hopefully it will be a start.
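A minimal sketch of that export workflow, assuming a lars fit called model as in the question (the file name is illustrative):

plot(model)                                   # coefficient paths, one line per variable
write.csv(t(coef(model)), "lars_coefs.csv")   # rows = variables, columns = steps;
                                              # the last column is the full least squares fit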
answered May 17 ’13 at 17:31
Joel Cadwell 41
I’m returning to this question from a while ago since I think I’ve found the correct solution. Here’s a replica using the mtcars dataset:

library(glmnet)
`%ni%` <- Negate(`%in%`)
data(mtcars)
x <- model.matrix(mpg ~ ., data = mtcars)
x <- x[, -1]
glmnet1 <- cv.glmnet(x = x, y = mtcars$mpg, type.measure = 'mse', nfolds = 5, alpha = .5)
c <- coef(glmnet1, s = 'lambda.min', exact = TRUE)
inds <- which(c != 0)
variables <- row.names(c)[inds]
variables <- variables[variables %ni% '(Intercept)']

'variables' gives you the list of the variables in the best solution.
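A possible follow-up (an added sketch, not part of the original answer) is to refit OLS on the selected variables, which also yields the p-values the question asked about. This works directly here because mtcars is all numeric, so the model.matrix column names match the data frame's columns:

fit_ols <- lm(reformulate(variables, response = "mpg"), data = mtcars)
summary(fit_ols)   # coefficient estimates with standard errors and p-values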
edited Jun 30 ’15 at 18:47
answered Apr 1 ’14 at 19:10
Jason 141
I was looking at the code and found that "testing" was not defined yet, so the line final.list<-testing[-removed] #removing variables gives the error: object not found. Looking at the code, I suppose that instead of "testing" it should be "cp.list", so that the code becomes: final.list<-cp.list[-removed] #removing variables and final.list<-c(final.list,duplicates) #adding in those vars which were both removed then added later. Let me know if this is correct. Kind regards – user55101 Sep 2 ’14 at 14:27
`%ni%` <- Negate(`%ni%`) ## looks wrong, while `%ni%` <- Negate(`%in%`) ## looks right. I think the stackexchange formatter messed it up... – Chris Jun 30 ’15 at 4:17
LARS solves the ENTIRE solution path. The solution path is piecewise linear: there are a finite number of "knot" points (i.e., values of the regularization parameter) at which the solution changes. So the matrix of solutions you’re getting is all the possible solutions. In the list that it returns, it should also give you the values of the regularization parameter.
answered May 9 ’13 at 10:19
Adam 21
Thank you for your answer. Is there a way to display the values of the regularisation parameter? Additionally is there a way to then choose between the solutions based on this parameter? (Also is the parameter lambda?) – James May 9 ’13 at 10:31
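A minimal sketch of how to see those values, assuming the x and y from the question; lars() stores the penalty value at each knot in the lambda component, and summary() tabulates Df, RSS and Cp per step, which can guide the choice between solutions:

library(lars)
fit <- lars(x, y, type = "lasso")
fit$lambda     # regularization parameter at each knot of the path
summary(fit)   # Df, RSS and Cp for each step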
lars and glmnet operate on raw matrices. To include interaction terms, you will have to construct the matrices yourself. That means one column per interaction (which is per level per factor if you have factors). Look into lm() to see how it does it (warning: there be dragons).
To make an interaction term manually, you could (but maybe shouldn't, because it's slow) do:

int = D["x1"] * D["x2"]
names(int) = c("x1*x2")
D = cbind(D, int)

Then to use this in lars (assuming you have a y kicking around):

lars(as.matrix(D), as.matrix(y))
I wish I could help you more with the other questions. I found this one because lars is giving me grief and the documentation in it and on the web is very thin.
edited Mar 24 ’14 at 17:00
answered Mar 24 ’14 at 12:21
kousu 25
"Warning: there be dragons" This is pretty easy with model.matrix() . – Gregor Nov 25 ’14 at 23:52