library(car)
Lecture modified from Chapter 24 in ‘R for Statistical Learning’, by David Dalpiaz
We will use the Hitters dataset from the ISLR package to explore two shrinkage methods: ridge regression and lasso. These are otherwise known as penalized regression methods.
data(Hitters, package = "ISLR")
This dataset has some missing data in the response, Salary. We use the na.omit() function to clean the dataset.
sum(is.na(Hitters))
[1] 59
sum(is.na(Hitters$Salary))
[1] 59
Hitters = na.omit(Hitters)
sum(is.na(Hitters))
[1] 0
The predictor variables are offensive and defensive statistics for a number of baseball players.
names(Hitters)
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
Let’s look at just a few of the variables:
library(GGally)
ggpairs(dplyr::select(Hitters, Salary, Hits, CHits, HmRun, CHmRun, Runs, CRuns))
We use the glmnet() and cv.glmnet() functions from the glmnet package to fit penalized regressions.
library(glmnet)
Unfortunately, the glmnet() function does not allow the use of model formulas, so we set up the data for ease of use with glmnet. Eventually we will use train() from caret, which does allow fitting penalized regression with the formula syntax, but to explore some of the details we first work with the functions from glmnet directly.
X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary
First, we fit an ordinary linear regression and note the size of the coefficients: the sum of their absolute values and the sum of their squares, the two quantities that the penalties below are built from.
fit = lm(Salary ~ ., Hitters)
coef(fit)
(Intercept) AtBat Hits HmRun Runs RBI
163.1036 -1.9799 7.5008 4.3309 -2.3762 -1.0450
Walks Years CAtBat CHits CHmRun CRuns
6.2313 -3.4891 -0.1713 0.1340 -0.1729 1.4543
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.8077 -0.8116 62.5994 -116.8492 0.2819 0.3711
Errors NewLeagueN
-3.3608 -24.7623
sum(abs(coef(fit)[-1]))
[1] 238.7
sum(coef(fit)[-1] ^ 2)
[1] 18337
vif(fit)
AtBat Hits HmRun Runs RBI Walks Years CAtBat
22.944 30.281 7.759 15.246 11.922 4.149 9.313 251.561
CHits CHmRun CRuns CRBI CWalks League Division PutOuts
502.954 46.488 162.521 131.966 19.744 4.134 1.075 1.236
Assists Errors NewLeague
2.709 2.215 4.099
Note that we have a lot of coefficients with large values, and the VIFs indicate a high degree of collinearity.
The sum of the absolute values of the coefficients is 238.7, and the sum of the squared coefficients is 18,337. These two sums are the quantities penalized by the lasso and by ridge regression, respectively, so we should see them decrease when we use those methods.
We first illustrate ridge regression, which can be fit using glmnet() with alpha = 0 and seeks to minimize
\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]
Notice that the intercept is not penalized. Also, note that ridge regression, unlike ordinary unpenalized regression, is not scale invariant. Thankfully, glmnet() takes care of this internally: it automatically standardizes the predictors for fitting, then reports the fitted coefficients on the original scale.
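This behavior is controlled by the standardize argument of glmnet(), which defaults to TRUE. A small sketch (output omitted) showing that turning it off changes the fitted coefficients; s = 10 is just an arbitrary \(\lambda\) at which to compare:
fit_std   = glmnet(X, y, alpha = 0)                      # predictors standardized internally (the default)
fit_nostd = glmnet(X, y, alpha = 0, standardize = FALSE) # penalize on the raw scale instead
cbind(coef(fit_std, s = 10), coef(fit_nostd, s = 10))    # coefficients differ at the same lambda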
The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice none of the coefficients are forced to be zero.
library(glmnet)
par(mfrow = c(1, 2))
fit_ridge = glmnet(X, y, alpha = 0)
plot(fit_ridge, xvar = "lambda", label = TRUE)
plot(fit_ridge)
We use cross-validation to select a good \(\lambda\) value. The cv.glmnet() function uses 10 folds by default. The plot illustrates the cross-validated MSE for the \(\lambda\) values considered. Two vertical lines are drawn: one at the \(\lambda\) that gives the smallest MSE, and one at the largest \(\lambda\) whose MSE is within one standard error of the smallest.
fit_ridge_cv = cv.glmnet(X, y, alpha = 0)
plot(fit_ridge_cv)
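The two \(\lambda\) values marked by the vertical lines are stored in the fitted object (the exact values change from run to run because the folds are chosen at random):
fit_ridge_cv$lambda.min  # lambda minimizing the CV error
fit_ridge_cv$lambda.1se  # largest lambda within one SE of the minimum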
The cv.glmnet() function returns several details of the fit for both \(\lambda\) values in the plot. Notice the penalty terms are smaller than for the full linear regression, as we would expect.
# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_ridge_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 213.066444
AtBat 0.090096
Hits 0.371253
HmRun 1.180127
Runs 0.596298
RBI 0.594502
Walks 0.772525
Years 2.473494
CAtBat 0.007598
CHits 0.029272
CHmRun 0.217336
CRuns 0.058705
CRBI 0.060722
CWalks 0.058699
LeagueN 3.276568
DivisionW -21.889943
PutOuts 0.052667
Assists 0.007464
Errors -0.145121
NewLeagueN 2.972759
# fitted coefficients, using minimum lambda
coef(fit_ridge_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 76.4582387
AtBat -0.6315180
Hits 2.6421596
HmRun -1.3882331
Runs 1.0457295
RBI 0.7315713
Walks 3.2780010
Years -8.7237336
CAtBat 0.0001256
CHits 0.1318975
CHmRun 0.6895578
CRuns 0.2830055
CRBI 0.2514905
CWalks -0.2599851
LeagueN 52.3371962
DivisionW -122.4169978
PutOuts 0.2623667
Assists 0.1629044
Errors -3.6440019
NewLeagueN -17.0259792
# penalty term using minimum lambda
sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2)
[1] 18127
# fitted coefficients, using 1-SE rule lambda
coef(fit_ridge_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 213.066444
AtBat 0.090096
Hits 0.371253
HmRun 1.180127
Runs 0.596298
RBI 0.594502
Walks 0.772525
Years 2.473494
CAtBat 0.007598
CHits 0.029272
CHmRun 0.217336
CRuns 0.058705
CRBI 0.060722
CWalks 0.058699
LeagueN 3.276568
DivisionW -21.889943
PutOuts 0.052667
Assists 0.007464
Errors -0.145121
NewLeagueN 2.972759
# penalty term using 1-SE rule lambda
sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2)
[1] 507.8
# predict using minimum lambda
predict(fit_ridge_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_ridge_cv, X)
# calculate "train error"
mean((y - predict(fit_ridge_cv, X)) ^ 2)
[1] 132356
# CV-RMSE using minimum lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min])
[1] 334
# CV-RMSE using 1-SE rule lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se])
[1] 368.7
We now illustrate the lasso, which can be fit using glmnet() with alpha = 1 and seeks to minimize
\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| . \]
Like ridge, lasso is not scale invariant.
The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice some of the coefficients are forced to be zero.
par(mfrow = c(1, 2))
fit_lasso = glmnet(X, y, alpha = 1)
plot(fit_lasso, xvar = "lambda", label = TRUE)
plot(fit_lasso)
Again, to actually pick a \(\lambda\), we will use cross-validation. The plot is similar to the ridge plot. Notice that the numbers along the top give the number of nonzero coefficients in the model, which now changes along the path (it did not for ridge).
fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)
cv.glmnet() returns several details of the fit for both \(\lambda\) values in the plot. Notice the coefficients are again much smaller than in the full linear regression, as we would expect, and some are now exactly 0.
# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_lasso_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 144.37970
AtBat .
Hits 1.36380
HmRun .
Runs .
RBI .
Walks 1.49731
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.15275
CRBI 0.32834
CWalks .
LeagueN .
DivisionW .
PutOuts 0.06626
Assists .
Errors .
NewLeagueN .
# fitted coefficients, using minimum lambda
coef(fit_lasso_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 129.4156
AtBat -1.6130
Hits 5.8059
HmRun .
Runs .
RBI .
Walks 4.8469
Years -9.9724
CAtBat .
CHits .
CHmRun 0.5375
CRuns 0.6812
CRBI 0.3904
CWalks -0.5560
LeagueN 32.4646
DivisionW -119.3481
PutOuts 0.2742
Assists 0.1856
Errors -2.1651
NewLeagueN .
# sum of squared coefficients using minimum lambda (for comparison with ridge)
sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2)
[1] 15463
# fitted coefficients, using 1-SE rule lambda
coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 144.37970
AtBat .
Hits 1.36380
HmRun .
Runs .
RBI .
Walks 1.49731
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.15275
CRBI 0.32834
CWalks .
LeagueN .
DivisionW .
PutOuts 0.06626
Assists .
Errors .
NewLeagueN .
# sum of squared coefficients using 1-SE rule lambda (for comparison with ridge)
sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2)
[1] 4.237
# predict using minimum lambda
predict(fit_lasso_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_lasso_cv, X)
# calculate "train error"
mean((y - predict(fit_lasso_cv, X)) ^ 2)
[1] 121291
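For comparison with the ridge fit, the same cross-validated RMSE values can be extracted from the lasso fit (mirroring the ridge code above; the exact numbers depend on the random folds, so output is omitted):
# CV-RMSE using minimum lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min])
# CV-RMSE using 1-SE rule lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.1se])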
broom
Sometimes, the output from glmnet() can be overwhelming. The broom package can help with that.
library(broom)
# the output from the commented line would be immense
# fit_lasso_cv
tidy(fit_lasso_cv)
lambda estimate std.error conf.high conf.low nzero
1 255.2821 204878 24844 229722 180034 0
2 232.6035 196011 24615 220626 171396 1
3 211.9397 186655 23805 210460 162849 2
4 193.1115 178815 23087 201902 155728 2
5 175.9560 171624 22593 194218 149031 3
6 160.3246 164416 22349 186764 142067 4
7 146.0818 157443 21874 179317 135569 4
8 133.1043 151266 21386 172651 129880 4
9 121.2797 146237 20981 167217 125256 4
10 110.5055 142146 20678 162825 121468 4
11 100.6885 138784 20493 159277 118291 5
12 91.7436 135832 20401 156233 115431 5
13 83.5934 133379 20364 153743 113015 5
14 76.1672 131379 20357 151735 111022 5
15 69.4007 129477 20400 149877 109077 6
16 63.2353 127688 20467 148155 107221 6
17 57.6177 125985 20403 146388 105581 6
18 52.4991 124195 20170 144365 104025 6
19 47.8352 122678 19969 142646 102709 6
20 43.5857 121416 19809 141224 101607 6
21 39.7136 120368 19682 140050 100687 6
22 36.1856 119496 19582 139078 99914 6
23 32.9710 118772 19503 138275 99269 6
24 30.0419 118171 19440 137611 98730 6
25 27.3731 117670 19392 137062 98278 6
26 24.9413 117266 19353 136618 97913 6
27 22.7256 116991 19327 136319 97664 6
28 20.7067 116790 19303 136093 97486 6
29 18.8672 116642 19286 135927 97356 6
30 17.1911 116544 19272 135816 97272 7
31 15.6639 116500 19272 135772 97227 7
32 14.2723 116446 19277 135723 97169 7
33 13.0044 116409 19288 135696 97121 9
34 11.8491 116392 19296 135688 97095 9
35 10.7965 116445 19291 135736 97154 9
36 9.8374 116830 19280 136110 97550 9
37 8.9634 117368 19251 136619 98117 9
38 8.1672 117774 19195 136969 98580 11
39 7.4416 117864 19168 137032 98696 11
40 6.7805 117642 19149 136791 98493 12
41 6.1782 117236 19144 136380 98091 12
42 5.6293 116725 19048 135773 97676 13
43 5.1292 116282 18971 135254 97311 13
44 4.6735 115998 18899 134897 97099 13
45 4.2584 115713 18820 134533 96893 13
46 3.8801 115425 18743 134168 96682 13
47 3.5354 115185 18682 133867 96503 13
48 3.2213 114975 18638 133613 96337 13
49 2.9351 114787 18568 133354 96219 13
50 2.6744 114655 18463 133118 96193 13
51 2.4368 114592 18367 132959 96226 13
52 2.2203 114642 18281 132923 96361 14
53 2.0231 114741 18198 132939 96543 15
54 1.8433 114865 18119 132984 96745 15
55 1.6796 115034 18017 133051 97017 17
56 1.5304 115247 17951 133198 97296 17
57 1.3944 115452 17901 133353 97551 17
58 1.2705 115629 17857 133485 97772 17
59 1.1577 115849 17830 133680 98019 17
60 1.0548 116050 17799 133849 98251 17
61 0.9611 116216 17762 133978 98454 17
62 0.8757 116367 17727 134094 98640 17
63 0.7979 116510 17696 134207 98814 17
64 0.7271 116654 17664 134319 98990 17
65 0.6625 116795 17636 134431 99159 18
66 0.6036 116944 17615 134558 99329 18
67 0.5500 117071 17600 134671 99470 18
68 0.5011 117198 17587 134785 99611 17
69 0.4566 117326 17576 134902 99750 18
70 0.4160 117427 17568 134995 99860 18
71 0.3791 117536 17557 135094 99979 18
72 0.3454 117613 17550 135163 100062 18
# the two lambda values of interest
glance(fit_lasso_cv)
lambda.min lambda.1se
1 2.437 76.17
Aside from simply shrinking coefficients (ridge) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the \(p > n\) case.
set.seed(1234)
n = 1000
p = 5500
X = replicate(p, rnorm(n = n))
beta = c(1, 1, 1, rep(0, 5497))
z = X %*% beta
prob = exp(z) / (1 + exp(z))
y = as.factor(rbinom(length(z), size = 1, prob = prob))
We have simulated a classification example where \(p > n\). An ordinary (unpenalized) logistic regression fails here:
# glm(y ~ X, family = "binomial")
# will not converge
We then use a lasso penalty to fit penalized logistic regression. This minimizes
\[ \sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j| \]
where \(L\) is the appropriate negative log-likelihood.
library(glmnet)
fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1)
plot(fit_cv)
head(coef(fit_cv), n = 10)
10 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.02397
V1 0.59675
V2 0.56252
V3 0.60065
V4 .
V5 .
V6 .
V7 .
V8 .
V9 .
fit_cv$nzero
s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19
0 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39
3 3 3 3 3 3 3 3 3 3 4 6 7 10 18 24 35 54 65 75
s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 s54 s55 s56 s57 s58 s59
86 100 110 129 147 168 187 202 221 241 254 269 283 298 310 324 333 350 364 375
s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71 s72 s73 s74 s75 s76 s77 s78 s79
387 400 411 429 435 445 453 455 462 466 475 481 487 491 496 498 502 504 512 518
s80 s81 s82 s83 s84 s85 s86 s87 s88 s89 s90 s91 s92 s93 s94 s95 s96 s97 s98
523 526 528 536 543 550 559 561 563 566 570 571 576 582 586 590 596 596 600
Notice that only the first three predictors we generated are truly related to the response, and that is exactly what the suggested model finds.
fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se)
which(as.vector(as.matrix(fit_1se$beta)) != 0)
[1] 1 2 3
We can also see in the following plots that the three relevant features enter the model well ahead of the irrelevant features.
par(mfrow = c(1, 2))
plot(glmnet(X, y, family = "binomial"))
plot(glmnet(X, y, family = "binomial"), xvar = "lambda")
We can extract the two relevant \(\lambda\) values.
fit_cv$lambda.min
[1] 0.03087
fit_cv$lambda.1se
[1] 0.0515
Since cv.glmnet() does not calculate prediction accuracy for classification, we take the \(\lambda\) values and create a grid for caret to search in order to obtain prediction accuracy with train(). We set \(\alpha = 1\) in this grid, as glmnet can actually tune over the \(\alpha\) parameter. (More on that later.)
Note that we have to force y to be a factor so that train() recognizes we want a binomial response. The train() function in caret uses the type of the variable in y to determine whether to use family = "binomial" or family = "gaussian".
library(caret)
cv_5 = trainControl(method = "cv", number = 5)
lasso_grid = expand.grid(alpha = 1,
lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se))
lasso_grid
alpha lambda
1 1 0.03087
2 1 0.05150
sim_data = data.frame(y, X)
fit_lasso = train(
y ~ ., data = sim_data,
method = "glmnet",
trControl = cv_5,
tuneGrid = lasso_grid
)
fit_lasso$results
alpha lambda Accuracy Kappa AccuracySD KappaSD
1 1 0.03087 0.7679 0.5358 0.03430 0.06845
2 1 0.05150 0.7689 0.5378 0.02807 0.05596
The interaction between the glmnet and caret packages is sometimes frustrating, but for obtaining results for particular values of \(\lambda\), we see it can be easily used.
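As promised above, \(\alpha\) itself can also be tuned by letting caret build its own grid. A minimal sketch, reusing cv_5 and sim_data from above (fit_enet and tuneLength = 5 are illustrative choices; results will vary with the random folds):
# let caret build its own grid over both alpha and lambda
fit_enet = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneLength = 5
)
fit_enet$bestTune  # the (alpha, lambda) pair with the best CV accuracy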
Here we analyze data from a microarray study of Singh et al. (2002). This data set contains measurements of the gene expression of 6033 genes for 102 observations: 52 prostate cancer patients and 50 healthy men. The goal is to see which genes were most associated with the cancer diagnosis.
library(sda)
data("singh2002")
dim(singh2002$y) # the response is a factor, so dim() returns NULL
NULL
dim(singh2002$x)
[1] 102 6033
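Since singh2002$y is a factor rather than a matrix, dim() returns NULL above; length() and table() give the observation counts (output omitted):
length(singh2002$y)  # number of observations
table(singh2002$y)   # how many cancer vs. healthy subjects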
fit <- glmnet(singh2002$x, singh2002$y, family = "binomial", pmax = 20, alpha = 1)
Warning: from glmnet Fortran code (error code -10015); Number of nonzero
coefficients along the path exceeds pmax=20 at 15th lambda value; solutions for
larger lambdas returned
plot(fit, label = TRUE, xvar = "lambda")
# alpha just below 1 gives a nearly-lasso elastic net, which tends to be more stable when many predictors are highly correlated
fit_lasso_cv = cv.glmnet(singh2002$x, singh2002$y, family = "binomial", nfolds = 5, alpha = 0.99, type.measure = "class")
coefs <- coef(fit_lasso_cv)
summary(coefs)
6034 x 1 sparse Matrix of class "dgCMatrix", with 44 entries
i j x
1 1 1 -0.493161
2 299 1 0.027846
3 333 1 -0.307869
4 365 1 0.040395
5 378 1 -0.053895
6 453 1 -0.110068
7 580 1 -0.168770
8 611 1 -0.373868
9 703 1 0.009255
10 722 1 -0.015974
11 740 1 0.001608
12 806 1 0.023213
13 915 1 -0.207116
14 922 1 0.154351
15 1069 1 -0.250180
16 1078 1 -0.140912
17 1090 1 -0.207007
18 1114 1 -0.064859
19 1558 1 -0.030528
20 1590 1 0.092322
21 1721 1 -0.414631
22 1967 1 -0.032548
23 2392 1 0.003964
24 2853 1 -0.038024
25 2869 1 -0.060466
26 3018 1 0.101644
27 3188 1 0.001800
28 3283 1 -0.015787
29 3293 1 0.020324
30 3376 1 -0.120177
31 3648 1 -0.121972
32 3697 1 0.003471
33 3880 1 -0.021968
34 3941 1 0.096150
35 4074 1 0.056335
36 4089 1 0.037156
37 4105 1 0.025749
38 4155 1 0.048707
39 4317 1 0.090442
40 4332 1 0.158554
41 4397 1 0.088418
42 4519 1 -0.235190
43 4547 1 0.064363
44 4982 1 -0.089703
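Before moving on, a quick (and optimistic, since it reuses the training data) sanity check of how well the selected model classifies the observations it was fit to; pred_class here is just an illustrative name:
# training-set classification accuracy at the 1-SE rule lambda (the default)
pred_class = predict(fit_lasso_cv, singh2002$x, type = "class")
mean(pred_class == singh2002$y)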
Let’s look more carefully at the results of the LASSO regression for the Hitters data. The plotres() function from the plotmo package gives some useful diagnostic graphs.
library(plotmo)
X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary
fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)
plotres(fit_lasso_cv)
I see some definite heteroscedasticity, which is not surprising for a variable like Salary; income data is often very right-skewed and tends to have non-constant variance. Let’s check to be sure:
hist(Hitters$Salary)
Yup! Let’s try a log transformation to make the response more normal, and see if this fixes the heteroscedasticity.
hist(log(Hitters$Salary))
Looks better already. Let’s do the fit…
logHitters <- dplyr::mutate(Hitters, logSalary = log(Salary))
#Xlog = model.matrix(logSalary ~ . - Salary, logHitters)[, -1]
ylog = logHitters$logSalary
log_fit_lasso_cv = cv.glmnet(X, ylog, alpha = 1)
plotres(log_fit_lasso_cv)
Did using the log transformation change the predictors found by the LASSO fit? Let’s see…
coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 193.74264
AtBat .
Hits 1.21471
HmRun .
Runs .
RBI .
Walks 1.28958
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.12924
CRBI 0.31516
CWalks .
LeagueN .
DivisionW .
PutOuts 0.02533
Assists .
Errors .
NewLeagueN .
coef(log_fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.9970914
AtBat .
Hits 0.0033360
HmRun .
Runs .
RBI 0.0003002
Walks 0.0029740
Years 0.0080495
CAtBat .
CHits 0.0003775
CHmRun .
CRuns 0.0002244
CRBI 0.0000610
CWalks .
LeagueN .
DivisionW .
PutOuts .
Assists .
Errors .
NewLeagueN .
Yes, the chosen coefficients aren’t exactly the same.
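To make the comparison explicit, here is one small sketch (the helper selected() is just for illustration) that lists which predictors each fit keeps at the 1-SE rule \(\lambda\); the selected sets can shift a little between runs because the folds are random:
# which predictors have nonzero coefficients at lambda.1se?
selected = function(cvfit) {
  cf = coef(cvfit, s = "lambda.1se")
  rownames(cf)[as.vector(cf) != 0][-1]  # drop the intercept
}
selected(fit_lasso_cv)      # predictors chosen for Salary
selected(log_fit_lasso_cv)  # predictors chosen for log(Salary)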
Now I’m concerned about outliers:
plotres(log_fit_lasso_cv,which=3)
What is going on with Mike Schmidt? He is observation #173. Let’s look at a few observations around him:
Hitters[173:175,c(1:8,19)] # Just show a few predictors for space reasons
AtBat Hits HmRun Runs RBI Walks Years CAtBat Salary
-Mike Schmidt 20 1 0 0 0 0 2 41 2127
-Mike Scioscia 374 94 5 36 26 62 7 1968 875
-Mickey Tettleton 211 43 10 26 35 39 3 498 120
He’s had only 1 hit, 0 home runs, 0 runs, and has only been in the league 2 years, yet his salary seems very high. Let’s confirm that:
hist(Hitters$Salary)
He is almost the highest-paid player, according to this data. What is going on? Let’s Google Mike Schmidt.
Mike Schmidt played with the Phillies starting in 1972, so he’d been playing for 14 years by 1986. According to Wikipedia:
“Schmidt was a twelve-time All-Star and a three-time winner of the National League (NL) Most Valuable Player award (MVP), and he was known for his combination of power hitting and strong defense: as a hitter, he compiled 548 home runs and 1,595 runs batted in (RBIs), and led the NL in home runs eight times and in RBIs four times.”
This all explains why Schmidt was paid so much, but not why he’s listed as having 0 runs and 0 home runs. I’m going to conclude that there’s something wrong with this data, and run the fit without Schmidt.
#Xsans = model.matrix(logSalary ~ . - Salary, logHitters[-173,])[, -1]
ysans = logHitters[-173,]$logSalary
log_fit_lasso_cv_sans = cv.glmnet(X[-173,], ysans, alpha = 1)
plotres(log_fit_lasso_cv_sans)
Terry Kennedy seems to be in a similar situation, so we remove his observation as well.
#Xsans = model.matrix(logSalary ~ . - Salary, logHitters[-173,])[, -1]
ysans2 = logHitters[-c(173,241),]$logSalary
log_fit_lasso_cv_sans2 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 1)
plotres(log_fit_lasso_cv_sans2)
Steve Sax is a bit of an outlier here, but I don’t see anything wrong with his data. Also, I doubt he is very influential, since he’s near the middle of the fitted value chart.
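If we did want to inspect his row, we could look it up by name (a quick sketch; note the row labels in Hitters carry a leading dash, as seen above):
Hitters[grep("Steve Sax", rownames(Hitters)), c(1:8, 19)]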
Let’s look at the final coefficients:
coef(log_fit_lasso_cv_sans2)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.60250178
AtBat .
Hits 0.00509315
HmRun .
Runs .
RBI 0.00055799
Walks 0.00463658
Years 0.03839075
CAtBat .
CHits 0.00031499
CHmRun .
CRuns 0.00009165
CRBI .
CWalks .
LeagueN .
DivisionW -0.00915693
PutOuts .
Assists .
Errors .
NewLeagueN .
Would we get similar values if we did a non-regularized fit with the same predictors? (Plus, an advantage of lm() is that we get nicer diagnostic plots by default:)
linfit2 <- lm(log(Salary) ~ Hits + RBI + Walks + Years + CHits + CRuns + Division, data = Hitters[-c(173, 241), ])
summary(linfit2)
Call:
lm(formula = log(Salary) ~ Hits + RBI + Walks + Years + CHits +
CRuns + Division, data = Hitters[-c(173, 241), ])
Residuals:
Min 1Q Median 3Q Max
-2.064 -0.446 0.068 0.416 1.335
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.276127 0.142734 29.96 < 2e-16 ***
Hits 0.006435 0.001464 4.40 0.000016 ***
RBI 0.001019 0.002317 0.44 0.66058
Walks 0.007025 0.002321 3.03 0.00273 **
Years 0.067816 0.019197 3.53 0.00049 ***
CHits 0.000399 0.000386 1.03 0.30206
CRuns -0.000288 0.000696 -0.41 0.67941
DivisionW -0.169181 0.072177 -2.34 0.01985 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.574 on 253 degrees of freedom
Multiple R-squared: 0.589, Adjusted R-squared: 0.578
F-statistic: 51.9 on 7 and 253 DF, p-value: <2e-16
plot(linfit2)
Are the coefficients from the linear fit and those found by the LASSO broadly similar?
coef(linfit2)
(Intercept) Hits RBI Walks Years CHits
4.2761274 0.0064352 0.0010187 0.0070254 0.0678155 0.0003987
CRuns DivisionW
-0.0002878 -0.1691809
lassocoefs<-as.matrix(coef(log_fit_lasso_cv_sans2))
lassocoefs[lassocoefs!=0]
[1] 4.60250178 0.00509315 0.00055799 0.00463658 0.03839075 0.00031499
[7] 0.00009165 -0.00915693
plot(coef(linfit2)[2:8] ~ lassocoefs[lassocoefs != 0][2:8]) # indices 2:8 skip the intercept; both vectors hold the same 7 predictors in the same order
Lastly, let’s try an elastic net fit (with \(\alpha = 0.5\)) to see if it will bring in both the career totals and the 1986-season numbers for RBI and Walks.
log_fit_lasso_cv_sans3 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 0.5)
plotres(log_fit_lasso_cv_sans3)
coef(log_fit_lasso_cv_sans3)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.78008050
AtBat .
Hits 0.00337053
HmRun .
Runs 0.00160678
RBI 0.00104338
Walks 0.00338272
Years 0.02248827
CAtBat 0.00003187
CHits 0.00015698
CHmRun .
CRuns 0.00022782
CRBI 0.00012046
CWalks .
LeagueN .
DivisionW .
PutOuts .
Assists .
Errors .
NewLeagueN .
glmnet Web Vignette - details from the package developers.