All lectures

library(car) # for vif(), used below to check collinearity

Regularization

Lecture 1

Lecture modified from Chapter 24 in ‘R for Statistical Learning’, by David Dalpiaz

Setting up the data

We will use the Hitters dataset from the ISLR package to explore two shrinkage methods: ridge regression and lasso. These are otherwise known as penalized regression methods.

data(Hitters, package = "ISLR")

This dataset has some missing data in the response, Salary. We use the na.omit() function to remove the rows with missing values.

sum(is.na(Hitters))
[1] 59
sum(is.na(Hitters$Salary))
[1] 59
Hitters = na.omit(Hitters)
sum(is.na(Hitters))
[1] 0

The predictor variables are offensive and defensive statistics for a number of baseball players.

names(Hitters)
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
 [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
[13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
[19] "Salary"    "NewLeague"

Let’s look at just a few of the variables:

library(GGally)
ggpairs(dplyr::select(Hitters, Salary, Hits, CHits, HmRun, CHmRun, Runs, CRuns))

We use the glmnet() and cv.glmnet() functions from the glmnet package to fit penalized regressions.

library(glmnet)

Unfortunately, the glmnet() function does not accept model formulas, so we set up the data for use with glmnet. Eventually we will use train() from caret, which does allow fitting penalized regressions with the formula syntax, but to explore some of the details, we first work with the functions from glmnet directly.

X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary

Linear regression

First, we fit an ordinary linear regression and note the sum of the absolute values of the coefficients and the sum of the squared coefficients. (These are the two penalty terms we will use.)

fit = lm(Salary ~ ., Hitters)
coef(fit)
(Intercept)       AtBat        Hits       HmRun        Runs         RBI 
   163.1036     -1.9799      7.5008      4.3309     -2.3762     -1.0450 
      Walks       Years      CAtBat       CHits      CHmRun       CRuns 
     6.2313     -3.4891     -0.1713      0.1340     -0.1729      1.4543 
       CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
     0.8077     -0.8116     62.5994   -116.8492      0.2819      0.3711 
     Errors  NewLeagueN 
    -3.3608    -24.7623 
sum(abs(coef(fit)[-1]))
[1] 238.7
sum(coef(fit)[-1] ^ 2)
[1] 18337
vif(fit)
    AtBat      Hits     HmRun      Runs       RBI     Walks     Years    CAtBat 
   22.944    30.281     7.759    15.246    11.922     4.149     9.313   251.561 
    CHits    CHmRun     CRuns      CRBI    CWalks    League  Division   PutOuts 
  502.954    46.488   162.521   131.966    19.744     4.134     1.075     1.236 
  Assists    Errors NewLeague 
    2.709     2.215     4.099 

Note that we have a lot of coefficients with large values. The VIFs indicate that we have a high degree of collinearity.

The sum of the absolute values of the coefficients is 238.7, and the sum of the squared coefficients is 18,337. These two sums are the quantities penalized by the lasso and by ridge regression, respectively, so we should see them decrease when we use those methods.
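
To make this comparison easy to repeat for the penalized fits below, here is a small helper (a convenience function added here, not part of the original lecture; it simply packages the two sums computed above):

penalty_terms = function(beta_hat) {
  beta_hat = beta_hat[-1]            # the intercept is never penalized
  c(l1 = sum(abs(beta_hat)),         # lasso-style penalty: sum of absolute values
    l2 = sum(beta_hat ^ 2))          # ridge-style penalty: sum of squares
}
penalty_terms(coef(fit))             # l1 = 238.7, l2 = 18337, matching the sums above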

Ridge Regression

We first illustrate ridge regression, which can be fit using glmnet() with alpha = 0 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]

Notice that the intercept is not penalized. Also, note that ridge regression is not scale invariant, unlike ordinary unpenalized regression. Thankfully, glmnet() takes care of this internally: it automatically standardizes the predictors before fitting, then reports the fitted coefficients on the original scale.
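
As a quick check of this claim (a sketch, not part of the original lecture): if we rescale a predictor, say measure AtBat in hundreds of at-bats, the ridge fit should be essentially unchanged, with the AtBat coefficient simply reported on the new scale.

X_rescaled = X
X_rescaled[, "AtBat"] = X_rescaled[, "AtBat"] / 100   # AtBat in hundreds
fit_a = glmnet(X, y, alpha = 0)
fit_b = glmnet(X_rescaled, y, alpha = 0)
# fitted values along the whole lambda path agree up to numerical error
max(abs(predict(fit_a, newx = X) - predict(fit_b, newx = X_rescaled)))
# at any single lambda, only the AtBat coefficient changes (by a factor of 100)
cbind(original = coef(fit_a)[1:4, 50], rescaled = coef(fit_b)[1:4, 50])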

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice none of the coefficients are forced to be zero.

library(glmnet)
par(mfrow = c(1, 2))
fit_ridge = glmnet(X, y, alpha = 0)
plot(fit_ridge, xvar = "lambda", label = TRUE)
plot(fit_ridge)

We use cross-validation to select a good \(\lambda\) value. The cv.glmnet() function uses 10 folds by default. The plot shows the cross-validated MSE for each \(\lambda\) considered. Two vertical lines are drawn: the first marks the \(\lambda\) that gives the smallest MSE; the second marks the largest \(\lambda\) whose MSE is within one standard error of the smallest.

fit_ridge_cv = cv.glmnet(X, y, alpha = 0)
plot(fit_ridge_cv)

The cv.glmnet() function returns several details of the fit for both \(\lambda\) values in the plot. Notice the penalty terms are smaller than for the full linear regression, as we would expect.

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_ridge_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 213.066444
AtBat         0.090096
Hits          0.371253
HmRun         1.180127
Runs          0.596298
RBI           0.594502
Walks         0.772525
Years         2.473494
CAtBat        0.007598
CHits         0.029272
CHmRun        0.217336
CRuns         0.058705
CRBI          0.060722
CWalks        0.058699
LeagueN       3.276568
DivisionW   -21.889943
PutOuts       0.052667
Assists       0.007464
Errors       -0.145121
NewLeagueN    2.972759
# fitted coefficients, using minimum lambda
coef(fit_ridge_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)   76.4582387
AtBat         -0.6315180
Hits           2.6421596
HmRun         -1.3882331
Runs           1.0457295
RBI            0.7315713
Walks          3.2780010
Years         -8.7237336
CAtBat         0.0001256
CHits          0.1318975
CHmRun         0.6895578
CRuns          0.2830055
CRBI           0.2514905
CWalks        -0.2599851
LeagueN       52.3371962
DivisionW   -122.4169978
PutOuts        0.2623667
Assists        0.1629044
Errors        -3.6440019
NewLeagueN   -17.0259792
# penalty term using minimum lambda
sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2)
[1] 18127
# fitted coefficients, using 1-SE rule lambda
coef(fit_ridge_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 213.066444
AtBat         0.090096
Hits          0.371253
HmRun         1.180127
Runs          0.596298
RBI           0.594502
Walks         0.772525
Years         2.473494
CAtBat        0.007598
CHits         0.029272
CHmRun        0.217336
CRuns         0.058705
CRBI          0.060722
CWalks        0.058699
LeagueN       3.276568
DivisionW   -21.889943
PutOuts       0.052667
Assists       0.007464
Errors       -0.145121
NewLeagueN    2.972759
# penalty term using 1-SE rule lambda
sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2)
[1] 507.8
# predict using minimum lambda
predict(fit_ridge_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_ridge_cv, X)
# calculate "train error"
mean((y - predict(fit_ridge_cv, X)) ^ 2)
[1] 132356
# CV-RMSE using minimum lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min])
[1] 334
# CV-RMSE using 1-SE rule lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se]) 
[1] 368.7

Lasso

We now illustrate lasso, which can be fit using glmnet() with alpha = 1 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| . \]

Like ridge, lasso is not scale invariant.

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice some of the coefficients are forced to be zero.

par(mfrow = c(1, 2))
fit_lasso = glmnet(X, y, alpha = 1)
plot(fit_lasso, xvar = "lambda", label = TRUE)
plot(fit_lasso)

Again, to actually pick a \(\lambda\), we use cross-validation. The plot is similar to the ridge plot, except that the numbers along the top, which give the number of nonzero coefficients in the model, now change with \(\lambda\).

fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)

cv.glmnet() returns several details of the fit for both \(\lambda\) values in the plot. Notice that the sum of squared coefficients (computed below for comparison with the ridge fits) is again smaller than for the full linear regression, as we would expect, and some coefficients are now exactly 0.

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_lasso_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 144.37970
AtBat         .      
Hits          1.36380
HmRun         .      
Runs          .      
RBI           .      
Walks         1.49731
Years         .      
CAtBat        .      
CHits         .      
CHmRun        .      
CRuns         0.15275
CRBI          0.32834
CWalks        .      
LeagueN       .      
DivisionW     .      
PutOuts       0.06626
Assists       .      
Errors        .      
NewLeagueN    .      
# fitted coefficients, using minimum lambda
coef(fit_lasso_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept)  129.4156
AtBat         -1.6130
Hits           5.8059
HmRun          .     
Runs           .     
RBI            .     
Walks          4.8469
Years         -9.9724
CAtBat         .     
CHits          .     
CHmRun         0.5375
CRuns          0.6812
CRBI           0.3904
CWalks        -0.5560
LeagueN       32.4646
DivisionW   -119.3481
PutOuts        0.2742
Assists        0.1856
Errors        -2.1651
NewLeagueN     .     
# sum of squared coefficients (for comparison with the ridge fit) using minimum lambda
sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2)
[1] 15463
# fitted coefficients, using 1-SE rule lambda
coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 144.37970
AtBat         .      
Hits          1.36380
HmRun         .      
Runs          .      
RBI           .      
Walks         1.49731
Years         .      
CAtBat        .      
CHits         .      
CHmRun        .      
CRuns         0.15275
CRBI          0.32834
CWalks        .      
LeagueN       .      
DivisionW     .      
PutOuts       0.06626
Assists       .      
Errors        .      
NewLeagueN    .      
# sum of squared coefficients (for comparison with the ridge fit) using 1-SE rule lambda
sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2)
[1] 4.237
# predict using minimum lambda
predict(fit_lasso_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_lasso_cv, X)
# calculate "train error"
mean((y - predict(fit_lasso_cv, X)) ^ 2)
[1] 121291

broom

Sometimes, the output from glmnet() can be overwhelming. The broom package can help with that.

library(broom)
# the output from the commented line would be immense
# fit_lasso_cv
tidy(fit_lasso_cv)
     lambda estimate std.error conf.high conf.low nzero
1  255.2821   204878     24844    229722   180034     0
2  232.6035   196011     24615    220626   171396     1
3  211.9397   186655     23805    210460   162849     2
4  193.1115   178815     23087    201902   155728     2
5  175.9560   171624     22593    194218   149031     3
6  160.3246   164416     22349    186764   142067     4
7  146.0818   157443     21874    179317   135569     4
8  133.1043   151266     21386    172651   129880     4
9  121.2797   146237     20981    167217   125256     4
10 110.5055   142146     20678    162825   121468     4
11 100.6885   138784     20493    159277   118291     5
12  91.7436   135832     20401    156233   115431     5
13  83.5934   133379     20364    153743   113015     5
14  76.1672   131379     20357    151735   111022     5
15  69.4007   129477     20400    149877   109077     6
16  63.2353   127688     20467    148155   107221     6
17  57.6177   125985     20403    146388   105581     6
18  52.4991   124195     20170    144365   104025     6
19  47.8352   122678     19969    142646   102709     6
20  43.5857   121416     19809    141224   101607     6
21  39.7136   120368     19682    140050   100687     6
22  36.1856   119496     19582    139078    99914     6
23  32.9710   118772     19503    138275    99269     6
24  30.0419   118171     19440    137611    98730     6
25  27.3731   117670     19392    137062    98278     6
26  24.9413   117266     19353    136618    97913     6
27  22.7256   116991     19327    136319    97664     6
28  20.7067   116790     19303    136093    97486     6
29  18.8672   116642     19286    135927    97356     6
30  17.1911   116544     19272    135816    97272     7
31  15.6639   116500     19272    135772    97227     7
32  14.2723   116446     19277    135723    97169     7
33  13.0044   116409     19288    135696    97121     9
34  11.8491   116392     19296    135688    97095     9
35  10.7965   116445     19291    135736    97154     9
36   9.8374   116830     19280    136110    97550     9
37   8.9634   117368     19251    136619    98117     9
38   8.1672   117774     19195    136969    98580    11
39   7.4416   117864     19168    137032    98696    11
40   6.7805   117642     19149    136791    98493    12
41   6.1782   117236     19144    136380    98091    12
42   5.6293   116725     19048    135773    97676    13
43   5.1292   116282     18971    135254    97311    13
44   4.6735   115998     18899    134897    97099    13
45   4.2584   115713     18820    134533    96893    13
46   3.8801   115425     18743    134168    96682    13
47   3.5354   115185     18682    133867    96503    13
48   3.2213   114975     18638    133613    96337    13
49   2.9351   114787     18568    133354    96219    13
50   2.6744   114655     18463    133118    96193    13
51   2.4368   114592     18367    132959    96226    13
52   2.2203   114642     18281    132923    96361    14
53   2.0231   114741     18198    132939    96543    15
54   1.8433   114865     18119    132984    96745    15
55   1.6796   115034     18017    133051    97017    17
56   1.5304   115247     17951    133198    97296    17
57   1.3944   115452     17901    133353    97551    17
58   1.2705   115629     17857    133485    97772    17
59   1.1577   115849     17830    133680    98019    17
60   1.0548   116050     17799    133849    98251    17
61   0.9611   116216     17762    133978    98454    17
62   0.8757   116367     17727    134094    98640    17
63   0.7979   116510     17696    134207    98814    17
64   0.7271   116654     17664    134319    98990    17
65   0.6625   116795     17636    134431    99159    18
66   0.6036   116944     17615    134558    99329    18
67   0.5500   117071     17600    134671    99470    18
68   0.5011   117198     17587    134785    99611    17
69   0.4566   117326     17576    134902    99750    18
70   0.4160   117427     17568    134995    99860    18
71   0.3791   117536     17557    135094    99979    18
72   0.3454   117613     17550    135163   100062    18
# the two lambda values of interest
glance(fit_lasso_cv) 
  lambda.min lambda.1se
1      2.437      76.17

Lecture 2

Simulated Data, \(p > n\)

Aside from simply shrinking coefficients (ridge) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the \(p > n\) case.

set.seed(1234)
n = 1000
p = 5500
X = replicate(p, rnorm(n = n))
beta = c(1, 1, 1, rep(0, 5497))
z = X %*% beta
prob = exp(z) / (1 + exp(z))
y = as.factor(rbinom(length(z), size = 1, prob = prob))

Above, we simulated a classification example where \(p > n\): only the first three predictors have nonzero coefficients. An ordinary logistic regression cannot handle this setting:

# glm(y ~ X, family = "binomial")
# will not converge
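
One way to see why (a sketch added here, not part of the original lecture): with \(p > n\), the design matrix cannot have full column rank, so the unpenalized logistic-regression coefficients are not even identifiable.

# the intercept-augmented design matrix has rank at most n = 1000,
# far fewer than the p + 1 = 5501 coefficients an unpenalized fit would need
qr(cbind(1, X))$rank
ncol(X) + 1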

We then use a lasso penalty to fit penalized logistic regression. This minimizes

\[ \sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j| \]

where \(L\) is the appropriate negative log-likelihood.
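
For the binomial family used here, with \(\eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\) and \(y_i \in \{0, 1\}\), this term is (up to an overall scaling constant that glmnet applies)

\[ L(y_i, \eta_i) = -\left[ y_i \eta_i - \log\left( 1 + e^{\eta_i} \right) \right] . \]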

library(glmnet)
fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1)
plot(fit_cv)

head(coef(fit_cv), n = 10)
10 x 1 sparse Matrix of class "dgCMatrix"
                  1
(Intercept) 0.02397
V1          0.59675
V2          0.56252
V3          0.60065
V4          .      
V5          .      
V6          .      
V7          .      
V8          .      
V9          .      
fit_cv$nzero
 s0  s1  s2  s3  s4  s5  s6  s7  s8  s9 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19 
  0   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39 
  3   3   3   3   3   3   3   3   3   3   4   6   7  10  18  24  35  54  65  75 
s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 s54 s55 s56 s57 s58 s59 
 86 100 110 129 147 168 187 202 221 241 254 269 283 298 310 324 333 350 364 375 
s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71 s72 s73 s74 s75 s76 s77 s78 s79 
387 400 411 429 435 445 453 455 462 466 475 481 487 491 496 498 502 504 512 518 
s80 s81 s82 s83 s84 s85 s86 s87 s88 s89 s90 s91 s92 s93 s94 s95 s96 s97 s98 
523 526 528 536 543 550 559 561 563 566 570 571 576 582 586 590 596 596 600 

Notice that only the first three predictors were generated with nonzero coefficients, and those are exactly the predictors the suggested (1-SE) model finds.

fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se)
which(as.vector(as.matrix(fit_1se$beta)) != 0)
[1] 1 2 3

We can also see in the following plots that the three relevant features enter the model well ahead of the irrelevant ones.

par(mfrow = c(1, 2))
plot(glmnet(X, y, family = "binomial"))
plot(glmnet(X, y, family = "binomial"), xvar = "lambda")

We can extract the two relevant \(\lambda\) values.

fit_cv$lambda.min
[1] 0.03087
fit_cv$lambda.1se
[1] 0.0515

Since cv.glmnet() uses the deviance by default for classification, rather than prediction accuracy, we take the two \(\lambda\) values and create a grid for caret to search in order to obtain prediction accuracy with train(). We fix \(\alpha = 1\) in this grid, although caret’s glmnet method can also tune over \(\alpha\). (More on that later.)

Note that we have to make y a factor so that train() recognizes that we want a binomial response; train() uses the type of y to decide between family = "binomial" and family = "gaussian".

library(caret)
cv_5 = trainControl(method = "cv", number = 5)
lasso_grid = expand.grid(alpha = 1, 
                         lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se))
lasso_grid
  alpha  lambda
1     1 0.03087
2     1 0.05150
sim_data = data.frame(y, X)
fit_lasso = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneGrid = lasso_grid
)
fit_lasso$results
  alpha  lambda Accuracy  Kappa AccuracySD KappaSD
1     1 0.03087   0.7679 0.5358    0.03430 0.06845
2     1 0.05150   0.7689 0.5378    0.02807 0.05596

The interaction between the glmnet and caret packages is sometimes frustrating, but as we see here, obtaining results for particular values of \(\lambda\) is straightforward.
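
As a preview of the “more on that later” remark, here is a sketch of letting caret tune over \(\alpha\) as well as \(\lambda\). The tuneLength argument is standard caret; the exact (alpha, lambda) grid it builds for glmnet is chosen by caret, and with p = 5500 this search can be slow.

fit_enet = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneLength = 5        # caret constructs its own (alpha, lambda) grid
)
fit_enet$bestTune       # the (alpha, lambda) pair with the best CV accuracy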

Feature selection example

Here we analyze data from a microarray study of Singh et al. (2002). This data set contains expression measurements for 6033 genes on 102 observations: 52 prostate cancer patients and 50 healthy men. The goal is to see which genes are most associated with the cancer diagnosis.

library(sda)
data("singh2002")
dim(singh2002$y) # y is a factor of class labels, so dim() returns NULL
NULL
dim(singh2002$x) 
[1]  102 6033

fit <- glmnet(singh2002$x, singh2002$y, family = "binomial", pmax = 20, alpha = 1)
Warning: from glmnet Fortran code (error code -10015); Number of nonzero
coefficients along the path exceeds pmax=20 at 15th lambda value; solutions for
larger lambdas returned
plot(fit, label = TRUE, xvar = "lambda")

fit_lasso_cv = cv.glmnet(singh2002$x, singh2002$y, family = "binomial",
                         nfolds = 5, alpha = 0.99, type.measure = "class")
coefs <- coef(fit_lasso_cv)
summary(coefs)
6034 x 1 sparse Matrix of class "dgCMatrix", with 44 entries 
      i j         x
1     1 1 -0.493161
2   299 1  0.027846
3   333 1 -0.307869
4   365 1  0.040395
5   378 1 -0.053895
6   453 1 -0.110068
7   580 1 -0.168770
8   611 1 -0.373868
9   703 1  0.009255
10  722 1 -0.015974
11  740 1  0.001608
12  806 1  0.023213
13  915 1 -0.207116
14  922 1  0.154351
15 1069 1 -0.250180
16 1078 1 -0.140912
17 1090 1 -0.207007
18 1114 1 -0.064859
19 1558 1 -0.030528
20 1590 1  0.092322
21 1721 1 -0.414631
22 1967 1 -0.032548
23 2392 1  0.003964
24 2853 1 -0.038024
25 2869 1 -0.060466
26 3018 1  0.101644
27 3188 1  0.001800
28 3283 1 -0.015787
29 3293 1  0.020324
30 3376 1 -0.120177
31 3648 1 -0.121972
32 3697 1  0.003471
33 3880 1 -0.021968
34 3941 1  0.096150
35 4074 1  0.056335
36 4089 1  0.037156
37 4105 1  0.025749
38 4155 1  0.048707
39 4317 1  0.090442
40 4332 1  0.158554
41 4397 1  0.088418
42 4519 1 -0.235190
43 4547 1  0.064363
44 4982 1 -0.089703
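
The i column above indexes the coefficient vector, whose first entry is the intercept. A small sketch (not in the original lecture) of pulling out the selected genes directly:

# gene positions (columns of singh2002$x) with nonzero lasso coefficients at
# the 1-SE lambda; drop the intercept so indices line up with the genes
selected_genes = which(coefs[-1, 1] != 0)
length(selected_genes)   # 43 genes: the 44 nonzero entries above minus the intercept
head(selected_genes)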

Lecture 3

More complete analysis of Hitters data

Let’s look more carefully at the results of the LASSO regression for the Hitters data. The plotres() function from the plotmo package gives some useful diagnostic graphs.

library(plotmo)
X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary
fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)
plotres(fit_lasso_cv)

I see some definite heteroscedasticity, which is not surprising for a variable like Salary; income data is typically right-skewed and tends to have non-constant variance. Let’s check to be sure:

hist(Hitters$Salary)

Yup! Let’s try a log transformation to make the response more normal, and see if this fixes the heteroscedasticity.

hist(log(Hitters$Salary))

Looks better already. Let’s do the fit…

library(dplyr)
logHitters <- mutate(Hitters, logSalary = log(Salary))
#Xlog = model.matrix(logSalary ~ . - Salary, logHitters)[, -1]
ylog = logHitters$logSalary

log_fit_lasso_cv = cv.glmnet(X, ylog, alpha = 1)
plotres(log_fit_lasso_cv)

Did using the log transformation change the predictors found by the LASSO fit? Let’s see…

coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 193.74264
AtBat         .      
Hits          1.21471
HmRun         .      
Runs          .      
RBI           .      
Walks         1.28958
Years         .      
CAtBat        .      
CHits         .      
CHmRun        .      
CRuns         0.12924
CRBI          0.31516
CWalks        .      
LeagueN       .      
DivisionW     .      
PutOuts       0.02533
Assists       .      
Errors        .      
NewLeagueN    .      

coef(log_fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 4.9970914
AtBat       .        
Hits        0.0033360
HmRun       .        
Runs        .        
RBI         0.0003002
Walks       0.0029740
Years       0.0080495
CAtBat      .        
CHits       0.0003775
CHmRun      .        
CRuns       0.0002244
CRBI        0.0000610
CWalks      .        
LeagueN     .        
DivisionW   .        
PutOuts     .        
Assists     .        
Errors      .        
NewLeagueN  .        

Yes, the selected predictors are not exactly the same.
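
To see exactly how the selections differ, here is a small sketch (not in the original lecture) that lists which predictors are nonzero at lambda.1se in each fit:

sel_raw = rownames(coef(fit_lasso_cv, s = "lambda.1se"))[
  as.vector(coef(fit_lasso_cv, s = "lambda.1se")) != 0]
sel_log = rownames(coef(log_fit_lasso_cv, s = "lambda.1se"))[
  as.vector(coef(log_fit_lasso_cv, s = "lambda.1se")) != 0]
setdiff(sel_log, sel_raw)   # selected only after the log transform
setdiff(sel_raw, sel_log)   # selected only on the raw Salary scale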

Now I’m concerned about outliers:

plotres(log_fit_lasso_cv,which=3)

What is going on with Mike Schmidt? He is observation #173. Let’s look at a few observations around him:

Hitters[173:175,c(1:8,19)] # Just show a few predictors for space reasons
                  AtBat Hits HmRun Runs RBI Walks Years CAtBat Salary
-Mike Schmidt        20    1     0    0   0     0     2     41   2127
-Mike Scioscia      374   94     5   36  26    62     7   1968    875
-Mickey Tettleton   211   43    10   26  35    39     3    498    120

He’s had only 1 hit, 0 home runs, 0 runs, and has only been in the league 2 years, yet his salary seems very high. Let’s confirm that:

hist(Hitters$Salary)

He is almost the highest-paid player, according to this data. What is going on? Let’s Google Mike Schmidt.

(Image: Mike Schmidt)

Mike Schmidt played with the Phillies starting in 1972, so he’d been playing for 14 years by 1986. According to Wikipedia:

“Schmidt was a twelve-time All-Star and a three-time winner of the National League (NL) Most Valuable Player award (MVP), and he was known for his combination of power hitting and strong defense: as a hitter, he compiled 548 home runs and 1,595 runs batted in (RBIs), and led the NL in home runs eight times and in RBIs four times.”

This all explains why Schmidt was paid so much, but not why he’s listed as having 0 runs and 0 home runs. I’m going to conclude that there’s something wrong with this data, and run the fit without Schmidt.

#Xsans = model.matrix(logSalary ~ . - Salary, logHitters[-173,])[, -1]
ysans = logHitters[-173,]$logSalary

log_fit_lasso_cv_sans = cv.glmnet(X[-173,], ysans, alpha = 1)
plotres(log_fit_lasso_cv_sans)

Terry Kennedy (observation #241) seems to be in a similar situation, so we drop him as well.

#Xsans2 = model.matrix(logSalary ~ . - Salary, logHitters[-c(173,241),])[, -1]
ysans2 = logHitters[-c(173,241),]$logSalary

log_fit_lasso_cv_sans2 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 1)
plotres(log_fit_lasso_cv_sans2)

Steve Sax is a bit of an outlier here, but I don’t see anything wrong with his data. Also, I doubt he is very influential, since he’s near the middle of the fitted value chart.

Let’s look at the final coefficients:

coef(log_fit_lasso_cv_sans2)
20 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  4.60250178
AtBat        .         
Hits         0.00509315
HmRun        .         
Runs         .         
RBI          0.00055799
Walks        0.00463658
Years        0.03839075
CAtBat       .         
CHits        0.00031499
CHmRun       .         
CRuns        0.00009165
CRBI         .         
CWalks       .         
LeagueN      .         
DivisionW   -0.00915693
PutOuts      .         
Assists      .         
Errors       .         
NewLeagueN   .         

Would we get similar values from a non-regularized fit with the same predictors? (An added advantage of lm(): we get nicer diagnostic plots by default.)


linfit2 <- lm(log(Salary) ~ Hits + RBI + Walks + Years + CHits + CRuns + Division, data = Hitters[-c(173,241), ])
summary(linfit2)

Call:
lm(formula = log(Salary) ~ Hits + RBI + Walks + Years + CHits + 
    CRuns + Division, data = Hitters[-c(173, 241), ])

Residuals:
   Min     1Q Median     3Q    Max 
-2.064 -0.446  0.068  0.416  1.335 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.276127   0.142734   29.96  < 2e-16 ***
Hits         0.006435   0.001464    4.40 0.000016 ***
RBI          0.001019   0.002317    0.44  0.66058    
Walks        0.007025   0.002321    3.03  0.00273 ** 
Years        0.067816   0.019197    3.53  0.00049 ***
CHits        0.000399   0.000386    1.03  0.30206    
CRuns       -0.000288   0.000696   -0.41  0.67941    
DivisionW   -0.169181   0.072177   -2.34  0.01985 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.574 on 253 degrees of freedom
Multiple R-squared:  0.589, Adjusted R-squared:  0.578 
F-statistic: 51.9 on 7 and 253 DF,  p-value: <2e-16
plot(linfit2)

Are the coefficients from the linear fit and from the LASSO broadly similar?

coef(linfit2)
(Intercept)        Hits         RBI       Walks       Years       CHits 
  4.2761274   0.0064352   0.0010187   0.0070254   0.0678155   0.0003987 
      CRuns   DivisionW 
 -0.0002878  -0.1691809 
lassocoefs <- as.matrix(coef(log_fit_lasso_cv_sans2))
lassocoefs[lassocoefs!=0]
[1]  4.60250178  0.00509315  0.00055799  0.00463658  0.03839075  0.00031499
[7]  0.00009165 -0.00915693
# drop the intercepts; the remaining seven coefficients are in matching order
plot(coef(linfit2)[-1] ~ lassocoefs[lassocoefs != 0][-1])

Lastly, let’s try an elastic net fit (\(\alpha = 0.5\)) to see if it will bring in both the career totals and the 1986 season numbers for RBI and Walks.

log_fit_lasso_cv_sans3 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 0.5)
plotres(log_fit_lasso_cv_sans3)

coef(log_fit_lasso_cv_sans3)
20 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 4.78008050
AtBat       .         
Hits        0.00337053
HmRun       .         
Runs        0.00160678
RBI         0.00104338
Walks       0.00338272
Years       0.02248827
CAtBat      0.00003187
CHits       0.00015698
CHmRun      .         
CRuns       0.00022782
CRBI        0.00012046
CWalks      .         
LeagueN     .         
DivisionW   .         
PutOuts     .         
Assists     .         
Errors      .         
NewLeagueN  .