Introduction to bootstrapping with (a)rcv.glmnet

Authors: Sebastian Gibb [aut, cre] (https://orcid.org/0000-0001-7406-4443)
Last modified: 2022-12-05 12:37:50
Compiled: Mon Dec 5 12:39:27 2022

Introduction

The ameld R package extends glmnet::cv.glmnet (Friedman et al. 2022; Friedman, Hastie, and Tibshirani 2010). It provides repeated cross-validation (rcv.glmnet) and repeated cross-validation that tunes alpha and lambda simultaneously (arcv.glmnet). It also provides a bootstrap function that can use either of these functions and supports survival data, as described in Harrell, Lee, and Mark (1996).
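To make the idea of repeated cross-validation concrete, here is a minimal base R sketch. It is not how rcv.glmnet is implemented: it swaps glmnet for lm and tunes a polynomial degree instead of lambda, and the helper name cv_once and the simulated data are assumptions made only for illustration. The point it shows is that the cross-validated error is averaged over several random fold assignments before the tuning parameter is chosen.

```r
set.seed(42)

# simulated data with a quadratic relationship (illustrative only)
n <- 60
x <- runif(n, -2, 2)
d <- data.frame(x = x, y = x^2 + rnorm(n, sd = 0.5))

# hypothetical helper: one k-fold cross-validation run that returns the
# mean squared prediction error for each candidate polynomial degree
cv_once <- function(d, degrees, nfolds) {
    foldid <- sample(rep_len(seq_len(nfolds), nrow(d)))
    sapply(degrees, function(deg) {
        err <- sapply(seq_len(nfolds), function(k) {
            fit <- lm(y ~ poly(x, deg), data = d[foldid != k, ])
            test <- d[foldid == k, ]
            mean((test$y - predict(fit, newdata = test))^2)
        })
        mean(err)
    })
}

degrees <- 1:4
nrep <- 5

# repeat the cross-validation with new random folds each time;
# rows are candidate degrees, columns are repetitions
cvm <- replicate(nrep, cv_once(d, degrees, nfolds = 3))

# average over the repetitions and pick the degree with the lowest error
avg <- rowMeans(cvm)
best <- degrees[which.min(avg)]
```

Averaging over repetitions reduces the dependence of the selected tuning parameter on a single random split into folds.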

Dataset

We use the eldd dataset provided by ameld (see ?eldd for details) and standardize it using the zlog (Hoffmann et al. 2017) method.

library("ameld")
library("zlog")
data(eldd)
data(eldr)

# transform reference data.frame for zlog
r <- eldr[c("Code", "AgeDays", "Sex", "LowerLimit", "UpperLimit")]
names(r) <- c("param", "age", "sex", "lower", "upper")
r$age <- r$age / 365.25
r <- set_missing_limits(r)

## we just want to standardize laboratory values
cn <- colnames(eldd)
cnlabs <- cn[grepl("_[SCEFQ1]$", cn)]
zeldd <- eldd
zeldd[c("Age", "Sex", cnlabs)] <- zlog_df(eldd[, c("Age", "Sex", cnlabs)], r)
zeldd[c("Age", "Sex", cnlabs)] <- impute_df(zeldd[c("Age", "Sex", cnlabs)], r)
zeldd <- na.omit(zeldd)
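For readers unfamiliar with the zlog transformation, the following base R sketch illustrates the formula from Hoffmann et al. (2017) for a single analyte. The function name zlog_sketch and the reference limits 35 and 70 are made up for illustration; the code above uses zlog_df from the zlog package, which additionally looks up age- and sex-specific limits.

```r
# hypothetical helper, not part of the zlog package:
# standardize x against the reference interval [lower, upper], assuming
# the limits are the 2.5% and 97.5% quantiles of a log-normal distribution
zlog_sketch <- function(x, lower, upper) {
    mu <- (log(lower) + log(upper)) / 2
    sigma <- (log(upper) - log(lower)) / (2 * qnorm(0.975))
    (log(x) - mu) / sigma
}

# made-up reference interval of 35 to 70 units
z <- zlog_sketch(c(35, sqrt(35 * 70), 70), lower = 35, upper = 70)
```

Under these assumptions the lower limit maps to about -1.96, the geometric mean of the limits to 0, and the upper limit to about 1.96, so values from different laboratory parameters become comparable on one scale.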

Bootstrapping

Next we apply the bootstrap. In general, the number of bootstrap samples nboot should be at least 100. We use a much smaller number here to keep the runtime low.

library("future")

## 
## Attaching package: 'future'

## The following object is masked from 'package:survival':
## 
##     cluster

srv <- Surv(zeldd$DaysAtRisk, zeldd$Deceased)
zeldd$DaysAtRisk <- zeldd$Deceased <- NULL
x <- data.matrix(zeldd)

bt <- bootstrap(
    x, srv,
    fun = rcv.glmnet,
    family = "cox",
    nboot = 3,
    nfolds = 3,
    nrep = 2
)

## Loading required package: foreach

We can show an optimism-corrected calibration curve.

plot(bt, what = "calibration")
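The optimism correction follows the bootstrap procedure described in Harrell, Lee, and Mark (1996). The sketch below shows that procedure for a single scalar performance measure. It is not ameld's implementation: it uses lm and R squared on the mtcars data as stand-ins for a Cox model and a calibration curve, and the helper rsq is a hypothetical name.

```r
set.seed(123)
d <- mtcars[c("mpg", "wt", "hp", "qsec", "drat")]

# hypothetical helper: R squared of a fitted model on an arbitrary data set
rsq <- function(fit, data) {
    res <- data$mpg - predict(fit, newdata = data)
    1 - sum(res^2) / sum((data$mpg - mean(data$mpg))^2)
}

# apparent performance: model fitted and evaluated on the same data
apparent <- rsq(lm(mpg ~ ., data = d), d)

nboot <- 200
optimism <- replicate(nboot, {
    # draw a bootstrap sample and refit the model on it
    b <- d[sample(nrow(d), replace = TRUE), ]
    fit <- lm(mpg ~ ., data = b)
    # performance on the bootstrap sample minus performance on the original data
    rsq(fit, b) - rsq(fit, d)
})

# optimism-corrected estimate
corrected <- apparent - mean(optimism)
```

The average optimism estimates how much the apparent performance overstates the performance on new data; subtracting it gives the corrected estimate.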

Additionally, we can see which variables were selected in each bootstrap step.

plot(bt, what = "selected")

Automatically select the best alpha in each bootstrap step

It is possible to use arcv.glmnet to automatically select the best alpha in each bootstrap step.

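## wrapper: run arcv.glmnet over all alpha values and return the best model
selarcv <- function(...) {
    dots <- list(...)
    ## repeated cross-validation for each alpha
    a <- arcv.glmnet(...)
    ## index of the model with the lowest error (restricted by maxnnzero)
    i <- which.min.error(a, s = dots$s, maxnnzero = dots$maxnnzero)
    a$models[[i]]
}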

bt <- bootstrap(
    x, srv,
    fun = selarcv,
    family = "cox",
    alpha = seq(0, 1, length.out = 11)^3,
    s = "lambda.1se",
    maxnnzero = 9,
    nboot = 10L, nfolds = 3, nrep = 5,
    m = 50, times = 90
)
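Two details of this call can be illustrated in base R. First, cubing an evenly spaced sequence yields an alpha grid that is much denser near 0 (ridge-like) than near 1 (lasso). Second, the sketch assumes that which.min.error picks the model with the lowest error among those with at most maxnnzero non-zero coefficients, as the argument name suggests; the table of results below consists of made-up toy values, not output of arcv.glmnet.

```r
# the alpha grid used above
alpha <- seq(0, 1, length.out = 11)^3
round(alpha, 3)
# 0.000 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1.000

# made-up toy results: cross-validated error (cvm) and number of
# non-zero coefficients (nzero) for a few alpha values
res <- data.frame(
    alpha = c(0, 0.125, 0.512, 1),
    cvm = c(0.80, 0.70, 0.72, 0.75),
    nzero = c(30, 14, 8, 5)
)

# assumed selection rule: lowest error among sufficiently sparse models
maxnnzero <- 9
sparse <- res[res$nzero <= maxnnzero, ]
best <- sparse[which.min(sparse$cvm), ]
```

In this toy table the overall lowest error belongs to alpha = 0.125, but that model uses 14 coefficients, so the constrained choice is alpha = 0.512 with 8 coefficients.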

Acknowledgment

This work is part of the AMPEL (Analysis and Reporting System for the Improvement of Patient Safety through Real-Time Integration of Laboratory Findings) project.

This measure is co-funded with tax revenues based on the budget adopted by the members of the Saxon State Parliament.

Session Information

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] doFuture_0.12.2 foreach_1.5.2   future_1.29.0   zlog_1.0.1.9000
## [5] ameld_0.0.31    survival_3.4-0  glmnet_4.1-6    Matrix_1.5-1   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9          highr_0.9           progressr_0.11.0   
##  [4] bslib_0.4.1         compiler_4.2.2      jquerylib_0.1.4    
##  [7] iterators_1.0.14    tools_4.2.2         digest_0.6.30      
## [10] viridisLite_0.4.1   jsonlite_1.8.3      evaluate_0.18      
## [13] memoise_2.0.1       lifecycle_1.0.3     lattice_0.20-45    
## [16] rlang_1.0.6         cli_3.4.1           parallel_4.2.2     
## [19] yaml_2.3.6          pkgdown_2.0.6       xfun_0.35          
## [22] fastmap_1.1.0       stringr_1.5.0       knitr_1.41         
## [25] globals_0.16.2      desc_1.4.2          fs_1.5.2           
## [28] vctrs_0.5.1         sass_0.4.4          systemfonts_1.0.4  
## [31] rprojroot_2.0.3     grid_4.2.2          glue_1.6.2         
## [34] listenv_0.8.0       R6_2.5.1            textshaping_0.3.6  
## [37] future.apply_1.10.0 parallelly_1.32.1   rmarkdown_2.18     
## [40] purrr_0.3.5         magrittr_2.0.3      codetools_0.2-18   
## [43] htmltools_0.5.3     splines_4.2.2       shape_1.4.6        
## [46] ragg_1.2.4          stringi_1.7.8       cachem_1.0.6

References

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.

Friedman, Jerome, Trevor Hastie, Rob Tibshirani, Balasubramanian Narasimhan, Kenneth Tay, Noah Simon, and James Yang. 2022. Glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. https://CRAN.R-project.org/package=glmnet.

Harrell, Frank E., Kerry L. Lee, and Daniel B. Mark. 1996. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Statistics in Medicine 15 (4): 361–87. https://doi.org/10.1002/(sici)1097-0258(19960229)15:4<361::aid-sim168>3.0.co;2-4.

Hoffmann, Georg, Frank Klawonn, Ralf Lichtinghagen, and Matthias Orth. 2017. “The Zlog-Value as Basis for the Standardization of Laboratory Results.” LaboratoriumsMedizin 41 (1): 23–32. https://doi.org/10.1515/labmed-2016-0087.