Nested Cross Validation for Hyperparameter Search — nested

Run a grid search in a nested cross validation.

nested_gridsearch(
  x,
  y,
  searchspace,
  FUN,
  nouterfolds = 5,
  ninnerfolds = 5,
  nrepcv = 2,
  ...
)

Arguments

x: matrix/data.frame, feature matrix, see ranger() for details.
y: numeric/factor, classification labels, see ranger() for details.
searchspace: data.frame, hyperparameters to tune. Column names have to match the argument names of FUN.
FUN: function, function to optimize.
nouterfolds: integer(1), number of outer cross validation folds.
ninnerfolds: integer(1), number of inner cross validation folds.
nrepcv: integer(1), number of repeats of the inner cross validation.
...: further arguments passed to gs_rusranger().
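
The requirement that the column names of searchspace match the argument names of FUN can be checked before a long search is started. The following is a minimal sketch in base R; myfun is a hypothetical placeholder and not part of this package, only its argument names matter here.

```r
## Hypothetical function to optimize (assumption, for illustration only).
myfun <- function(x, y, mtry = 2, num.trees = 500, ...) NULL

## One row per hyperparameter combination.
searchspace <- expand.grid(
    mtry = c(2, 3),
    num.trees = c(500, 1000)
)

## Columns that do not correspond to a named argument of myfun.
unmatched <- setdiff(names(searchspace), names(formals(myfun)))
stopifnot(length(unmatched) == 0)

## Number of hyperparameter combinations that will be evaluated.
nrow(searchspace)
```

The grid grows multiplicatively with every added hyperparameter, so checking nrow(searchspace) first gives a rough idea of the run time.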

Value

list, with one element per outer fold, each containing the following subelements:

indextrain: index of the used training items.
indextest: index of the used test items.
performance: resulting performance (AUC).
selectedparams: selected hyperparameters.
gridsearch: data.frame, results of the grid search.
nouterfolds: integer(1).
ninnerfolds: integer(1).
nrepcv: integer(1).
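
Because the result is a plain list with one element per outer fold, the per-fold values can be collected with base R. The sketch below rebuilds a reduced result by hand, using only the performance and selectedparams values printed in the Examples section, instead of calling the function itself; the object name res is an assumption.

```r
## Reduced stand-in for the returned list (values copied from the example
## output; all other subelements are omitted).
res <- list(
    list(performance = 0.01557093,
         selectedparams = data.frame(mtry = 3, num.trees = 500)),
    list(performance = 0,
         selectedparams = data.frame(mtry = 2, num.trees = 1000)),
    list(performance = 0.0234375,
         selectedparams = data.frame(mtry = 2, num.trees = 500))
)

## Performance of each outer fold as a numeric vector.
perf <- vapply(res, function(r) r$performance, numeric(1))
summary(perf)

## Selected hyperparameters of all outer folds in one data.frame.
params <- do.call(rbind, lapply(res, function(r) r$selectedparams))
params
```

Differing rows in params indicate that the selected hyperparameters are not stable across outer folds, which is worth inspecting before settling on a final model.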

Note

The reported performance may differ slightly from the median performance in the reported gridsearch: after the grid search, FUN is trained again with the best hyperparameters, which results in a new subsampling.
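
The general structure of a nested cross validation, including the refit mentioned in the note, can be sketched in base R. This is not the implementation used by this package: the learner (a polynomial lm()), the search space (degree), the metric (mean squared error instead of AUC) and the helpers mse() and folds() are assumptions chosen to keep the example self-contained, and there is no repetition of the inner cross validation.

```r
set.seed(1)
n <- 60
d <- data.frame(x = runif(n, -2, 2))
d$y <- d$x^2 + rnorm(n, sd = 0.3)

## Hypothetical search space: polynomial degree of a linear model.
searchspace <- data.frame(degree = 1:4)

## Mean squared error of a polynomial fit (lower is better).
mse <- function(train, test, degree) {
    fit <- lm(y ~ poly(x, degree), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
}

## Randomly assign each of n rows to one of k folds.
folds <- function(n, k) sample(rep_len(seq_len(k), n))

nouterfolds <- 3
ninnerfolds <- 3
outer <- folds(n, nouterfolds)

res <- lapply(seq_len(nouterfolds), function(i) {
    train <- d[outer != i, , drop = FALSE]
    test <- d[outer == i, , drop = FALSE]

    ## Inner cross validation: evaluate every row of the search space
    ## using the outer training data only.
    inner <- folds(nrow(train), ninnerfolds)
    searchspace$Median <- vapply(searchspace$degree, function(dg) {
        median(vapply(seq_len(ninnerfolds), function(j) {
            mse(train[inner != j, , drop = FALSE],
                train[inner == j, , drop = FALSE], dg)
        }, numeric(1)))
    }, numeric(1))

    ## Refit on the whole outer training set with the best hyperparameters
    ## and evaluate once on the held-out outer test set.
    best <- searchspace[which.min(searchspace$Median), , drop = FALSE]
    list(
        performance = mse(train, test, best$degree),
        selectedparams = best["degree"],
        gridsearch = searchspace
    )
})
```

In this sketch the outer performance already differs from the inner median because it is computed on different data; a learner that subsamples randomly during training adds further variation on top of that.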

Examples

set.seed(20220324)
iris <- subset(iris, Species != "setosa")
searchspace <- expand.grid(
   mtry = c(2, 3),
   num.trees = c(500, 1000)
)
## n(outer|inner) folds and nrepcv are too low for real world applications,
## and are just used for demonstration and to keep the run time of the examples
## low
nrcv_rusranger(
    iris[-5], as.numeric(iris$Species == "versicolor"),
    searchspace = searchspace, nouterfolds = 3, ninnerfolds = 3, nrepcv = 1
)
#> [[1]]
#> [[1]]$model
#> Ranger result
#> 
#> Call:
#>  ranger(x = as.data.frame(x), y = y, probability = probability,      classification = classification, min.node.size = min.node.size,      replace = replace, case.weights = .caseweights(y, replace = replace),      sample.fraction = .samplefraction(y), ..., keep.inbag = FALSE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      66 
#> Number of independent variables:  4 
#> Mtry:                             3 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  NaN 
#> 
#> [[1]]$indextrain
#>  21  22  23  24  25  26  27  28  29 210 211 212 213 214 215 216 217 218 219 220 
#>   4   8  13  14  15  19  20  26  29  30  34  35  38  39  43  44  49  51  55  58 
#> 221 222 223 224 225 226 227 228 229 230 231 232 233 234  31  32  33  34  35  36 
#>  60  61  63  67  69  74  75  81  83  88  91  94  99 100   1   3   5   9  12  16 
#>  37  38  39 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 
#>  18  22  25  27  28  36  37  47  48  50  62  65  70  71  73  76  77  82  84  85 
#> 327 328 329 330 331 332 
#>  89  90  92  96  97  98 
#> 
#> [[1]]$indextest
#>  11  12  13  14  15  16  17  18  19 110 111 112 113 114 115 116 117 118 119 120 
#>   2   6   7  10  11  17  21  23  24  31  32  33  40  41  42  45  46  52  53  54 
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 
#>  56  57  59  64  66  68  72  78  79  80  86  87  93  95 
#> 
#> [[1]]$prediction
#>  [1] 0.0000000 0.0000000 0.0000000 0.0025000 0.0000000 0.0000000 0.9737500
#>  [8] 0.1420357 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 0.0000000 0.9913333 1.0000000 1.0000000 1.0000000
#> [22] 0.0160000 1.0000000 0.9737500 1.0000000 1.0000000 0.9737500 0.9737500
#> [29] 1.0000000 0.5730190 1.0000000 1.0000000 0.9913333 1.0000000
#> 
#> [[1]]$truth
#>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 
#> [[1]]$performance
#> [1] 0.01557093
#> 
#> [[1]]$selectedparams
#>   mtry num.trees
#> 2    3       500
#> 
#> [[1]]$gridsearch
#>   mtry num.trees         Min          Q1      Median          Q3         Max
#> 1    2       500 0.008333333 0.008333333 0.008333333 0.008333333 0.008333333
#> 2    3       500 0.017857143 0.017857143 0.017857143 0.017857143 0.017857143
#> 3    2      1000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> 4    3      1000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> 
#> [[1]]$nouterfolds
#> [1] 3
#> 
#> [[1]]$ninnerfolds
#> [1] 3
#> 
#> [[1]]$nrepcv
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$model
#> Ranger result
#> 
#> Call:
#>  ranger(x = as.data.frame(x), y = y, probability = probability,      classification = classification, min.node.size = min.node.size,      replace = replace, case.weights = .caseweights(y, replace = replace),      sample.fraction = .samplefraction(y), ..., keep.inbag = FALSE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  1000 
#> Sample size:                      66 
#> Number of independent variables:  4 
#> Mtry:                             2 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  NaN 
#> 
#> [[2]]$indextrain
#>  11  12  13  14  15  16  17  18  19 110 111 112 113 114 115 116 117 118 119 120 
#>   2   6   7  10  11  17  21  23  24  31  32  33  40  41  42  45  46  52  53  54 
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134  31  32  33  34  35  36 
#>  56  57  59  64  66  68  72  78  79  80  86  87  93  95   1   3   5   9  12  16 
#>  37  38  39 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 
#>  18  22  25  27  28  36  37  47  48  50  62  65  70  71  73  76  77  82  84  85 
#> 327 328 329 330 331 332 
#>  89  90  92  96  97  98 
#> 
#> [[2]]$indextest
#>  21  22  23  24  25  26  27  28  29 210 211 212 213 214 215 216 217 218 219 220 
#>   4   8  13  14  15  19  20  26  29  30  34  35  38  39  43  44  49  51  55  58 
#> 221 222 223 224 225 226 227 228 229 230 231 232 233 234 
#>  60  61  63  67  69  74  75  81  83  88  91  94  99 100 
#> 
#> [[2]]$prediction
#>  [1] 0.045900000 0.494400000 0.056433333 0.007100794 0.000250000 0.090016667
#>  [7] 0.026650000 0.001555556 0.007411905 0.003250000 0.776325397 0.004017460
#> [13] 0.056433333 0.000250000 0.013533333 0.045900000 0.026400000 0.988466667
#> [19] 1.000000000 0.987714286 1.000000000 0.988466667 0.992714286 0.983325397
#> [25] 1.000000000 0.919846032 0.981514286 1.000000000 1.000000000 0.983325397
#> [31] 0.992714286 0.981514286 0.988466667 0.951325397
#> 
#> [[2]]$truth
#>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 
#> [[2]]$performance
#> [1] 0
#> 
#> [[2]]$selectedparams
#>   mtry num.trees
#> 3    2      1000
#> 
#> [[2]]$gridsearch
#>   mtry num.trees         Min          Q1      Median          Q3         Max
#> 1    2       500 0.033333333 0.033333333 0.033333333 0.033333333 0.033333333
#> 2    3       500 0.020661157 0.020661157 0.020661157 0.020661157 0.020661157
#> 3    2      1000 0.057851240 0.057851240 0.057851240 0.057851240 0.057851240
#> 4    3      1000 0.008547009 0.008547009 0.008547009 0.008547009 0.008547009
#> 
#> [[2]]$nouterfolds
#> [1] 3
#> 
#> [[2]]$ninnerfolds
#> [1] 3
#> 
#> [[2]]$nrepcv
#> [1] 1
#> 
#> 
#> [[3]]
#> [[3]]$model
#> Ranger result
#> 
#> Call:
#>  ranger(x = as.data.frame(x), y = y, probability = probability,      classification = classification, min.node.size = min.node.size,      replace = replace, case.weights = .caseweights(y, replace = replace),      sample.fraction = .samplefraction(y), ..., keep.inbag = FALSE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      68 
#> Number of independent variables:  4 
#> Mtry:                             2 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  NaN 
#> 
#> [[3]]$indextrain
#>  11  12  13  14  15  16  17  18  19 110 111 112 113 114 115 116 117 118 119 120 
#>   2   6   7  10  11  17  21  23  24  31  32  33  40  41  42  45  46  52  53  54 
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134  21  22  23  24  25  26 
#>  56  57  59  64  66  68  72  78  79  80  86  87  93  95   4   8  13  14  15  19 
#>  27  28  29 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 
#>  20  26  29  30  34  35  38  39  43  44  49  51  55  58  60  61  63  67  69  74 
#> 227 228 229 230 231 232 233 234 
#>  75  81  83  88  91  94  99 100 
#> 
#> [[3]]$indextest
#>  31  32  33  34  35  36  37  38  39 310 311 312 313 314 315 316 317 318 319 320 
#>   1   3   5   9  12  16  18  22  25  27  28  36  37  47  48  50  62  65  70  71 
#> 321 322 323 324 325 326 327 328 329 330 331 332 
#>  73  76  77  82  84  85  89  90  92  96  97  98 
#> 
#> [[3]]$prediction
#>  [1] 0.22633333 0.24873333 0.02433333 0.02433333 0.00000000 0.02633333
#>  [7] 0.00000000 0.00000000 0.02433333 0.02633333 0.72173333 0.05220000
#> [13] 0.02633333 0.00000000 0.00000000 0.00000000 0.97300000 0.95382857
#> [19] 0.24197143 1.00000000 1.00000000 0.99100000 0.52140000 1.00000000
#> [25] 0.31340000 0.45557143 0.50225714 1.00000000 0.99040000 1.00000000
#> [31] 0.88473333 1.00000000
#> 
#> [[3]]$truth
#>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> 
#> [[3]]$performance
#> [1] 0.0234375
#> 
#> [[3]]$selectedparams
#>   mtry num.trees
#> 1    2       500
#> 
#> [[3]]$gridsearch
#>   mtry num.trees Min Q1 Median Q3 Max
#> 1    2       500   0  0      0  0   0
#> 2    3       500   0  0      0  0   0
#> 3    2      1000   0  0      0  0   0
#> 4    3      1000   0  0      0  0   0
#> 
#> [[3]]$nouterfolds
#> [1] 3
#> 
#> [[3]]$ninnerfolds
#> [1] 3
#> 
#> [[3]]$nrepcv
#> [1] 1
#> 
#>