
The AutoFSelector wraps a mlr3::Learner and augments it with automatic feature selection. The auto_fselector() function creates an AutoFSelector object.

Usage

auto_fselector(
  fselector,
  learner,
  resampling,
  measure = NULL,
  term_evals = NULL,
  term_time = NULL,
  terminator = NULL,
  store_fselect_instance = TRUE,
  store_benchmark_result = TRUE,
  store_models = FALSE,
  check_values = FALSE,
  callbacks = NULL,
  ties_method = "least_features",
  id = NULL
)

Arguments

fselector

(FSelector)
Optimization algorithm.

learner

(mlr3::Learner)
Learner to optimize the feature subset for.

resampling

(mlr3::Resampling)
Resampling that is used to evaluate the performance of the feature subsets. Uninstantiated resamplings are instantiated during construction so that all feature subsets are evaluated on the same data splits. Already instantiated resamplings are kept unchanged.

measure

(mlr3::Measure)
Measure to optimize. If NULL, default measure is used.

term_evals

(integer(1))
Number of allowed evaluations. Ignored if terminator is passed.

term_time

(integer(1))
Maximum allowed time in seconds. Ignored if terminator is passed.

terminator

(bbotk::Terminator)
Stop criterion of the feature selection.

store_fselect_instance

(logical(1))
If TRUE (default), stores the internally created FSelectInstanceBatchSingleCrit with all intermediate results in slot $fselect_instance. Automatically set to TRUE if store_models = TRUE.

store_benchmark_result

(logical(1))
Store benchmark result in archive?

store_models

(logical(1))
Store models in benchmark result?

check_values

(logical(1))
Check the parameters before the evaluation and the results for validity?

callbacks

(list of CallbackBatchFSelect)
List of callbacks.

ties_method

(character(1))
The method to break ties when selecting feature sets while optimizing and when selecting the best feature set. Can be "least_features" or "random". The option "least_features" (default) selects the feature set with the fewest features. If there are multiple best feature sets with the same number of features, one is selected randomly. The option "random" returns a random feature set from the best feature sets. Ignored if multiple measures are used.

id

(character(1))
Identifier for the new instance.
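
As an illustration of the termination and tie-breaking arguments above, the following sketch passes an explicit terminator instead of term_evals or term_time. The budget values (20 evaluations, 30 seconds) and the choice of 3-fold cross-validation are arbitrary assumptions made for this example; trm() is taken from the bbotk package, which is loaded explicitly here.

```r
library(mlr3)
library(mlr3fselect)
library(bbotk)  # loaded explicitly for trm()

# Stop as soon as either budget is exhausted (any = TRUE):
# 20 evaluations or 30 seconds of run time, whichever comes first.
# Both limits are arbitrary values chosen for illustration.
terminator = trm("combo",
  list(trm("evals", n_evals = 20), trm("run_time", secs = 30)),
  any = TRUE
)

afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  terminator = terminator,   # term_evals and term_time are left at NULL
  ties_method = "random"     # pick randomly among equally good feature sets
)
```

Because term_evals and term_time are ignored when a terminator is passed, they are simply omitted in this call.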

Value

AutoFSelector.

Details

The AutoFSelector is a mlr3::Learner which wraps another mlr3::Learner and performs the following steps during $train():

  1. The wrapped (inner) learner is trained on the feature subsets via resampling. The feature selection can be specified by providing a FSelector, a bbotk::Terminator, a mlr3::Resampling and a mlr3::Measure.

  2. A final model is fit on the complete training data with the best-found feature subset.

During $predict() the AutoFSelector just calls the predict method of the wrapped (inner) learner.
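
The two training steps above can be approximated by hand, which may help to see what the AutoFSelector automates. This is only a sketch of the idea using the fselect() helper from the same package, not the internal implementation; the variable names instance and task_final are chosen for this example.

```r
library(mlr3)
library(mlr3fselect)

task = tsk("penguins")
learner = lrn("classif.rpart")

# Step 1: evaluate candidate feature subsets via resampling.
instance = fselect(
  fselector = fs("random_search"),
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  term_evals = 4
)

# Step 2: fit a final model on all rows, restricted to the best-found subset.
# The task is cloned so that the original task keeps all of its features.
task_final = task$clone()
task_final$select(instance$result_feature_set)
learner$train(task_final)

# Prediction then uses the fitted learner directly.
prediction = learner$predict(task_final)
```

Wrapping these steps in an AutoFSelector has the advantage that the whole procedure behaves like a single learner and can therefore be passed to mlr3::resample() or mlr3::benchmark().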

Resources

There are several sections about feature selection in the mlr3book.

The gallery features a collection of case studies and demos about optimization.

Nested Resampling

Nested resampling can be performed by passing an AutoFSelector object to mlr3::resample() or mlr3::benchmark(). To access the inner resampling results, set store_fselect_instance = TRUE and execute mlr3::resample() or mlr3::benchmark() with store_models = TRUE (see examples). The mlr3::Resampling passed to the AutoFSelector is meant to be the inner resampling, operating on the training set of an arbitrary outer resampling. For this reason it is not feasible to pass an instantiated mlr3::Resampling here.
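
The Examples section shows nested resampling with mlr3::resample(). As a complementary sketch, the same idea with mlr3::benchmark() might look as follows, here comparing the AutoFSelector against the plain learner on the same outer folds. The comparison setup and the name design are assumptions made for this illustration.

```r
library(mlr3)
library(mlr3fselect)

afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),        # inner resampling, left uninstantiated
  measure = msr("classif.ce"),
  term_evals = 4
)

# Both learners are evaluated on the same outer 3-fold cross-validation.
design = benchmark_grid(
  tasks = tsk("penguins"),
  learners = list(afs, lrn("classif.rpart")),
  resamplings = rsmp("cv", folds = 3)  # outer resampling
)
bmr = benchmark(design, store_models = TRUE)

# One aggregated score per learner, estimated on the outer test sets.
scores = bmr$aggregate(msr("classif.ce"))
```

The inner holdout split is drawn from each outer training set, so the outer test sets never influence which features are selected.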

Examples

# Automatic Feature Selection
# \donttest{

library(mlr3)
library(mlr3fselect)

# split into a training set and an external test set
task = tsk("penguins")
split = partition(task, ratio = 0.8)

# create auto fselector
afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 4)

# optimize feature subset and fit final model
afs$train(task, row_ids = split$train)

# predict with final model
afs$predict(task, row_ids = split$test)
#> <PredictionClassif> for 69 observations:
#>  row_ids     truth  response
#>       12    Adelie    Adelie
#>       14    Adelie    Adelie
#>       20    Adelie Chinstrap
#>      ---       ---       ---
#>      321 Chinstrap Chinstrap
#>      331 Chinstrap    Adelie
#>      338 Chinstrap Chinstrap

# show result
afs$fselect_result
#>    bill_depth bill_length body_mass flipper_length island    sex   year
#>        <lgcl>      <lgcl>    <lgcl>         <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:       TRUE        TRUE     FALSE           TRUE   TRUE  FALSE   TRUE
#>                                             features n_features classif.ce
#>                                               <list>      <int>      <num>
#> 1: bill_depth,bill_length,flipper_length,island,year          5 0.07608696

# model slot contains trained learner and fselect instance
afs$model
#> $learner
#> <LearnerClassifRpart:classif.rpart>: Classification Tree
#> * Model: rpart
#> * Parameters: xval=0
#> * Packages: mlr3, rpart
#> * Predict Types:  [response], prob
#> * Feature Types: logical, integer, numeric, factor, ordered
#> * Properties: importance, missings, multiclass, selected_features,
#>   twoclass, weights
#> 
#> $features
#> [1] "bill_depth"     "bill_length"    "flipper_length" "island"        
#> [5] "year"          
#> 
#> $fselect_instance
#> <FSelectInstanceBatchSingleCrit>
#> * State:  Optimized
#> * Objective: <ObjectiveFSelectBatch:classif.rpart_on_penguins>
#> * Terminator: <TerminatorEvals>
#> * Result:
#>    bill_depth bill_length body_mass flipper_length island    sex   year
#>        <lgcl>      <lgcl>    <lgcl>         <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:       TRUE        TRUE     FALSE           TRUE   TRUE  FALSE   TRUE
#>    classif.ce
#>         <num>
#> 1: 0.07608696
#> * Archive:
#>     bill_depth bill_length body_mass flipper_length island    sex   year
#>         <lgcl>      <lgcl>    <lgcl>         <lgcl> <lgcl> <lgcl> <lgcl>
#>  1:       TRUE        TRUE      TRUE           TRUE   TRUE   TRUE   TRUE
#>  2:      FALSE       FALSE     FALSE          FALSE   TRUE  FALSE   TRUE
#>  3:      FALSE        TRUE      TRUE          FALSE  FALSE   TRUE   TRUE
#>  4:       TRUE        TRUE      TRUE           TRUE   TRUE   TRUE  FALSE
#>  5:       TRUE       FALSE     FALSE          FALSE  FALSE   TRUE  FALSE
#>  6:       TRUE        TRUE      TRUE          FALSE  FALSE   TRUE  FALSE
#>  7:      FALSE       FALSE     FALSE           TRUE   TRUE   TRUE  FALSE
#>  8:      FALSE        TRUE      TRUE           TRUE  FALSE   TRUE  FALSE
#>  9:      FALSE        TRUE      TRUE           TRUE  FALSE  FALSE  FALSE
#> 10:       TRUE        TRUE     FALSE           TRUE   TRUE  FALSE   TRUE
#>     classif.ce
#>          <num>
#>  1: 0.07608696
#>  2: 0.32608696
#>  3: 0.10869565
#>  4: 0.07608696
#>  5: 0.22826087
#>  6: 0.15217391
#>  7: 0.21739130
#>  8: 0.10869565
#>  9: 0.10869565
#> 10: 0.07608696
#> 

# shortcut to the trained learner
afs$learner
#> <LearnerClassifRpart:classif.rpart>: Classification Tree
#> * Model: rpart
#> * Parameters: xval=0
#> * Packages: mlr3, rpart
#> * Predict Types:  [response], prob
#> * Feature Types: logical, integer, numeric, factor, ordered
#> * Properties: importance, missings, multiclass, selected_features,
#>   twoclass, weights

# shortcut to the fselect instance
afs$fselect_instance
#> <FSelectInstanceBatchSingleCrit>
#> * State:  Optimized
#> * Objective: <ObjectiveFSelectBatch:classif.rpart_on_penguins>
#> * Terminator: <TerminatorEvals>
#> * Result:
#>    bill_depth bill_length body_mass flipper_length island    sex   year
#>        <lgcl>      <lgcl>    <lgcl>         <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:       TRUE        TRUE     FALSE           TRUE   TRUE  FALSE   TRUE
#>    classif.ce
#>         <num>
#> 1: 0.07608696
#> * Archive:
#>     bill_depth bill_length body_mass flipper_length island    sex   year
#>         <lgcl>      <lgcl>    <lgcl>         <lgcl> <lgcl> <lgcl> <lgcl>
#>  1:       TRUE        TRUE      TRUE           TRUE   TRUE   TRUE   TRUE
#>  2:      FALSE       FALSE     FALSE          FALSE   TRUE  FALSE   TRUE
#>  3:      FALSE        TRUE      TRUE          FALSE  FALSE   TRUE   TRUE
#>  4:       TRUE        TRUE      TRUE           TRUE   TRUE   TRUE  FALSE
#>  5:       TRUE       FALSE     FALSE          FALSE  FALSE   TRUE  FALSE
#>  6:       TRUE        TRUE      TRUE          FALSE  FALSE   TRUE  FALSE
#>  7:      FALSE       FALSE     FALSE           TRUE   TRUE   TRUE  FALSE
#>  8:      FALSE        TRUE      TRUE           TRUE  FALSE   TRUE  FALSE
#>  9:      FALSE        TRUE      TRUE           TRUE  FALSE  FALSE  FALSE
#> 10:       TRUE        TRUE     FALSE           TRUE   TRUE  FALSE   TRUE
#>     classif.ce
#>          <num>
#>  1: 0.07608696
#>  2: 0.32608696
#>  3: 0.10869565
#>  4: 0.07608696
#>  5: 0.22826087
#>  6: 0.15217391
#>  7: 0.21739130
#>  8: 0.10869565
#>  9: 0.10869565
#> 10: 0.07608696


# Nested Resampling

afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 4)

resampling_outer = rsmp("cv", folds = 3)
rr = resample(task, afs, resampling_outer, store_models = TRUE)

# retrieve inner feature selection results.
extract_inner_fselect_results(rr)
#>    iteration bill_depth bill_length body_mass flipper_length island    sex
#>        <int>     <lgcl>      <lgcl>    <lgcl>         <lgcl> <lgcl> <lgcl>
#> 1:         1      FALSE        TRUE      TRUE          FALSE   TRUE  FALSE
#> 2:         2       TRUE        TRUE      TRUE          FALSE  FALSE  FALSE
#> 3:         3      FALSE        TRUE     FALSE           TRUE   TRUE  FALSE
#>      year classif.ce                              features n_features  task_id
#>    <lgcl>      <num>                                <list>      <int>   <char>
#> 1:  FALSE 0.03947368          bill_length,body_mass,island          3 penguins
#> 2:   TRUE 0.06578947 bill_depth,bill_length,body_mass,year          4 penguins
#> 3:  FALSE 0.03896104     bill_length,flipper_length,island          3 penguins
#>                 learner_id resampling_id
#>                     <char>        <char>
#> 1: classif.rpart.fselector            cv
#> 2: classif.rpart.fselector            cv
#> 3: classif.rpart.fselector            cv

# performance scores estimated on the outer resampling
rr$score()
#>     task_id              learner_id resampling_id iteration classif.ce
#>      <char>                  <char>        <char>     <int>      <num>
#> 1: penguins classif.rpart.fselector            cv         1 0.05217391
#> 2: penguins classif.rpart.fselector            cv         2 0.06956522
#> 3: penguins classif.rpart.fselector            cv         3 0.07894737
#> Hidden columns: task, learner, resampling, prediction_test

# unbiased performance of the final model trained on the full data set
rr$aggregate()
#> classif.ce 
#>  0.0668955 
# }