The EnsembleFSResult
stores the results of ensemble feature selection.
It includes methods for evaluating the stability of the feature selection process and for ranking the selected features, among others.
The function ensemble_fselect()
returns an object of this class.
S3 Methods
as.data.table.EnsembleFSResult(x, benchmark_result = TRUE)
Returns a tabular view of the ensemble feature selection.
EnsembleFSResult -> data.table::data.table()
x
(EnsembleFSResult)
benchmark_result
(logical(1))
Whether to add the learner, task and resampling information from the benchmark result.
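For example, a minimal conversion sketch, assuming an efsr object such as the one created in the Examples section below:

library(data.table)
# flat table of the selection results, without benchmark information
tab = as.data.table(efsr, benchmark_result = FALSE)
tab[, .(learner_id, n_features)]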
References
Das, I (1999). “On characterizing the 'knee' of the Pareto curve based on normal-boundary intersection.” Structural Optimization, 18(1-2), 107–115. ISSN 0934-4373.
Public fields
benchmark_result
(mlr3::BenchmarkResult)
The benchmark result.
man
(character(1))
Manual page for this object.
Active bindings
result
(data.table::data.table)
Returns the result of the ensemble feature selection.
n_learners
(numeric(1))
Returns the number of learners used in the ensemble feature selection.
measure
(character(1))
Returns the measure id used in the ensemble feature selection.
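For example (a sketch, assuming an existing efsr object):

efsr$result      # data.table with one row per learner and resampling iteration
efsr$n_learners  # e.g. 2 if two learners were used
efsr$measure     # e.g. "classif.ce"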
Methods
Method new()
Creates a new instance of this R6 class.
Usage
EnsembleFSResult$new(
result,
features,
benchmark_result = NULL,
measure_id,
minimize = TRUE
)
Arguments
result
(data.table::data.table)
The result of the ensemble feature selection. Column names should include "resampling_iteration", "learner_id", "features" and "n_features".
features
(character())
The vector of features of the task that was used in the ensemble feature selection.
benchmark_result
(mlr3::BenchmarkResult)
The benchmark result object.
measure_id
(character(1))
Column name of "result" that corresponds to the measure used.
minimize
(logical(1))
If TRUE (default), lower values of the measure correspond to higher performance.
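A minimal construction sketch with hand-made toy data (the feature names and scores below are assumptions for illustration, not real results):

library(mlr3fselect)
library(data.table)

# one row per learner and resampling iteration;
# the measure column must match measure_id
res = data.table(
  resampling_iteration = c(1L, 1L, 2L, 2L),
  learner_id = rep(c("classif.rpart", "classif.featureless"), 2),
  features = list(c("V1", "V2"), "V3", c("V1", "V4"), "V2"),
  n_features = c(2L, 1L, 2L, 1L),
  classif.ce = c(0.25, 0.45, 0.30, 0.50)
)

efsr = EnsembleFSResult$new(
  result = res,
  features = paste0("V", 1:4),  # all features of the task
  measure_id = "classif.ce",
  minimize = TRUE               # lower classification error is better
)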
Method format()
Helper for print outputs.
Method help()
Opens the corresponding help page referenced by field $man.
Method feature_ranking()
Calculates the feature ranking.
Details
The feature ranking process is built on the following framework: models act as voters, features act as candidates, and voters select certain candidates (features).
The primary objective is to compile these selections into a consensus ranked list of features, effectively forming a committee.
Currently, only the "approval_voting" method is supported, which selects the candidates/features with the highest approval score or selection frequency, i.e. those that appear most often.
Returns
A data.table::data.table listing all the features, ordered by decreasing inclusion probability scores (depending on the method).
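The approval score is simply a selection frequency; a standalone sketch of the idea (an illustration, independent of the class internals):

# three models (voters) approved these feature subsets
sets = list(c("V1", "V2"), c("V1", "V3"), c("V1", "V2"))
votes = table(unlist(sets))
# inclusion probability = number of approvals / number of voters
sort(votes / length(sets), decreasing = TRUE)
#>        V1        V2        V3
#> 1.0000000 0.6666667 0.3333333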
Method stability()
Calculates the stability of the selected features with the stabm package. The results are cached. When the same stability measure is requested again with different arguments, the cache must be reset.
Usage
EnsembleFSResult$stability(
stability_measure = "jaccard",
stability_args = NULL,
global = TRUE,
reset_cache = FALSE
)
Arguments
stability_measure
(character(1))
The stability measure to be used. One of the measures returned by stabm::listStabilityMeasures() in lower case. Default is "jaccard".
stability_args
(list)
Additional arguments passed to the stability measure function.
global
(logical(1))
Whether to calculate the stability globally or for each learner.
reset_cache
(logical(1))
If TRUE, the cached results are ignored.
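For example (a sketch, assuming an efsr object and using only the arguments documented above):

# stability per learner instead of a single global value
efsr$stability(stability_measure = "jaccard", global = FALSE)

# recompute from scratch, e.g. after changing stability_args
efsr$stability(stability_measure = "jaccard", global = FALSE, reset_cache = TRUE)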
Method pareto_front()
This function identifies the Pareto front of the ensemble feature selection process, i.e., the set of points that represent the trade-off between the number of features and performance (e.g. classification error).
Details
Two options are available for the Pareto front:
"empirical"
(default): returns the empirical Pareto front."estimated"
: the Pareto front points are estimated by fitting a linear model with the inversed of the number of features (\(1/x\)) as input and the associated performance scores as output. This method is useful when the Pareto points are sparse and the front assumes a convex shape if better performance corresponds to lower measure values (e.g. classification error), or a concave shape otherwise (e.g. classification accuracy). Theestimated
Pareto front will include points for a number of features ranging from 1 up to the maximum number found in the empirical Pareto front.
Returns
A data.table::data.table with the number of features and the corresponding performance as columns; together these points form the Pareto front.
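For example (a sketch, assuming an efsr object; the type argument is described under knee_points() below):

pf = efsr$pareto_front()                        # empirical front (default)
pf_est = efsr$pareto_front(type = "estimated")  # model-based front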
Method knee_points()
This function implements various knee point identification (KPI) methods, which select points on the Pareto front that achieve an optimal trade-off between performance and number of features. In most cases, only one such point is returned.
Arguments
method
(
character(1)
)
Type of method to use to identify the knee point. See details.
type
(character(1))
Specifies the type of Pareto front to use for the identification of the knee point. See the pareto_front() method for more details.
Details
The available KPI methods are:
"NBI"
(default): The Normal-Boundary Intersection method is a geometry-based method which calculates the perpendicular distance of each point from the line connecting the first and last points of the Pareto front. The knee point is determined as the Pareto point with the maximum distance from this line, see Das (1999).
Returns
A data.table::data.table with the knee point(s) of the Pareto front.
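To make the geometry concrete, here is a standalone sketch of the NBI computation (an illustration of the idea, not the package internals), assuming a front sorted by number of features with at least two distinct values on each axis:

# perpendicular distance of each Pareto point from the line through the
# two extreme points; the knee is the point with the maximum distance
nbi_knee = function(x, y) {
  # rescale both axes to [0, 1] so distances are comparable
  xs = (x - min(x)) / (max(x) - min(x))
  ys = (y - min(y)) / (max(y) - min(y))
  # line through the first and last points: a*x + b*y + cc = 0
  a = ys[length(ys)] - ys[1]
  b = xs[1] - xs[length(xs)]
  cc = -a * xs[1] - b * ys[1]
  d = abs(a * xs + b * ys + cc) / sqrt(a^2 + b^2)
  which.max(d)  # index of the knee point
}

pf = efsr$pareto_front()
pf[nbi_knee(n_features, classif.ce)]  # assumes the measure column is classif.ce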
Examples
# \donttest{
efsr = ensemble_fselect(
fselector = fs("rfe", n_features = 2, feature_fraction = 0.8),
task = tsk("sonar"),
learners = lrns(c("classif.rpart", "classif.featureless")),
init_resampling = rsmp("subsampling", repeats = 2),
inner_resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
terminator = trm("none")
)
# contains the benchmark result
efsr$benchmark_result
#> <BenchmarkResult> of 4 rows with 4 resampling runs
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 sonar classif.rpart.fselector insample 1 0 0
#> 2 sonar classif.featureless.fselector insample 1 0 0
#> 3 sonar classif.rpart.fselector insample 1 0 0
#> 4 sonar classif.featureless.fselector insample 1 0 0
# contains the selected features for each iteration
efsr$result
#> resampling_iteration learner_id features
#> <int> <char> <list>
#> 1: 1 classif.rpart V10,V11,V12,V13,V16,V17,...
#> 2: 1 classif.featureless V27,V34
#> 3: 2 classif.rpart V11,V12,V16
#> 4: 2 classif.featureless V36,V54
#> n_features classif.ce
#> <int> <num>
#> 1: 12 0.2880049
#> 2: 2 0.4892075
#> 3: 3 0.2516189
#> 4: 2 0.4605304
#> importance
#> <list>
#> 1: 12.000000, 9.666667, 9.666667, 7.666667, 7.000000, 6.666667,...
#> 2: 2,1
#> 3: 2.333333,2.333333,1.333333
#> 4: 1.666667,1.333333
#> task learner
#> <list> <list>
#> 1: <TaskClassif:sonar> <AutoFSelector:classif.rpart.fselector>
#> 2: <TaskClassif:sonar> <AutoFSelector:classif.featureless.fselector>
#> 3: <TaskClassif:sonar> <AutoFSelector:classif.rpart.fselector>
#> 4: <TaskClassif:sonar> <AutoFSelector:classif.featureless.fselector>
#> resampling
#> <list>
#> 1: <ResamplingInsample>
#> 2: <ResamplingInsample>
#> 3: <ResamplingInsample>
#> 4: <ResamplingInsample>
# returns the stability of the selected features
efsr$stability(stability_measure = "jaccard")
#> [1] 0.04166667
# returns a ranking of all features
head(efsr$feature_ranking())
#> feature inclusion_probability
#> <char> <num>
#> 1: V11 0.50
#> 2: V12 0.50
#> 3: V16 0.50
#> 4: V10 0.25
#> 5: V13 0.25
#> 6: V17 0.25
# returns the empirical pareto front (nfeatures vs error)
efsr$pareto_front()
#> n_features classif.ce
#> <num> <num>
#> 1: 2 0.4892075
#> 2: 2 0.4605304
#> 3: 3 0.2516189
# }