Basic usage

The goal of this vignette is to walk through modeltuning usage in detail. We’ll train a classification model on the iris dataset to predict whether a flower’s species is virginica or not.

Load Packages

library(e1071)
library(modeltuning) # devtools::install_github("dmolitor/modeltuning")
library(yardstick)

Data Prep

First, let’s generate a larger synthetic dataset by stacking ten copies of iris into one data frame, adding random noise to the features, and recoding the outcome as a binary factor (virginica or not).

iris_new <- do.call(
  what = rbind,
  args = replicate(n = 10, iris, simplify = FALSE)
) |>
  transform(
    Sepal.Length = jitter(Sepal.Length, 0.1),
    Sepal.Width = jitter(Sepal.Width, 0.1),
    Petal.Length = jitter(Petal.Length, 0.1),
    Petal.Width = jitter(Petal.Width, 0.1),
    Species = factor(Species == "virginica")
  )

# Shuffle the dataset
iris_new <- iris_new[sample(1:nrow(iris_new), nrow(iris_new)), ]
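
Since the outcome is now a binary factor, let’s also confirm the class balance. With ten copies of iris we expect exactly 500 virginica (TRUE) rows:

# Check the class balance of the binary outcome
table(iris_new$Species)
# FALSE  TRUE 
#  1000   500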

# Quick overview of the dataset
summary(iris_new[, 1:4])
#   Sepal.Length    Sepal.Width     Petal.Length     Petal.Width     
#  Min.   :4.298   Min.   :1.998   Min.   :0.9983   Min.   :0.09802  
#  1st Qu.:5.100   1st Qu.:2.799   1st Qu.:1.5984   1st Qu.:0.29982  
#  Median :5.799   Median :3.001   Median :4.3500   Median :1.30119  
#  Mean   :5.843   Mean   :3.057   Mean   :3.7580   Mean   :1.19931  
#  3rd Qu.:6.400   3rd Qu.:3.302   3rd Qu.:5.1000   3rd Qu.:1.80069  
#  Max.   :7.902   Max.   :4.401   Max.   :6.9014   Max.   :2.50172

Function arguments

Common arguments

The following modeling approach holds for the CV, GridSearch, and GridSearchCV classes, which are all slight variations of one another. Their common arguments are as follows:

  • learner: This is where you pass your predictive modeling function; in our case a Support Vector Machine via e1071::svm.

  • scorer: This is a named list of metric functions that will evaluate the model’s predictive performance. Each metric function should have two arguments, truth and estimate, which take the true outcome values and the predicted outcome values, and it should output a scalar numeric score. The yardstick package provides a wide array of these metric functions and should cover most common cases. E.g. for the RMSE of a regression, scorer = list(rmse = yardstick::rmse_vec).

  • learner_args: This is a named list of function arguments that get passed directly to the learner function. For example, the e1071::svm function takes a type argument specifying the regression or classification task. You could specify a classification task as learner_args = list(type = "C-classification").

  • scorer_args: This is a named list of function arguments to pass to the scorer functions in scorer. This list should have one element per element in scorer. E.g. if scorer = list(rmse = rmse_vec, mae = mae_vec) then scorer_args = list(rmse = list(...), mae = list(...)).

  • prediction_args: This is similar to learner_args: a named list of function arguments passed to the predict method. E.g. our SVM learner’s predict method has a probability argument that controls whether it returns outcome classes or class probabilities. Like scorer_args, this list should have one element per element in scorer, so with scorer = list(auc = roc_auc_vec) you would request class probabilities as prediction_args = list(auc = list(probability = TRUE)).

  • convert_predictions: A named list of functions to transform the output of predict(...) into a vector of predictions, since a model’s predicted values are not always a vector. E.g. predict(svm_model, probability = TRUE) returns the predicted classes with a matrix of class probabilities for both classes attached as an attribute, while predict(svm_model, probability = FALSE) returns just the predicted classes. Calculating model accuracy needs class predictions, while ROC AUC requires class probabilities. Suppose that scorer = list(accuracy = accuracy_vec, auc = roc_auc_vec). To ensure that accuracy gets class predictions and ROC AUC gets class probabilities, you can provide the corresponding prediction arguments prediction_args = list(accuracy = NULL, auc = list(probability = TRUE)) and then convert the probability predictions into a vector with convert_predictions = list(accuracy = NULL, auc = function(.x) attr(.x, "probabilities")[, "FALSE"]). The sketch just below this list ties these arguments together.
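To make these pieces concrete, here is a minimal sketch of how the scorer-related arguments line up by name. The custom logloss function is hypothetical (not part of modeltuning or yardstick); it only illustrates the required truth/estimate signature and how extra arguments flow in via scorer_args.

# A custom scorer: takes true values and predictions, returns a scalar.
# `truth` is a factor with levels FALSE/TRUE; `estimate` is the predicted
# probability of the FALSE class (the first factor level).
logloss <- function(truth, estimate, eps = 1e-15) {
  p <- pmin(pmax(estimate, eps), 1 - eps)
  y <- as.integer(truth == "FALSE")
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# One element per scorer in each list, matched by name
scorer              <- list(accuracy = accuracy_vec, logloss = logloss)
scorer_args         <- list(accuracy = NULL, logloss = list(eps = 1e-10))
prediction_args     <- list(accuracy = NULL, logloss = list(probability = TRUE))
convert_predictions <- list(
  accuracy = NULL,
  logloss = function(.x) attr(.x, "probabilities")[, "FALSE"]
)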

Cross validation arguments

The following arguments are specific to the CV cross validation class (they also appear in GridSearchCV):

  • splitter: A function that splits the data into cross validation folds, e.g. cv_split as used in the examples below.

  • splitter_args: A named list of function arguments passed to splitter; e.g. splitter_args = list(v = 3) for 3-fold cross validation.

Grid search arguments

The following arguments are specific to the GridSearch grid search class (tune_params and optimize_score also appear in GridSearchCV):

  • tune_params: A named list of hyper-parameter values to search over; the search grid is formed from combinations of these values.

  • evaluation_data: A named list with elements x (the features) and y (the outcome) on which each candidate model is scored.

  • optimize_score: Whether the best parameter combination should maximize ("max") or minimize the score; in our case optimize_score = "max", since higher ROC AUC is better.
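
Since a grid search conventionally evaluates every combination of the tuning values, the tune_params grid used in the examples below implies 6 × 3 = 18 candidate models. Base R’s expand.grid enumerates the same Cartesian product (this is just an illustration of the grid’s size, not a modeltuning call):

# The Cartesian product of the tuning values: 6 costs x 3 kernels
grid <- expand.grid(
  cost = c(0.01, 0.1, 0.5, 1, 3, 6),
  kernel = c("polynomial", "radial", "sigmoid"),
  stringsAsFactors = FALSE
)
nrow(grid)
# [1] 18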

Examples

We’ll show simple examples of each of CV, GridSearch and GridSearchCV.

CV

iris_cv <- CV$new(
  learner = svm,
  learner_args = list(type = "C-classification", probability = TRUE),
  splitter = cv_split,
  splitter_args = list(v = 3),
  scorer = list(roc_auc = roc_auc_vec), 
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"])
)

# Fit the cross-validated model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_new)

iris_cv_fitted$mean_metrics
# $roc_auc
# [1] 0.9984446

GridSearch

iris_new_train <- iris_new[1:1000, ]
iris_new_eval <- iris_new[1001:nrow(iris_new), ]

iris_grid <- GridSearch$new(
  learner = svm,
  tune_params = list(
    cost = c(0.01, 0.1, 0.5, 1, 3, 6),
    kernel = c("polynomial", "radial", "sigmoid")
  ),
  learner_args = list(type = "C-classification", probability = TRUE),
  evaluation_data = list(x = iris_new_eval[, -5], y = iris_new_eval[, 5]),
  scorer = list(roc_auc = roc_auc_vec), 
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
  optimize_score = "max"
)

# Fit the grid search model on the training data
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new_train)

iris_grid_fitted$best_params
# $cost
# [1] 6
# 
# $kernel
# [1] "polynomial"
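
With the best hyper-parameters in hand, one way to produce a final model is to refit directly with e1071::svm. This is a hedged sketch, refitting by hand rather than through any modeltuning accessor:

# Refit a final SVM on the training data with the selected parameters
best <- iris_grid_fitted$best_params
final_model <- svm(
  Species ~ .,
  data = iris_new_train,
  type = "C-classification",
  probability = TRUE,
  cost = best$cost,
  kernel = best$kernel
)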

GridSearchCV

iris_grid <- GridSearchCV$new(
  learner = svm,
  tune_params = list(
    cost = c(0.01, 0.1, 0.5, 1, 3, 6),
    kernel = c("polynomial", "radial", "sigmoid")
  ),
  learner_args = list(type = "C-classification", probability = TRUE),
  splitter = cv_split,
  splitter_args = list(v = 3),
  scorer = list(roc_auc = roc_auc_vec), 
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
  optimize_score = "max"
)

# Fit the cross-validated grid search model
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new)

iris_grid_fitted$best_params
# $cost
# [1] 6
# 
# $kernel
# [1] "polynomial"