recipes can assign one or more roles to each column in
the data. The roles are not restricted to a predefined set; they can be
anything. For most conventional situations, they are typically
“predictor” and/or “outcome”. Additional roles enable targeted step
operations on specific variables or groups of variables.
When a recipe is created using the formula interface, the formula defines
the roles for all columns of the data set. summary() can be
used to view a tibble containing information regarding the roles.
library(recipes)
recipe(Species ~ ., data = iris) |> summary()
#> # A tibble: 5 × 4
#>   variable     type      role      source  
#>   <chr>        <list>    <chr>     <chr>   
#> 1 Sepal.Length <chr [2]> predictor original
#> 2 Sepal.Width  <chr [2]> predictor original
#> 3 Petal.Length <chr [2]> predictor original
#> 4 Petal.Width  <chr [2]> predictor original
#> 5 Species      <chr [3]> outcome   original
recipe( ~ Species, data = iris) |> summary()
#> # A tibble: 1 × 4
#>   variable type      role      source  
#>   <chr>    <list>    <chr>     <chr>   
#> 1 Species  <chr [3]> predictor original
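Because summary() returns an ordinary tibble, the role information can be filtered like any other data frame. The following is a minimal sketch rather than a prescribed workflow; the object names role_info and predictor_names are illustrative and not part of recipes.

```r
library(recipes)

# Build a recipe with the formula interface and keep its role summary
rec <- recipe(Species ~ ., data = iris)
role_info <- summary(rec)

# Names of the columns whose role is "predictor"
predictor_names <- role_info$variable[role_info$role == "predictor"]
predictor_names

# Number of columns assigned to each role
table(role_info$role)
```

This kind of lookup can be handy for confirming that a formula assigned the roles you intended before adding any steps.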
recipe(Sepal.Length + Sepal.Width ~ ., data = iris) |> summary()
#> # A tibble: 5 × 4
#>   variable     type      role      source  
#>   <chr>        <list>    <chr>     <chr>   
#> 1 Petal.Length <chr [2]> predictor original
#> 2 Petal.Width  <chr [2]> predictor original
#> 3 Species      <chr [3]> predictor original
#> 4 Sepal.Length <chr [2]> outcome   original
#> 5 Sepal.Width  <chr [2]> outcome   original
These roles can be updated despite this initial assignment.
update_role() can modify a single existing role:
library(modeldata)
data(biomass)
recipe(HHV ~ ., data = biomass) |> 
  update_role(dataset, new_role = "dataset split variable") |> 
  update_role(sample, new_role = "sample ID") |> 
  summary()
#> # A tibble: 8 × 4
#>   variable type      role                   source  
#>   <chr>    <list>    <chr>                  <chr>   
#> 1 sample   <chr [3]> sample ID              original
#> 2 dataset  <chr [3]> dataset split variable original
#> 3 carbon   <chr [2]> predictor              original
#> 4 hydrogen <chr [2]> predictor              original
#> 5 oxygen   <chr [2]> predictor              original
#> 6 nitrogen <chr [2]> predictor              original
#> 7 sulfur   <chr [2]> predictor              original
#> 8 HHV      <chr [2]> outcome                original
When you want to get rid of a role for a column, use
remove_role().
recipe(HHV ~ ., data = biomass) |> 
  remove_role(sample, old_role = "predictor") |> 
  summary()
#> # A tibble: 8 × 4
#>   variable type      role      source  
#>   <chr>    <list>    <chr>     <chr>   
#> 1 sample   <chr [3]> <NA>      original
#> 2 dataset  <chr [3]> predictor original
#> 3 carbon   <chr [2]> predictor original
#> 4 hydrogen <chr [2]> predictor original
#> 5 oxygen   <chr [2]> predictor original
#> 6 nitrogen <chr [2]> predictor original
#> 7 sulfur   <chr [2]> predictor original
#> 8 HHV      <chr [2]> outcome   original
The recipe represents the lack of a role as NA, which means that
the variable is used in the recipe, but does not yet have a declared
role. Setting the role manually to NA is not allowed:
recipe(HHV ~ ., data = biomass) |> 
  update_role(sample, new_role = NA_character_)
#> Error in `update_role()`:
#> ! `new_role` must be a single string, not a character `NA`.
When a column will be used in more than one
context, add_role() can create additional roles:
multi_role <- recipe(HHV ~ ., data = biomass) |> 
  update_role(dataset, new_role = "dataset split variable") |> 
  update_role(sample, new_role = "sample ID") |> 
  # Roles below from https://wordcounter.net/random-word-generator
  add_role(sample, new_role = "jellyfish") 
multi_role |> 
  summary()
#> # A tibble: 9 × 4
#>   variable type      role                   source  
#>   <chr>    <list>    <chr>                  <chr>   
#> 1 sample   <chr [3]> sample ID              original
#> 2 sample   <chr [3]> jellyfish              original
#> 3 dataset  <chr [3]> dataset split variable original
#> 4 carbon   <chr [2]> predictor              original
#> 5 hydrogen <chr [2]> predictor              original
#> 6 oxygen   <chr [2]> predictor              original
#> 7 nitrogen <chr [2]> predictor              original
#> 8 sulfur   <chr [2]> predictor              original
#> 9 HHV      <chr [2]> outcome                original
If a variable has multiple existing roles and you want to update one
of them, the additional old_role argument to
update_role() must be used to resolve any ambiguity.
multi_role |>
  update_role(sample, new_role = "flounder", old_role = "jellyfish") |>
  summary()
#> # A tibble: 9 × 4
#>   variable type      role                   source  
#>   <chr>    <list>    <chr>                  <chr>   
#> 1 sample   <chr [3]> sample ID              original
#> 2 sample   <chr [3]> flounder               original
#> 3 dataset  <chr [3]> dataset split variable original
#> 4 carbon   <chr [2]> predictor              original
#> 5 hydrogen <chr [2]> predictor              original
#> 6 oxygen   <chr [2]> predictor              original
#> 7 nitrogen <chr [2]> predictor              original
#> 8 sulfur   <chr [2]> predictor              original
#> 9 HHV      <chr [2]> outcome                original
Additional variable roles allow you to use has_role() in
combination with other selection methods (see ?selections)
to target specific variables in subsequent processing steps. For
example, in the following recipe, adding the role
"nocenter" to the HHV column lets you use
-has_role("nocenter") to exclude HHV when
centering all_predictors().
multi_role |> 
  add_role(HHV, new_role = "nocenter") |> 
  step_center(all_predictors(), -has_role("nocenter")) |> 
  prep(training = biomass, retain = TRUE) |> 
  bake(new_data = NULL) |> 
  head()
#> # A tibble: 6 × 8
#>   sample                 dataset  carbon hydrogen oxygen nitrogen  sulfur   HHV
#>   <chr>                  <chr>     <dbl>    <dbl>  <dbl>    <dbl>   <dbl> <dbl>
#> 1 Akhrot Shell           Training  1.52    0.181    4.37  -0.667  -0.234   20.0
#> 2 Alabama Oak Wood Waste Training  1.21    0.241    2.73  -0.877  -0.234   19.2
#> 3 Alder                  Training -0.475   0.341    7.68  -0.967  -0.214   18.3
#> 4 Alfalfa                Training -3.19   -0.489   -2.97   2.22   -0.0736  18.2
#> 5 Alfalfa Seed Straw     Training -1.53   -0.0586   2.15  -0.0772 -0.214   18.4
#> 6 Alfalfa Stalks         Training -2.89    0.291    1.63   0.963  -0.134   18.5
The selector all_numeric_predictors() can also be used
in place of the compound specification above.
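As a sketch of that alternative, the recipe below centers the biomass predictors with all_numeric_predictors() alone. It assumes the same role updates for sample and dataset as in the earlier examples; the object names centered and predictor_means are illustrative only.

```r
library(recipes)
library(modeldata)
data(biomass)

centered <- recipe(HHV ~ ., data = biomass) |>
  update_role(dataset, new_role = "dataset split variable") |>
  update_role(sample, new_role = "sample ID") |>
  # Selects only columns that are numeric and have the "predictor" role
  step_center(all_numeric_predictors()) |>
  prep(training = biomass) |>
  bake(new_data = NULL)

# Column means of the centered predictors; these should be zero up to
# floating-point error
numeric_predictors <- c("carbon", "hydrogen", "oxygen", "nitrogen", "sulfur")
predictor_means <- colMeans(centered[, numeric_predictors])
round(predictor_means, 10)
```

Since HHV is the outcome here, it is not selected by all_numeric_predictors() and is left uncentered.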
You can start a recipe without any roles:
recipe(biomass) |> 
  summary()
#> # A tibble: 8 × 4
#>   variable type      role  source  
#>   <chr>    <list>    <chr> <chr>   
#> 1 sample   <chr [3]> <NA>  original
#> 2 dataset  <chr [3]> <NA>  original
#> 3 carbon   <chr [2]> <NA>  original
#> 4 hydrogen <chr [2]> <NA>  original
#> 5 oxygen   <chr [2]> <NA>  original
#> 6 nitrogen <chr [2]> <NA>  original
#> 7 sulfur   <chr [2]> <NA>  original
#> 8 HHV      <chr [2]> <NA>  original
and roles can be added in bulk as needed:
recipe(biomass) |> 
  update_role(contains("gen"), new_role = "lunchroom") |> 
  update_role(sample, HHV, new_role = "snail") |> 
  summary()
#> # A tibble: 8 × 4
#>   variable type      role      source  
#>   <chr>    <list>    <chr>     <chr>   
#> 1 sample   <chr [3]> snail     original
#> 2 dataset  <chr [3]> <NA>      original
#> 3 carbon   <chr [2]> <NA>      original
#> 4 hydrogen <chr [2]> lunchroom original
#> 5 oxygen   <chr [2]> lunchroom original
#> 6 nitrogen <chr [2]> lunchroom original
#> 7 sulfur   <chr [2]> <NA>      original
#> 8 HHV      <chr [2]> snail     original
All recipes steps have a role argument that lets you set
the role of new columns generated by the step. When a step
modifies a column in place, the role is never modified. For example,
?step_center has the documentation:
role: Not used by this step since no new variables are created
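One way to check this behavior is to compare the role summary before and after preparing a recipe that only centers columns. This is a small sketch using iris as an assumed example; the names before, after, and roles_unchanged are illustrative.

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris)

# step_center() modifies the numeric predictors in place
prepped <- rec |>
  step_center(all_numeric_predictors()) |>
  prep()

before <- summary(rec)
after <- summary(prepped)

# Match rows by variable name, then compare the roles
roles_unchanged <- all(
  after$role[match(before$variable, after$variable)] == before$role
)
roles_unchanged
```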
In other cases, the roles are defaulted to a relevant value based on the
context. For example, ?step_dummy has
role: For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the binary dummy variable columns created by the original variables will be used as predictors in a model.
So, by default, they are predictors but don’t have to be:
recipe( ~ ., data = iris) |> 
  step_dummy(Species) |> 
  prep() |> 
  bake(new_data = NULL, all_predictors()) |> 
  dplyr::select(starts_with("Species")) |> 
  names()
#> [1] "Species_versicolor" "Species_virginica"
# or something else
recipe( ~ ., data = iris) |> 
  step_dummy(Species, role = "trousers") |> 
  prep() |> 
  bake(new_data = NULL, has_role("trousers")) |> 
  names()
#> [1] "Species_versicolor" "Species_virginica"
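The role given to the new columns can also be inspected without baking, by calling summary() on the prepared recipe. The sketch below assumes that columns created by a step are marked with source "derived" in that summary; the names dummy_roles and derived are illustrative.

```r
library(recipes)

dummy_roles <- recipe(~ ., data = iris) |>
  step_dummy(Species, role = "trousers") |>
  prep() |>
  summary()

# Keep only the columns created by the step (assumed source: "derived")
derived <- dummy_roles[dummy_roles$source == "derived", c("variable", "role", "source")]
derived
```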