tidyfit
is an R
package that facilitates
and automates linear and nonlinear regression and classification
modeling in a tidy environment. The package includes methods such as the
Lasso, PLS, timevarying parameter or Bayesian model averaging
regressions, and many more. The aim is threefold:
caret
, the aim is not to cover a comprehensive set of
available packages, but rather a curated list of the most useful
ones.tidy
data,
including verbs for regression and classification, automatic
consideration of grouped data, and tidy
output formats
throughout.tidyfit
builds on the tidymodels
suite, but
emphasizes automated modeling with a focus on grouped data, model
comparisons, and highvolume analytics. The objective is to make model
fitting, cross validation and model output very simple and standardized
across all methods, with many necessary methodspecific transformations
handled in the background.
tidyfit
can be installed from CRAN or the development
version from GitHub
with:
# CRAN
install.packages("tidyfit")
# Dev version
# install.packages("devtools")
::install_github("jpfitzinger/tidyfit")
devtools
library(tidyfit)
tidyfit
includes 3 deceptively simple functions:
regress()
classify()
m()
All 3 of these functions return a tidyfit.models
frame,
which is a data frame containing information about fitted regression and
classification models. regress
and classify
perform regression and classification on tidy data. The functions ingest
a tibble
, prepare input data for the models by splitting
groups, partitioning cross validation slices and handling any necessary
adjustments and transformations. The data is ultimately passed to the
model wrapper m()
which fits the models.
To illustrate basic usage, suppose we would like to fit a financial factor regression for 10 industries with exponential weighting, comparing a WLS and a weighted LASSO regression:
::handlers(global=TRUE)
progressr
::Factor_Industry_Returns %>%
tidyfitgroup_by(Industry) %>% # Ensures that a model is fitted for each industry
mutate(Weight = 0.96^seq(n(), 1)) %>% # Exponential weights
# 'regress' allows flexible standardized regression analysis in a single line of code
regress(
~ ., # Uses normal formula syntax
Return m("lasso"), # LASSO regression wrapper, 'lambda' grid set to default
m("lm"), # OLS wrapper (can add as many wrappers as necessary here)
.cv = "initial_time_split", # Crossvalidation method for optimal 'lambda' in LASSO
.mask = "Date", # 'Date' columns should be excluded
.weights = "Weight" # Specifies the weights column
> models_df
)
# Get coefficients frame
coef(models_df)
The syntax is identical for classify
.
m
is a powerful wrapper for many different regression
and classification techniques that can be used with regress
and classify
, or as a standalone function:
m(
<method>, # e.g. "lm" or "lasso"
# not passed when used within regress or classify
formula, data, # Args passed to underlying method, e.g. to stats::lm for OLS regression
... )
An important feature of m()
is that all arguments can be
passed as vectors, allowing generalized hyperparameter tuning or
scenario analysis for any method:
m("lasso", lambda = seq(0, 1, by = 0.1))
m("robust", method = c("M", "MM"))
m("subset", method = c("forward", "backward"))
m("glm", family = list(binomial(link="logit"), binomial(link="probit")))
Arguments that are meant to be vectors (e.g. weights) are recognized by the function and not interpreted as grids.
tidyfit
tidyfit
currently implements the following methods:
Method  Name  Package  Regression  Classification 

Linear (generalized) regression or classification  
OLS  lm 
stats

yes  no 
GLS  glm 
stats

yes  yes 
Robust regression (e.g. Huber loss)  robust 
MASS

yes  no 
Quantile regression  quantile 
quantreg

yes  no 
ANOVA  anova 
stats

yes  yes 
Regression or classification with L1 and L2 penalties  
LASSO  lasso 
glmnet

yes  yes 
Ridge  ridge 
glmnet

yes  yes 
Group LASSO  group_lasso 
gglasso

yes  yes 
Adaptive LASSO  adalasso 
glmnet

yes  yes 
ElasticNet  enet 
glmnet

yes  yes 
Machine Learning  
Gradient boosting regression  boost 
mboost

yes  yes 
Support vector machine  svm 
e1071

yes  yes 
Random forest  rf 
randomForest

yes  yes 
Quantile random forest  quantile_rf 
quantregForest

yes  yes 
Neural Network  nnet 
nnet

yes  no 
Factor regressions  
Principal components regression  pcr 
pls

yes  no 
Partial least squares  plsr 
pls

yes  no 
Hierarchical feature regression  hfr 
hfr

yes  no 
Subset selection  
Best subset selection  subset 
bestglm

yes  yes 
Genetic Algorithm  genetic 
gaselect

yes  no 
Generaltospecific  gets 
gets

yes  no 
Bayesian regression  
Bayesian regression  bayes 
arm

yes  yes 
Bayesian Ridge  bridge 
monomvn

yes  no 
Bayesian Lasso  blasso 
monomvn

yes  no 
Bayesian model averaging  bma 
BMS

yes  no 
Bayesian Spike and Slab  spikeslab 
BoomSpikeSlab

yes  yes 
Bayesian timevarying parameters regression  tvp 
shrinkTVP

yes  no 
Mixedeffects modeling  
Generalized mixedeffects regression  glmm 
lme4

yes  yes 
Specialized time series methods  
Markovswitching regression  mslm 
MSwM

yes  no 
Feature selection  
Pearson correlation  cor 
stats

no  no 
Chisquared test  chisq 
stats

yes  yes 
Minimum redundancy, maximum relevance  mrmr 
mRMRe

yes  yes 
ReliefF  relief 
CORElearn

yes  yes 
See ?m
for additional information.
It is important to note that the above list is not complete, since
some of the methods encompass multiple algorithms. For instance,
“subset” can be used to perform forward, backward or exhaustive search
selection using leaps
. Similarly, “lasso” includes certain
grouped LASSO implementations that can be fitted with
glmnet
.
In this section, a minimal workflow is used to demonstrate how the package works. For more detailed guides of specialized topics, or simply for further reading, follow these links:
tidyfit
includes a data set of financial FamaFrench
factor returns freely available here.
The data set includes monthly industry returns for 10 industries, as
well as monthly factor returns for 5 factors:
< tidyfit::Factor_Industry_Returns
data
# Calculate excess return
< data %>%
data mutate(Return = Return  RF) %>%
select(RF)
data#> # A tibble: 7,080 × 8
#> Date Industry Return `MktRF` SMB HML RMW CMA
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 196307 NoDur 0.76 0.39 0.44 0.89 0.68 1.23
#> 2 196308 NoDur 4.64 5.07 0.75 1.68 0.36 0.34
#> 3 196309 NoDur 1.96 1.57 0.55 0.08 0.71 0.29
#> 4 196310 NoDur 2.36 2.53 1.37 0.14 2.8 2.02
#> 5 196311 NoDur 1.4 0.85 0.89 1.81 0.51 2.31
#> 6 196312 NoDur 2.52 1.83 2.07 0.08 0.03 0.04
#> 7 196401 NoDur 0.49 2.24 0.11 1.47 0.17 1.51
#> 8 196402 NoDur 1.61 1.54 0.3 2.74 0.05 0.9
#> 9 196403 NoDur 2.77 1.41 1.36 3.36 2.21 3.19
#> 10 196404 NoDur 0.77 0.1 1.59 0.58 1.27 1.04
#> # ℹ 7,070 more rows
We will only use a subset of the data to keep things simple:
< data %>%
df_train filter(Date > 201800 & Date < 202000)
< data %>%
df_test filter(Date >= 202000)
Before beginning with the estimation, we can activate the progress
bar visualization. This allows us to gauge estimation progress along the
way. tidyfit
uses the progressr
package
internally to generate a progress bar — run
progressr::handlers(global=TRUE)
to activate progress bars
in your environment.
For purposes of this demonstration, the objective will be to fit an
ElasticNet regression for each industry group, and compare results to a
robust least squares regression. This can be done with
regress
after grouping the data. For grouped data, the
functions regress
and classify
estimate models
for each group independently:
< df_train %>%
model_frame group_by(Industry) %>%
regress(Return ~ .,
m("enet"),
m("robust", method = "MM", psi = MASS::psi.huber),
.cv = "initial_time_split", .mask = "Date")
Note that the penalty and mixing parameters are chosen automatically
using a time series train/test split
(rsample::initial_time_split
), with a hyperparameter grid
set by the package dials
. See ?regress
for
additional CV methods. A custom grid can easily be passed using
lambda = c(...)
and/or alpha = c(...)
in the
m("enet")
wrapper.
The resulting tidyfit.models
frame consists of 3
components:
Industry
column, the model name, and grid ID (the ID for
the model settings used).estimator_fct
, size (MB)
and
model_object
columns. These columns contain information on
the model itself. The model_object
is the fitted
tidyFit
model (an R6
class that contains the
model object as well as any additional information needed to perform
predictions or access coefficients)settings
, as well as (if applicable)
messages
and warnings
.< model_frame %>%
subset_mod_frame filter(Industry %in% unique(Industry)[1:2])
subset_mod_frame#> # A tibble: 4 × 8
#> Industry model estimator_fct `size (MB)` grid_id model_object settings
#> <chr> <chr> <chr> <dbl> <chr> <list> <list>
#> 1 Enrgy enet glmnet::glmnet 1.21 #001100 <tidyFit> <tibble>
#> 2 Shops enet glmnet::glmnet 1.21 #001049 <tidyFit> <tibble>
#> 3 Enrgy robust MASS::rlm 0.0638 #0010000 <tidyFit> <tibble>
#> 4 Shops robust MASS::rlm 0.0638 #0010000 <tidyFit> <tibble>
#> # ℹ 1 more variable: warnings <chr>
Let’s unnest the settings columns:
%>%
subset_mod_frame ::unnest(settings, keep_empty = TRUE)
tidyr#> # A tibble: 4 × 13
#> Industry model estimator_fct `size (MB)` grid_id model_object alpha weights
#> <chr> <chr> <chr> <dbl> <chr> <list> <dbl> <list>
#> 1 Enrgy enet glmnet::glmnet 1.21 #001100 <tidyFit> 0 <NULL>
#> 2 Shops enet glmnet::glmnet 1.21 #001049 <tidyFit> 0 <NULL>
#> 3 Enrgy robust MASS::rlm 0.0638 #0010000 <tidyFit> NA <NULL>
#> 4 Shops robust MASS::rlm 0.0638 #0010000 <tidyFit> NA <NULL>
#> # ℹ 5 more variables: family <chr>, lambda <dbl>, method <chr>, psi <list>,
#> # warnings <chr>
The tidyfit.models
frame can be used to access
additional information. Specifically, we can do 4 things:
The fitted tidyFit models are stored as an
R6
class in the model_object
column and can be
addressed directly with generics such as coef
or
summary
. The underlying object (e.g. an lm
class fitted model) is given in ...$object
(see here
for another example):
%>%
subset_mod_frame mutate(fitted_model = map(model_object, ~.$object))
#> # A tibble: 4 × 9
#> Industry model estimator_fct `size (MB)` grid_id model_object settings
#> <chr> <chr> <chr> <dbl> <chr> <list> <list>
#> 1 Enrgy enet glmnet::glmnet 1.21 #001100 <tidyFit> <tibble>
#> 2 Shops enet glmnet::glmnet 1.21 #001049 <tidyFit> <tibble>
#> 3 Enrgy robust MASS::rlm 0.0638 #0010000 <tidyFit> <tibble>
#> 4 Shops robust MASS::rlm 0.0638 #0010000 <tidyFit> <tibble>
#> # ℹ 2 more variables: warnings <chr>, fitted_model <list>
To predict, we need data with the same columns as
the input data and simply use the generic predict
function.
Groups are respected and if the response variable is in the data, it is
included as a truth
column in the resulting object:
predict(subset_mod_frame, data)
#> # A tibble: 2,832 × 4
#> # Groups: Industry, model [4]
#> Industry model prediction truth
#> <chr> <chr> <dbl> <dbl>
#> 1 Enrgy enet 3.77 2.02
#> 2 Enrgy enet 3.90 3.69
#> 3 Enrgy enet 2.40 3.91
#> 4 Enrgy enet 3.30 0.61
#> 5 Enrgy enet 0.762 1.43
#> 6 Enrgy enet 0.293 4.36
#> 7 Enrgy enet 3.68 4.54
#> 8 Enrgy enet 2.31 0.8
#> 9 Enrgy enet 7.26 1.09
#> 10 Enrgy enet 2.53 3.74
#> # ℹ 2,822 more rows
Finally, we can obtain a tidy frame of the
coefficients using the generic coef
function:
< coef(subset_mod_frame)
estimates
estimates#> # A tibble: 24 × 5
#> # Groups: Industry, model [4]
#> Industry model term estimate model_info
#> <chr> <chr> <chr> <dbl> <list>
#> 1 Enrgy enet (Intercept) 0.955 <tibble [1 × 4]>
#> 2 Enrgy enet MktRF 1.20 <tibble [1 × 4]>
#> 3 Enrgy enet SMB 0.703 <tibble [1 × 4]>
#> 4 Enrgy enet HML 0.00208 <tibble [1 × 4]>
#> 5 Enrgy enet RMW 0.622 <tibble [1 × 4]>
#> 6 Enrgy enet CMA 1.32 <tibble [1 × 4]>
#> 7 Shops enet (Intercept) 1.03 <tibble [1 × 4]>
#> 8 Shops enet MktRF 0.0849 <tibble [1 × 4]>
#> 9 Shops enet SMB 0.0353 <tibble [1 × 4]>
#> 10 Shops enet HML 0.0149 <tibble [1 × 4]>
#> # ℹ 14 more rows
The estimates contain additional methodspecific information that is
nested in model_info
. This can include standard errors,
tvalues and similar information:
::unnest(estimates, model_info)
tidyr#> # A tibble: 24 × 8
#> # Groups: Industry, model [4]
#> Industry model term estimate lambda dev.ratio std.error statistic
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Enrgy enet (Intercept) 0.955 0.536 0.812 NA NA
#> 2 Enrgy enet MktRF 1.20 0.536 0.812 NA NA
#> 3 Enrgy enet SMB 0.703 0.536 0.812 NA NA
#> 4 Enrgy enet HML 0.00208 0.536 0.812 NA NA
#> 5 Enrgy enet RMW 0.622 0.536 0.812 NA NA
#> 6 Enrgy enet CMA 1.32 0.536 0.812 NA NA
#> 7 Shops enet (Intercept) 1.03 50.1 0.173 NA NA
#> 8 Shops enet MktRF 0.0849 50.1 0.173 NA NA
#> 9 Shops enet SMB 0.0353 50.1 0.173 NA NA
#> 10 Shops enet HML 0.0149 50.1 0.173 NA NA
#> # ℹ 14 more rows
Additional generics such as fitted
or resid
can be used to obtain more information on the models.
Suppose we would like to evaluate the relative performance of the two
methods. This becomes exceedingly simple using the
yardstick
package:
%>%
model_frame # Generate predictions
predict(df_test) %>%
# Calculate RMSE
::rmse(truth, prediction) %>%
yardstick# Plot
ggplot(aes(Industry, .estimate)) +
geom_col(aes(fill = model), position = position_dodge())
The ElasticNet performs a little better (unsurprising really, given the small data set).
A more detailed regression analysis of Boston house price data using a panel of regularized regression estimators can be found here.