Scaling and Modeling with tidyquant

2023-03-31

Designed for the data science workflow of the tidyverse

Overview

The greatest benefit to tidyquant is the ability to apply the data science workflow to easily model and scale your financial analysis as described in R for Data Science. Scaling is the process of creating an analysis for one asset and then extending it to multiple groups. This idea of scaling is incredibly useful to financial analysts because typically one wants to compare many assets to make informed decisions. Fortunately, the tidyquant package integrates with the tidyverse making scaling super simple!

All tidyquant functions return data in the tibble (tidy data frame) format, which allows for interaction within the tidyverse. This means we can:

• Seamlessly scale data retrieval and mutations
• Use the pipe (%>%) for chaining operations
• Use dplyr and tidyr: select, filter, group_by, nest/unnest, spread/gather, etc
• Use purrr: mapping functions with map
• Model financial analysis using the data science workflow in R for Data Science

We’ll go through some useful techniques for getting and manipulating groups of data.

Prerequisites

Load the tidyquant package to get started.

# Loads tidyquant, lubridate, xts, quantmod, TTR, and PerformanceAnalytics
library(tidyverse)
library(tidyquant)  

1.0 Scaling the Getting of Financial Data

A very basic example is retrieving the stock prices for multiple stocks. There are three primary ways to do this:

Method 1: Map a character vector with multiple stock symbols

c("AAPL", "GOOG", "FB") %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 504 x 8
##    symbol date        open  high   low close    volume adjusted
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 AAPL   2016-01-04  25.7  26.3  25.5  26.3 270597600     24.1
##  2 AAPL   2016-01-05  26.4  26.5  25.6  25.7 223164000     23.5
##  3 AAPL   2016-01-06  25.1  25.6  25.0  25.2 273829600     23.0
##  4 AAPL   2016-01-07  24.7  25.0  24.1  24.1 324377600     22.0
##  5 AAPL   2016-01-08  24.6  24.8  24.2  24.2 283192000     22.2
##  6 AAPL   2016-01-11  24.7  24.8  24.3  24.6 198957600     22.5
##  7 AAPL   2016-01-12  25.1  25.2  24.7  25.0 196616800     22.8
##  8 AAPL   2016-01-13  25.1  25.3  24.3  24.3 249758400     22.3
##  9 AAPL   2016-01-14  24.5  25.1  23.9  24.9 252680400     22.7
## 10 AAPL   2016-01-15  24.0  24.4  23.8  24.3 319335600     22.2
## # ... with 494 more rows

The output is a single level tibble with all or the stock prices in one tibble. The auto-generated column name is “symbol”, which can be pre-emptively renamed by giving the vector a name (e.g. stocks <- c("AAPL", "GOOG", "FB")) and then piping to tq_get.

Method 2: Map a tibble with stocks in first column

First, get a stock list in data frame format either by making the tibble or retrieving from tq_index / tq_exchange. The stock symbols must be in the first column.

Method 2A: Make a tibble

stock_list <- tibble(stocks = c("AAPL", "JPM", "CVX"),
industry = c("Technology", "Financial", "Energy"))
stock_list
## # A tibble: 3 x 2
##   stocks industry
##   <chr>  <chr>
## 1 AAPL   Technology
## 2 JPM    Financial
## 3 CVX    Energy

Second, send the stock list to tq_get. Notice how the symbol and industry columns are automatically expanded the length of the stock prices.

stock_list %>%
tq_get(get = "stock.prices", from = "2016-01-01", to = "2017-01-01")
## # A tibble: 756 x 9
##    stocks industry   date        open  high   low close    volume adjusted
##    <chr>  <chr>      <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 AAPL   Technology 2016-01-04  25.7  26.3  25.5  26.3 270597600     24.1
##  2 AAPL   Technology 2016-01-05  26.4  26.5  25.6  25.7 223164000     23.5
##  3 AAPL   Technology 2016-01-06  25.1  25.6  25.0  25.2 273829600     23.0
##  4 AAPL   Technology 2016-01-07  24.7  25.0  24.1  24.1 324377600     22.0
##  5 AAPL   Technology 2016-01-08  24.6  24.8  24.2  24.2 283192000     22.2
##  6 AAPL   Technology 2016-01-11  24.7  24.8  24.3  24.6 198957600     22.5
##  7 AAPL   Technology 2016-01-12  25.1  25.2  24.7  25.0 196616800     22.8
##  8 AAPL   Technology 2016-01-13  25.1  25.3  24.3  24.3 249758400     22.3
##  9 AAPL   Technology 2016-01-14  24.5  25.1  23.9  24.9 252680400     22.7
## 10 AAPL   Technology 2016-01-15  24.0  24.4  23.8  24.3 319335600     22.2
## # ... with 746 more rows

Method 2B: Use index or exchange

Get an index…

tq_index("DOW")
## # A tibble: 30 x 8
##    symbol company                    ident~1 sedol weight sector share~2 local~3
##    <chr>  <chr>                      <chr>   <chr>  <dbl> <chr>    <dbl> <chr>
##  1 UNH    UnitedHealth Group Incorp~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD
##  2 GS     Goldman Sachs Group Inc.   38141G~ 2407~ 0.0644 Finan~ 5616168 USD
##  3 HD     Home Depot Inc.            437076~ 2434~ 0.0572 Consu~ 5616168 USD
##  4 MSFT   Microsoft Corporation      594918~ 2588~ 0.0570 Infor~ 5616168 USD
##  5 MCD    McDonald's Corporation     580135~ 2550~ 0.0557 Consu~ 5616168 USD
##  6 AMGN   Amgen Inc.                 031162~ 2023~ 0.0484 Healt~ 5616168 USD
##  7 CAT    Caterpillar Inc.           149123~ 2180~ 0.0450 Indus~ 5616168 USD
##  8 V      Visa Inc. Class A          92826C~ B2PZ~ 0.0446 Finan~ 5616168 USD
##  9 BA     Boeing Company             097023~ 2108~ 0.0423 Indus~ 5616168 USD
## 10 CRM    Salesforce Inc.            79466L~ 2310~ 0.0394 Infor~ 5616168 USD
## # ... with 20 more rows, and abbreviated variable names 1: identifier,
## #   2: shares_held, 3: local_currency

…or, get an exchange.

tq_exchange("NYSE")

Send the index or exchange to tq_get. Important Note: This can take several minutes depending on the size of the index or exchange, which is why only the first three stocks are evaluated in the vignette.

tq_index("DOW") %>%
slice(1:3) %>%
tq_get(get = "stock.prices")
## # A tibble: 7,737 x 15
##    symbol company   ident~1 sedol weight sector share~2 local~3 date        open
##    <chr>  <chr>     <chr>   <chr>  <dbl> <chr>    <dbl> <chr>   <date>     <dbl>
##  1 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-02  54.9
##  2 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-03  53.4
##  3 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-04  52.2
##  4 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-07  51.9
##  5 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-08  52
##  6 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-09  51.6
##  7 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-10  52.5
##  8 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-11  53.4
##  9 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-14  52.9
## 10 UNH    UnitedHe~ 91324P~ 2917~ 0.0943 Healt~ 5616168 USD     2013-01-15  53.2
## # ... with 7,727 more rows, 5 more variables: high <dbl>, low <dbl>,
## #   close <dbl>, volume <dbl>, adjusted <dbl>, and abbreviated variable names
## #   1: identifier, 2: shares_held, 3: local_currency

You can use any applicable “getter” to get data for every stock in an index or an exchange! This includes: “stock.prices”, “key.ratios”, “key.stats”, and more.

2.0 Scaling the Mutation of Financial Data

Once you get the data, you typically want to do something with it. You can easily do this at scale. Let’s get the yearly returns for multiple stocks using tq_transmute. First, get the prices. We’ll use the FANG data set, but you typically will use tq_get to retrieve data in “tibble” format.

data("FANG")

FANG
## # A tibble: 4,032 x 8
##    symbol date        open  high   low close    volume adjusted
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 FB     2013-01-02  27.4  28.2  27.4  28    69846400     28
##  2 FB     2013-01-03  27.9  28.5  27.6  27.8  63140600     27.8
##  3 FB     2013-01-04  28.0  28.9  27.8  28.8  72715400     28.8
##  4 FB     2013-01-07  28.7  29.8  28.6  29.4  83781800     29.4
##  5 FB     2013-01-08  29.5  29.6  28.9  29.1  45871300     29.1
##  6 FB     2013-01-09  29.7  30.6  29.5  30.6 104787700     30.6
##  7 FB     2013-01-10  30.6  31.5  30.3  31.3  95316400     31.3
##  8 FB     2013-01-11  31.3  32.0  31.1  31.7  89598000     31.7
##  9 FB     2013-01-14  32.1  32.2  30.6  31.0  98892800     31.0
## 10 FB     2013-01-15  30.6  31.7  29.9  30.1 173242600     30.1
## # ... with 4,022 more rows

Second, use group_by to group by stock symbol. Third, apply the mutation. We can do this in one easy workflow. The periodReturn function is applied to each group of stock prices, and a new data frame was returned with the annual returns in the correct periodicity.

FANG_returns_yearly <- FANG %>%
group_by(symbol) %>%
mutate_fun = periodReturn,
period     = "yearly",
col_rename = "yearly.returns") 

Last, we can visualize the returns.

FANG_returns_yearly %>%
ggplot(aes(x = year(date), y = yearly.returns, fill = symbol)) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "FANG: Annual Returns",
subtitle = "Mutating at scale is quick and easy!",
y = "Returns", x = "", color = "") +
scale_y_continuous(labels = scales::percent) +
coord_flip() +
theme_tq() +
scale_fill_tq()

3.0 Modeling Financial Data using purrr

Eventually you will want to begin modeling (or more generally applying functions) at scale! One of the best features of the tidyverse is the ability to map functions to nested tibbles using purrr. From the Many Models chapter of “R for Data Science”, we can apply the same modeling workflow to financial analysis. Using a two step workflow:

1. Model a single stock
2. Scale to many stocks

Let’s go through an example to illustrate.

Example: Applying a Regression Model to Detect a Positive Trend

In this example, we’ll use a simple linear model to identify the trend in annual returns to determine if the stock returns are decreasing or increasing over time.

Analyze a Single Stock

First, let’s collect stock data with tq_get()

AAPL <- tq_get("AAPL", from = "2007-01-01", to = "2016-12-31")
AAPL
## # A tibble: 2,518 x 8
##    symbol date        open  high   low close     volume adjusted
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
##  1 AAPL   2007-01-03  3.08  3.09  2.92  2.99 1238319600     2.55
##  2 AAPL   2007-01-04  3.00  3.07  2.99  3.06  847260400     2.60
##  3 AAPL   2007-01-05  3.06  3.08  3.01  3.04  834741600     2.59
##  4 AAPL   2007-01-08  3.07  3.09  3.05  3.05  797106800     2.60
##  5 AAPL   2007-01-09  3.09  3.32  3.04  3.31 3349298400     2.81
##  6 AAPL   2007-01-10  3.38  3.49  3.34  3.46 2952880000     2.95
##  7 AAPL   2007-01-11  3.43  3.46  3.40  3.42 1440252800     2.91
##  8 AAPL   2007-01-12  3.38  3.40  3.33  3.38 1312690400     2.88
##  9 AAPL   2007-01-16  3.42  3.47  3.41  3.47 1244076400     2.95
## 10 AAPL   2007-01-17  3.48  3.49  3.39  3.39 1646260000     2.89
## # ... with 2,508 more rows

Next, come up with a function to help us collect annual log returns. The function below mutates the stock prices to period returns using tq_transmute(). We add the type = "log" and period = "monthly" arguments to ensure we retrieve a tibble of monthly log returns. Last, we take the mean of the monthly returns to get MMLR.

get_annual_returns <- function(stock.returns) {
stock.returns %>%
mutate_fun = periodReturn,
type       = "log",
period     = "yearly")
}

Let’s test get_annual_returns out. We now have the annual log returns over the past ten years.

AAPL_annual_log_returns <- get_annual_returns(AAPL)
AAPL_annual_log_returns
## # A tibble: 10 x 2
##    date       yearly.returns
##    <date>              <dbl>
##  1 2007-12-31         0.860
##  2 2008-12-31        -0.842
##  3 2009-12-31         0.904
##  4 2010-12-31         0.426
##  5 2011-12-30         0.228
##  6 2012-12-31         0.282
##  7 2013-12-31         0.0776
##  8 2014-12-31         0.341
##  9 2015-12-31        -0.0306
## 10 2016-12-30         0.118

Let’s visualize to identify trends. We can see from the linear trend line that AAPL’s stock returns are declining.

AAPL_annual_log_returns %>%
ggplot(aes(x = year(date), y = yearly.returns)) +
geom_hline(yintercept = 0, color = palette_light()[[1]]) +
geom_point(size = 2, color = palette_light()[[3]]) +
geom_line(size = 1, color = palette_light()[[3]]) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "AAPL: Visualizing Trends in Annual Returns",
x = "", y = "Annual Returns", color = "") +
theme_tq()

Now, we can get the linear model using the lm() function. However, there is one problem: the output is not “tidy”.

mod <- lm(yearly.returns ~ year(date), data = AAPL_annual_log_returns)
mod
##
## Call:
## lm(formula = yearly.returns ~ year(date), data = AAPL_annual_log_returns)
##
## Coefficients:
## (Intercept)   year(date)
##    58.86278     -0.02915

We can utilize the broom package to get “tidy” data from the model. There’s three primary functions:

1. augment: adds columns to the original data such as predictions, residuals and cluster assignments
2. glance: provides a one-row summary of model-level statistics
3. tidy: summarizes a model’s statistical findings such as coefficients of a regression

We’ll use tidy to retrieve the model coefficients.

library(broom)
tidy(mod)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  58.9     113.         0.520   0.617
## 2 year(date)   -0.0291    0.0562    -0.518   0.618

Adding to our workflow, we have the following:

get_model <- function(stock_data) {
annual_returns <- get_annual_returns(stock_data)
mod <- lm(yearly.returns ~ year(date), data = annual_returns)
tidy(mod)
}

Testing it out on a single stock. We can see that the “term” that contains the direction of the trend (the slope) is “year(date)”. The interpetation is that as year increases one unit, the annual returns decrease by 3%.

get_model(AAPL)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  58.9     113.         0.520   0.617
## 2 year(date)   -0.0291    0.0562    -0.518   0.618

Now that we have identified the trend direction, it looks like we are ready to scale.

Scale to Many Stocks

Once the analysis for one stock is done scale to many stocks is simple. For brevity, we’ll randomly sample ten stocks from the S&P500 with a call to dplyr::sample_n().

set.seed(10)
stocks_tbl <- tq_index("SP500") %>%
sample_n(5)
stocks_tbl
## # A tibble: 5 x 8
##   symbol company                    ident~1 sedol  weight sector share~2 local~3
##   <chr>  <chr>                      <chr>   <chr>   <dbl> <chr>    <dbl> <chr>
## 1 DXC    DXC Technology Co.         23355L~ BYXD~ 1.69e-4 Infor~ 2494036 USD
## 2 GD     General Dynamics Corporat~ 369550~ 2365~ 1.51e-3 Indus~ 2455028 USD
## 3 CNP    CenterPoint Energy Inc.    15189T~ 2440~ 5.46e-4 Utili~ 6867870 USD
## 4 AES    AES Corporation            00130H~ 2002~ 4.65e-4 Utili~ 7286997 USD
## 5 TAP    Molson Coors Beverage Com~ 60871R~ B067~ 2.85e-4 Consu~ 2038444 USD
## # ... with abbreviated variable names 1: identifier, 2: shares_held,
## #   3: local_currency

We can now apply our analysis function to the stocks using dplyr::mutate and purrr::map. The mutate() function adds a column to our tibble, and the map() function maps our custom get_model function to our tibble of stocks using the symbol column. The tidyr::unnest function unrolls the nested data frame so all of the model statistics are accessable in the top data frame level. The filter, arrange and select steps just manipulate the data frame to isolate and arrange the data for our viewing.

stocks_model_stats <- stocks_tbl %>%
select(symbol, company) %>%
tq_get(from = "2007-01-01", to = "2016-12-31") %>%

# Nest
group_by(symbol, company) %>%
nest() %>%

# Apply the get_model() function to the new "nested" data column
mutate(model = map(data, get_model)) %>%

# Unnest and collect slope
unnest(model) %>%
filter(term == "year(date)") %>%
arrange(desc(estimate)) %>%
select(-term)

stocks_model_stats
## # A tibble: 5 x 7
## # Groups:   symbol, company [5]
##   symbol company                        data     estim~1 std.e~2 stati~3 p.value
##   <chr>  <chr>                          <list>     <dbl>   <dbl>   <dbl>   <dbl>
## 1 DXC    DXC Technology Co.             <tibble>  0.0655  0.0440   1.49    0.175
## 2 GD     General Dynamics Corporation   <tibble>  0.0329  0.0243   1.35    0.213
## 3 AES    AES Corporation                <tibble>  0.0318  0.0446   0.714   0.496
## 4 CNP    CenterPoint Energy Inc.        <tibble>  0.0142  0.0223   0.637   0.542
## 5 TAP    Molson Coors Beverage Company~ <tibble>  0.0140  0.0184   0.759   0.470
## # ... with abbreviated variable names 1: estimate, 2: std.error, 3: statistic

We’re done! We now have the coefficient of the linear regression that tracks the direction of the trend line. We can easily extend this type of analysis to larger lists or stock indexes. For example, the entire S&P500 could be analyzed removing the sample_n() following the call to tq_index("SP500").

4.0 Error Handling when Scaling

Eventually you will run into a stock index, stock symbol, FRED data code, etc that cannot be retrieved. Possible reasons are:

• An index becomes out of date
• A company goes private
• A stock ticker symbol changes
• Yahoo / FRED just doesn’t like your stock symbol / FRED code

This becomes painful when scaling if the functions return errors. So, the tq_get() function is designed to handle errors gracefully. What this means is an NA value is returned when an error is generated along with a gentle error warning.

tq_get("XYZ", "stock.prices")
## [1] NA

Pros and Cons to Built-In Error-Handling

There are pros and cons to this approach that you may not agree with, but I believe helps in the long run. Just be aware of what happens:

• Pros: Long running scripts are not interrupted because of one error

• Cons: Errors can be inadvertently handled or flow downstream if the users does not read the warnings

Let’s see an example when using tq_get() to get the stock prices for a long list of stocks with one BAD APPLE. The argument complete_cases comes in handy. The default is TRUE, which removes “bad apples” so future analysis have complete cases to compute on. Note that a gentle warning stating that an error occurred and was dealt with by removing the rows from the results.

c("AAPL", "GOOG", "BAD APPLE") %>%
tq_get(get = "stock.prices", complete_cases = TRUE)
## Warning: There was 1 warning in dplyr::mutate().
## i In argument: data.. = purrr::map(...).
## Caused by warning:
## ! x = 'BAD APPLE', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "BAD APPLE", env = <environment>, : Unable to import "BAD APPLE".
## HTTP error 400.
##  Removing BAD APPLE.
## # A tibble: 5,158 x 8
##    symbol date        open  high   low close    volume adjusted
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 AAPL   2013-01-02  19.8  19.8  19.3  19.6 560518000     16.8
##  2 AAPL   2013-01-03  19.6  19.6  19.3  19.4 352965200     16.6
##  3 AAPL   2013-01-04  19.2  19.2  18.8  18.8 594333600     16.2
##  4 AAPL   2013-01-07  18.6  18.9  18.4  18.7 484156400     16.1
##  5 AAPL   2013-01-08  18.9  19.0  18.6  18.8 458707200     16.1
##  6 AAPL   2013-01-09  18.7  18.8  18.4  18.5 407604400     15.9
##  7 AAPL   2013-01-10  18.9  18.9  18.4  18.7 601146000     16.1
##  8 AAPL   2013-01-11  18.6  18.8  18.5  18.6 350506800     16.0
##  9 AAPL   2013-01-14  18.0  18.1  17.8  17.9 734207600     15.4
## 10 AAPL   2013-01-15  17.8  17.8  17.3  17.4 876772400     14.9
## # ... with 5,148 more rows

Now switching complete_cases = FALSE will retain any errors as NA values in a nested data frame. Notice that the error message and output change. The error message now states that the NA values exist in the output and the return is a “nested” data structure.

c("AAPL", "GOOG", "BAD APPLE") %>%
tq_get(get = "stock.prices", complete_cases = FALSE)
## Warning: There was 1 warning in dplyr::mutate().
## i In argument: data.. = purrr::map(...).
## Caused by warning:
## ! x = 'BAD APPLE', get = 'stock.prices': Error in getSymbols.yahoo(Symbols = "BAD APPLE", env = <environment>, : Unable to import "BAD APPLE".
## HTTP error 400.
## # A tibble: 5,159 x 8
##    symbol date        open  high   low close    volume adjusted
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 AAPL   2013-01-02  19.8  19.8  19.3  19.6 560518000     16.8
##  2 AAPL   2013-01-03  19.6  19.6  19.3  19.4 352965200     16.6
##  3 AAPL   2013-01-04  19.2  19.2  18.8  18.8 594333600     16.2
##  4 AAPL   2013-01-07  18.6  18.9  18.4  18.7 484156400     16.1
##  5 AAPL   2013-01-08  18.9  19.0  18.6  18.8 458707200     16.1
##  6 AAPL   2013-01-09  18.7  18.8  18.4  18.5 407604400     15.9
##  7 AAPL   2013-01-10  18.9  18.9  18.4  18.7 601146000     16.1
##  8 AAPL   2013-01-11  18.6  18.8  18.5  18.6 350506800     16.0
##  9 AAPL   2013-01-14  18.0  18.1  17.8  17.9 734207600     15.4
## 10 AAPL   2013-01-15  17.8  17.8  17.3  17.4 876772400     14.9
## # ... with 5,149 more rows

In both cases, the prudent user will review the warnings to determine what happened and whether or not this is acceptable. In the complete_cases = FALSE example, if the user attempts to perform downstream computations at scale, the computations will likely fail grinding the analysis to a hault. But, the advantage is that the user will more easily be able to filter to the problem childs to determine what happened and decide whether this is acceptable or not.