This vignette provides an introduction on how to fit distributional
regression models with **brms**. We use the term
*distributional model* to refer to a model, in which we can
specify predictor terms for all parameters of the assumed response
distribution. In the vast majority of regression model implementations,
only the location parameter (usually the mean) of the response
distribution depends on the predictors and corresponding regression
parameters. Other parameters (e.g., scale or shape parameters) are
estimated as auxiliary parameters assuming them to be constant across
observations. This assumption is so common that most researchers
applying regression models are often (in my experience) not aware of the
possibility of relaxing it. This is understandable insofar as relaxing
this assumption drastically increase model complexity and thus makes
models hard to fit. Fortunately, **brms** uses
**Stan** on the backend, which is an incredibly flexible
and powerful tool for estimating Bayesian models so that model
complexity is much less of an issue.

Suppose we have a normally distributed response variable. Then, in
basic linear regression, we specify a predictor term \(\eta_{\mu}\) for the mean parameter \(\mu\) of the normal distribution. The
second parameter of the normal distribution – the residual standard
deviation \(\sigma\) – is assumed to be
constant across observations. We estimate \(\sigma\) but do not try to *predict*
it. In a distributional model, however, we do exactly this by specifying
a predictor term \(\eta_{\sigma}\) for
\(\sigma\) in addition to the predictor
term \(\eta_{\mu}\). Ignoring
group-level effects for the moment, the linear predictor of a parameter
\(\theta\) for observation \(n\) has the form

\[\eta_{\theta n} = \sum_{i = 1}^{K_{\theta}} b_{\theta i} x_{\theta i n}\] where \(x_{\theta i n}\) denotes the value of the \(i\)th predictor of parameter \(\theta\) for observation \(n\) and \(b_{\theta i}\) is the \(i\)th regression coefficient of parameter \(\theta\). A distributional normal model with response variable \(y\) can then be written as

\[y_n \sim \mathcal{N}\left(\eta_{\mu n}, \, \exp(\eta_{\sigma n}) \right)\] We used the exponential function around \(\eta_{\sigma}\) to reflect that \(\sigma\) constitutes a standard deviation and thus only takes on positive values, while a linear predictor can be any real number.

Unequal variance models are possibly the most simple, but nevertheless very important application of distributional models. Suppose we have two groups of patients: One group receives a treatment (e.g., an antidepressive drug) and another group receives placebo. Since the treatment may not work equally well for all patients, the symptom variance of the treatment group may be larger than the symptom variance of the placebo group after some weeks of treatment. For simplicity, assume that we only investigate the post-treatment values.

```
group <- rep(c("treat", "placebo"), each = 30)
symptom_post <- c(rnorm(30, mean = 1, sd = 2), rnorm(30, mean = 0, sd = 1))
dat1 <- data.frame(group, symptom_post)
head(dat1)
```

```
group symptom_post
1 treat 0.7909655
2 treat 0.7123066
3 treat 2.2067510
4 treat 4.6125175
5 treat 2.2597224
6 treat 1.5266883
```

The following model estimates the effect of `group`

on
both the mean and the residual standard deviation of the normal response
distribution.

Useful summary statistics and plots can be obtained via

The population-level effect `sigma_grouptreat`

, which is
the contrast of the two residual standard deviations on the log-scale,
reveals that the variances of both groups are indeed different. This
impression is confirmed when looking at the
`conditional_effects`

of `group`

. Going one step
further, we can compute the residual standard deviations on the original
scale using the `hypothesis`

method.

```
hyp <- c("exp(sigma_Intercept) = 0",
"exp(sigma_Intercept + sigma_grouptreat) = 0")
hypothesis(fit1, hyp)
```

```
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (exp(sigma_Interc... = 0 0.93 0.13 0.72 1.23 NA NA *
2 (exp(sigma_Interc... = 0 2.61 0.34 2.06 3.40 NA NA *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
```

We may also directly compare them and plot the posterior distribution of their difference.

```
hyp <- "exp(sigma_Intercept + sigma_grouptreat) > exp(sigma_Intercept)"
(hyp <- hypothesis(fit1, hyp))
```

```
Hypothesis Tests for class b:
Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (exp(sigma_Interc... > 0 1.68 0.36 1.13 2.32 Inf 1 *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
```

Indeed, the residual standard deviation of the treatment group seems to larger than that of the placebo group. Moreover the magnitude of this difference is pretty similar to what we expected due to the values we put into the data simulations.

Another important application of the distributional regression framework are so called zero-inflated models. These models are helpful whenever there are more zeros in the response variable than one would naturally expect. For example, if one seeks to predict the number of cigarettes people smoke per day and also includes non-smokers, there will be a huge amount of zeros which, when not modeled appropriately, can seriously distort parameter estimates. Here, we consider an example dealing with the number of fish caught by various groups of people. On the UCLA website (), the data are described as follows: “The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.”

```
nofish livebait camper persons child xb zg count
1 1 0 0 1 0 -0.8963146 3.0504048 0
2 0 1 1 1 0 -0.5583450 1.7461489 0
3 0 1 0 1 0 -0.4017310 0.2799389 0
4 0 1 1 2 1 -0.9562981 -0.6015257 0
5 0 1 0 1 0 0.4368910 0.5277091 1
6 0 1 1 4 2 1.3944855 -0.7075348 0
```

As predictors we choose the number of people per group, the number of children, as well as whether the group consists of campers. Many groups may not even try catching any fish at all (thus leading to many zero responses) and so we fit a zero-inflated Poisson model to the data. For now, we assume a constant zero-inflation probability across observations.

Again, we summarize the results using the usual methods.

```
Family: zero_inflated_poisson
Links: mu = log; zi = identity
Formula: count ~ persons + child + camper
Data: zinb (Number of observations: 250)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -1.01 0.18 -1.36 -0.66 1.00 2685 2154
persons 0.88 0.05 0.79 0.96 1.00 2711 2450
child -1.36 0.09 -1.55 -1.18 1.00 2636 2486
camper 0.80 0.09 0.62 0.99 1.00 2969 2816
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
zi 0.41 0.05 0.32 0.50 1.00 2825 2535
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

According to the parameter estimates, larger groups catch more fish,
campers catch more fish than non-campers, and groups with more children
catch less fish. The zero-inflation probability `zi`

is
pretty large with a mean of 41%. Please note that the probability of
catching no fish is actually higher than 41%, but parts of this
probability are already modeled by the Poisson distribution itself
(hence the name zero-*inflation*). If you want to treat all zeros
as originating from a separate process, you can use hurdle models
instead (not shown here).

Now, we try to additionally predict the zero-inflation probability by the number of children. The underlying reasoning is that we expect groups with more children to not even try catching fish. Most children are just terribly bad at waiting for hours until something happens. From a purely statistical perspective, zero-inflated (and hurdle) distributions are a mixture of two processes and predicting both parts of the model is natural and often very reasonable to make full use of the data.

```
fit_zinb2 <- brm(bf(count ~ persons + child + camper, zi ~ child),
data = zinb, family = zero_inflated_poisson())
```

```
Family: zero_inflated_poisson
Links: mu = log; zi = logit
Formula: count ~ persons + child + camper
zi ~ child
Data: zinb (Number of observations: 250)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -1.08 0.18 -1.44 -0.74 1.00 2447 2539
zi_Intercept -0.95 0.25 -1.48 -0.50 1.00 3577 2454
persons 0.89 0.05 0.80 0.98 1.00 2568 2230
child -1.18 0.09 -1.36 -1.00 1.00 2898 2676
camper 0.78 0.09 0.60 0.96 1.00 3203 2716
zi_child 1.22 0.27 0.70 1.77 1.00 3846 2652
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

To transform the linear predictor of `zi`

into a
probability, **brms** applies the logit-link:

\[logit(zi) = \log\left(\frac{zi}{1-zi}\right) = \eta_{zi}\]

The logit-link takes values within \([0, 1]\) and returns values on the real line. Thus, it allows the transition between probabilities and linear predictors.

According to the model, trying to fish with children not only decreases the overall number fish caught (as implied by the Poisson part of the model) but also drastically increases your change of catching no fish at all (as implied by the zero-inflation part) most likely because groups with more children are not even trying.

In the examples so far, we did not have multilevel data and thus did
not fully use the capabilities of the distributional regression
framework of **brms**. In the example presented below, we
will not only show how to deal with multilevel data in distributional
models, but also how to incorporate smooth terms (i.e., splines) into
the model. In many applications, we have no or only a very vague idea
how the relationship between a predictor and the response looks like. A
very flexible approach to tackle this problems is to use splines and let
them figure out the form of the relationship. For illustration purposes,
we simulate some data with the **mgcv** package, which is
also used in **brms** to prepare smooth terms.

`Gu & Wahba 4 term additive model`

```
y x0 x1 x2 x3 f
1 10.298249 0.83315952 0.4099156 0.4295363 0.37194472 9.796595
2 14.386518 0.38234560 0.1369489 0.5086811 0.06183478 11.918901
3 13.524314 0.32071356 0.1368596 0.5304031 0.23917648 14.788755
4 18.044075 0.59229048 0.6740755 0.7097616 0.94099180 20.534928
5 8.562205 0.08678786 0.5179809 0.4422262 0.76987721 9.638897
6 9.395546 0.47027669 0.1973141 0.8559711 0.05406273 9.797828
```

The data contains the predictors `x0`

to `x3`

as well as the grouping factor `fac`

indicating the nested
structure of the data. We predict the response variable `y`

using smooth terms of `x1`

and `x2`

and a varying
intercept of `fac`

. In addition, we assume the residual
standard deviation `sigma`

to vary by a smoothing term of
`x0`

and a varying intercept of `fac`

.

```
fit_smooth1 <- brm(
bf(y ~ s(x1) + s(x2) + (1|fac), sigma ~ s(x0) + (1|fac)),
data = dat_smooth, family = gaussian(),
chains = 2, control = list(adapt_delta = 0.95)
)
```

```
Family: gaussian
Links: mu = identity; sigma = log
Formula: y ~ s(x1) + s(x2) + (1 | fac)
sigma ~ s(x0) + (1 | fac)
Data: dat_smooth (Number of observations: 200)
Draws: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 2000
Smooth Terms:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(sx1_1) 4.70 2.66 1.19 11.04 1.00 678 1025
sds(sx2_1) 21.66 5.70 13.41 35.51 1.00 743 1005
sds(sigma_sx0_1) 0.75 0.82 0.02 3.00 1.01 752 1172
Group-Level Effects:
~fac (Number of levels: 4)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 4.82 2.18 2.22 10.37 1.00 932 1264
sd(sigma_Intercept) 0.22 0.33 0.01 1.12 1.00 438 338
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 15.22 2.21 10.82 19.65 1.00 819 1185
sigma_Intercept 0.72 0.24 0.33 1.03 1.00 427 324
sx1_1 17.18 7.97 3.59 35.48 1.00 1268 1035
sx2_1 62.59 15.37 32.06 94.09 1.00 1630 1507
sigma_sx0_1 -0.91 1.86 -5.63 2.23 1.01 956 651
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```

This model is likely an overkill for the data at hand, but nicely
demonstrates the ease with which one can specify complex models with
**brms** and to fit them using **Stan** on the
backend.