class: center, middle, inverse, title-slide

# An introduction to Bayesian multilevel models using R, brms, and Stan
### Ladislas Nalborczyk
### Univ. Grenoble Alpes, CNRS, LPNC
### 28.11.2019

---

# Overview

1. Theoretical background
    + What is Bayesian inference?
    + What is a multilevel model?
    + Introducing the brms package

2. Practical part / tutorial
    + Worked example #1: modelling proportion data with a Bayesian multilevel GLM (BMLGLM?)
    + Worked example #2: fitting a Bayesian three-level meta-analytic model

---

# Who am I?

I was recently awarded a PhD in experimental psychology / cognitive science. My Bayesian adventure started in 2015, when I read [Kruschke’s book](https://sites.google.com/site/doingbayesiandataanalysis/) "Doing Bayesian Data Analysis" and started to use this approach in my research (e.g., see this recent [preprint](https://psyarxiv.com/8v5yd/)).

Since then, I have read a few more books (e.g., [Noël, 2010](https://laboutique.edpsciences.fr/produit/773/9782759817566/Psychologie%20statistique%20avec%20R); [Gelman et al., 2013](http://www.stat.columbia.edu/~gelman/book/); [McElreath, 2015](https://xcelab.net/rm/statistical-rethinking/)), given a doctoral course at UGA with [Thierry Phénix](https://scholar.google.fr/citations?user=2dMiOKEAAAAJ&hl=fr) (2017-2019), and given a workshop at UGent (2018) with [Paul Bürkner](https://paul-buerkner.github.io/about/) (`brms`'s father), with whom I wrote a tutorial paper about Bayesian multilevel models using `brms` ([Nalborczyk et al., 2019](https://psyarxiv.com/guhsa/)) and a commentary about confidence and credibility intervals ([Nalborczyk, Bürkner & Williams, 2019](https://www.collabra.org/articles/10.1525/collabra.197/)). I also gave a few presentations and wrote a few [blogposts](https://www.barelysignificant.com).

---

class: middle, center

# Theoretical background - Bayesian inference

---

# Exercise - The marbles bag (from McElreath, 2015)

Let's say we have a bag containing four marbles.
These marbles come in two colours: blue and white. We know there are four marbles in the bag, but we don't know how many of them are blue...

--

We know there are five possibilities (we call these possibilities our *hypotheses*):

<br>

<p align = "center"> ⚪ ⚪ ⚪ ⚪</p>
<p align = "center"> 🔵 ⚪ ⚪ ⚪</p>
<p align = "center">🔵 🔵 ⚪ ⚪</p>
<p align = "center">🔵 🔵 🔵 ⚪</p>
<p align = "center">🔵 🔵 🔵 🔵</p>

---

# Exercise - The marbles bag (from McElreath, 2015)

Our goal is to determine which hypothesis is the most plausible, given some **evidence** about the content of the bag. To obtain evidence, we can pull some marbles from the bag (with replacement). We do this three times and obtain the following sequence:

<p align = "center">🔵 ⚪ 🔵</p>

--

This sequence represents some evidence about the content of the bag (in other words, our data). From there, what **inference** can we reasonably make about the content of the bag? What can we say about the relative plausibility of each hypothesis?

---

# Counting possibilities

<p align = "center"> hypothesis: 🔵 ⚪ ⚪ ⚪ &emsp;&emsp;&emsp;&emsp; data: 🔵 </p>

<img src="RGUG2019_files/figure-html/unnamed-chunk-1-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Counting possibilities

<p align = "center"> hypothesis: 🔵 ⚪ ⚪ ⚪ &emsp;&emsp;&emsp;&emsp; data: 🔵 ⚪</p>

<img src="RGUG2019_files/figure-html/unnamed-chunk-2-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Counting possibilities

<p align = "center"> hypothesis: 🔵 ⚪ ⚪ ⚪ &emsp;&emsp;&emsp;&emsp; data: 🔵 ⚪ 🔵</p>

<img src="RGUG2019_files/figure-html/unnamed-chunk-3-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Counting possibilities

<p align = "center"> hypothesis: 🔵 ⚪ ⚪ ⚪ &emsp;&emsp;&emsp;&emsp; data: 🔵 ⚪ 🔵 </p>

<img src="RGUG2019_files/figure-html/unnamed-chunk-4-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Counting possibilities

Under this hypothesis, `\(3\)` paths (out of `\(4^{3}=64\)`) are consistent with the data.
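This count can be checked by brute force in R, enumerating all `\(4^{3} = 64\)` possible sequences of three draws (a sketch, assuming blue and white marbles are labelled "B" and "W"):

```r
# Hypothesis: one blue marble among four
bag <- c("B", "W", "W", "W")

# All 4^3 = 64 possible sequences of three draws (with replacement)
draws <- expand.grid(first = bag, second = bag, third = bag)

# Count the sequences consistent with the observed data: blue, white, blue
sum(draws$first == "B" & draws$second == "W" & draws$third == "B")  # 3
```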
What about the other hypotheses?

<p align = "center"> ⚪ ⚪ ⚪ 🔵 &emsp;&emsp; ⚪ 🔵 🔵 🔵 &emsp;&emsp; ⚪ ⚪ 🔵 🔵 </p>

<img src="RGUG2019_files/figure-html/unnamed-chunk-5-1.svg" width="40%" style="display: block; margin: auto;" />

---

# Model comparison (comparing hypotheses)

Given some data, the most *plausible* hypothesis is the one that **maximises** the number of ways these data could have been produced.

<center>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg .tg-gq9a{font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;text-align:center;font-size:20px;font-weight:normal;padding:5px 5px 5px 5px}
</style>
<table class="tg" width="400" height="400">
<tr> <th class="tg-gq9a">Hypothesis</th> <th class="tg-gq9a">Ways of producing the data</th> </tr>
<tr> <td class="tg-gq9a"> <p> ⚪ ⚪ ⚪ ⚪ </p></td> <td class="tg-gq9a"> 0 x 4 x 0 = 0</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 ⚪ ⚪ ⚪ </p></td> <td class="tg-gq9a">1 x 3 x 1 = 3</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 ⚪ ⚪ </p></td> <td class="tg-gq9a">2 x 2 x 2 = 8</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 🔵 ⚪ </p></td> <td class="tg-gq9a">3 x 1 x 3 = 9</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 🔵 🔵 </p></td> <td class="tg-gq9a">4 x 0 x 4 = 0</td> </tr>
</table>
</center>

---

# Accumulating evidence

We previously considered that all hypotheses were a priori equiprobable (cf. the [principle of indifference](https://en.wikipedia.org/wiki/Principle_of_indifference)).

--

However, we could have prior knowledge, coming from our beliefs, from previous data, or from knowledge about usual bags of this kind.

--

Let's say we pull another marble out of the bag. How can we incorporate this new information into our previous analysis?
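The arithmetic of this update is a single element-wise multiplication, sketched below with the counts computed above (assuming the new draw turns out to be a blue marble):

```r
previous <- c(0, 3, 8, 9, 0)  # ways for each hypothesis to produce the first three draws
new_blue <- c(0, 1, 2, 3, 4)  # ways for each hypothesis to produce one more blue marble
previous * new_blue           # updated counts: 0 3 16 27 0
```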
---

# Accumulating evidence

We can update our previous counts by multiplying them by the new information, following the famous Bayesian dictum: "*Yesterday's posterior is today's prior*" ([Lindley, 2000](http://www.phil.vt.edu/dmayo/personal_website/Lindley_Philosophy_of_Statistics.pdf)).

<center>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg .tg-gq9a{font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;text-align:center;font-size:20px;font-weight:normal;padding:5px 5px 5px 5px}
</style>
<table class="tg">
<tr> <th class="tg-gq9a">Hypothesis</th> <th class="tg-gq9a"><p>Ways of producing 🔵</p></th> <th class="tg-gq9a">Previous counts</th> <th class="tg-gq9a">New count</th> </tr>
<tr> <td class="tg-gq9a"><p> ⚪ ⚪ ⚪ ⚪ </p></td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0 x 0 = 0</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 ⚪ ⚪ ⚪ </p></td> <td class="tg-gq9a">1</td> <td class="tg-gq9a">3</td> <td class="tg-gq9a">3 x 1 = 3</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 ⚪ ⚪ </p></td> <td class="tg-gq9a">2</td> <td class="tg-gq9a">8</td> <td class="tg-gq9a">8 x 2 = 16</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 🔵 ⚪ </p></td> <td class="tg-gq9a">3</td> <td class="tg-gq9a">9</td> <td class="tg-gq9a">9 x 3 = 27</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 🔵 🔵 </p></td> <td class="tg-gq9a">4</td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0 x 4 = 0</td> </tr>
</table>
</center>

---

# Using prior information

Someone from the marble factory tells us that blue marbles are rare. For every bag containing three blue marbles, they make two bags that contain only two blue marbles, and three bags that contain only one blue marble. They also tell us that every bag contains at least one marble of each colour.
<center>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg .tg-gq9a{font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;text-align:center;font-size:20px;font-weight:normal;padding:5px 5px 5px 5px}
</style>
<table class="tg">
<tr> <th class="tg-gq9a">Hypothesis</th> <th class="tg-gq9a"><p>Previous counts</p></th> <th class="tg-gq9a">Factory count</th> <th class="tg-gq9a">New count</th> </tr>
<tr> <td class="tg-gq9a"><p> ⚪ ⚪ ⚪ ⚪ </p></td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0 x 0 = 0</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 ⚪ ⚪ ⚪ </p></td> <td class="tg-gq9a">3</td> <td class="tg-gq9a">3</td> <td class="tg-gq9a">3 x 3 = 9</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 ⚪ ⚪ </p></td> <td class="tg-gq9a">16</td> <td class="tg-gq9a">2</td> <td class="tg-gq9a">16 x 2 = 32</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 🔵 ⚪ </p></td> <td class="tg-gq9a">27</td> <td class="tg-gq9a">1</td> <td class="tg-gq9a">27 x 1 = 27</td> </tr>
<tr> <td class="tg-gq9a"><p> 🔵 🔵 🔵 🔵 </p></td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0 x 0 = 0</td> </tr>
</table>
</center>

---

# From counts to probability

The plausibility of a hypothesis after seeing some data is proportional to the number of ways this hypothesis can "produce" the data, multiplied by its *a priori* plausibility.

`$$p(hypothesis|data)\propto p(data|hypothesis) \times p(hypothesis)$$`

--

From there, we construct probabilities by standardising these plausibilities so that the sum of all plausibilities is equal to 1.

`$$p(hypothesis|data) = \frac{p(data|hypothesis) \times p(hypothesis)}{sum\ of\ products}$$`

---

# From counts to probability

Let's define `\(p\)` as the proportion of blue marbles.
<center>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg .tg-gq9a{font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;text-align:center;font-size:20px;font-weight:normal;padding:5px 5px 5px 5px}
</style>
<table class="tg">
<tr> <th class="tg-gq9a">Hypothesis</th> <th class="tg-gq9a">p</th> <th class="tg-gq9a">Ways of producing the data</th> <th class="tg-gq9a">Probability</th> </tr>
<tr> <td class="tg-gq9a"> ⚪ ⚪ ⚪ ⚪ </td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0</td> </tr>
<tr> <td class="tg-gq9a"> 🔵 ⚪ ⚪ ⚪ </td> <td class="tg-gq9a">0.25</td> <td class="tg-gq9a">3</td> <td class="tg-gq9a">0.15</td> </tr>
<tr> <td class="tg-gq9a"> 🔵 🔵 ⚪ ⚪ </td> <td class="tg-gq9a">0.5</td> <td class="tg-gq9a">8</td> <td class="tg-gq9a">0.40</td> </tr>
<tr> <td class="tg-gq9a"> 🔵 🔵 🔵 ⚪ </td> <td class="tg-gq9a">0.75</td> <td class="tg-gq9a">9</td> <td class="tg-gq9a">0.45</td> </tr>
<tr> <td class="tg-gq9a"> 🔵 🔵 🔵 🔵 </td> <td class="tg-gq9a">1</td> <td class="tg-gq9a">0</td> <td class="tg-gq9a">0</td> </tr>
</table>
</center>

```r
ways <- c(0, 3, 8, 9, 0)
(probability <- ways / sum(ways) )
```

```
## [1] 0.00 0.15 0.40 0.45 0.00
```

---

# Notations, terminology

- `\(\theta\)`: a parameter or vector of parameters (e.g., the proportion of blue marbles).

- `\(\color{orangered}{p(x\vert \theta)}\)`<span style="color:orangered">: the conditional probability of the data `\(x\)` given `\(\theta\)`.
Once we know `\(x\)`, it can be seen as the likelihood function of `\(\theta\)`.</span>

- `\(\color{steelblue}{p(\theta)}\)`<span style="color:steelblue">: the prior probability of `\(\theta\)`.</span>

- `\(\color{purple}{p(\theta \vert x)}\)`<span style="color:purple">: the posterior probability of `\(\theta\)` (given `\(x\)`).</span>

- `\(\color{green}{p(x)}\)`<span style="color:green">: the marginal probability of `\(x\)` (over `\(\theta\)`).</span>

<br>

`$$\color{purple}{p(\theta \vert x)} = \dfrac{\color{orangered}{p(x\vert \theta)} \color{steelblue}{p(\theta)}}{\color{green}{p(x)}} = \dfrac{\color{orangered}{p(x\vert \theta)} \color{steelblue}{p(\theta)}}{\color{green}{\sum\limits_{\theta}p(x|\theta)p(\theta)}} = \dfrac{\color{orangered}{p(x\vert \theta)} \color{steelblue}{p(\theta)}}{\color{green}{\int\limits_{\theta}p(x|\theta)p(\theta)\mathrm{d}\theta}} \propto \color{orangered}{p(x\vert \theta)} \color{steelblue}{p(\theta)}$$`

---

# Probability axioms (Kolmogorov, 1933)

A probability is **a numerical value** assigned to an event `\(A\)`, this event being one possibility from a set `\(\Omega\)` (the set of all possible events). Probabilities conform to the following axioms (the third axiom is only valid for mutually exclusive events):

+ **Non-negativity.** `\(P(A_{i}) \geq 0\)`
+ **Normalisation.** `\(P(\Omega) = 1\)`
+ **Additivity.** `\(P(A_{1} \lor A_{2}) = P(A_{1}) + P(A_{2})\)`

--

This last axiom is also known as the **sum rule**, and can be generalised to non-mutually exclusive events: `\(P(A_{1} \lor A_{2}) = P(A_{1}) + P(A_{2}) - P(A_{1} \land A_{2})\)`.
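The generalised sum rule can be checked on a toy sample space, here a fair eight-sided die (a sketch; the two events are arbitrary choices for illustration):

```r
omega <- 1:8  # sample space: a fair eight-sided die
p <- function(event) length(event) / length(omega)

A <- c(2, 4, 6, 8)  # event "even number"
B <- 5:8            # event "at least 5"

# P(A or B) = P(A) + P(B) - P(A and B)
p(union(A, B)) == p(A) + p(B) - p(intersect(A, B))  # TRUE
```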
---

# Deriving Bayes theorem

From these axioms (and the definitions of joint, marginal and conditional probability), we can derive the **product rule**, which states that:

`$$p(x,y) = p(x|y)p(y) = p(y|x)p(x)$$`

--

Then, restating the above equality,

`$$p(y|x)p(x) = p(x|y)p(y)$$`

--

And dividing each side by `\(p(x)\)` gives Bayes theorem, which can be written as:

`$$p(y|x) = \dfrac{p(x|y)p(y)}{p(x)}$$`

--

Or, in plain English (replacing `\(x\)` by *data* and `\(y\)` by *hypothesis*):

`$$p(hypothesis|data) = \frac{p(data|hypothesis) \times p(hypothesis)}{sum\ of\ products}$$`

---

# Bayesian inference

For each statistical problem, we follow three steps (from [Gelman et al., 2013](http://www.stat.columbia.edu/~gelman/book/)):

- Building the model (the story of the data): likelihood + priors

- Updating the model with the information contained in the data using Bayes theorem (aka computing or approximating the posterior probability)

- Evaluating the model, its fit, and its assumptions, summarising the results, and readjusting the model

--

> "*Bayesian inference is really just counting and comparing possibilities [...]
in order to make good inference about what actually happened, it helps to consider everything that could have happened."* ([McElreath, 2015](http://xcelab.net/rm/statistical-rethinking/))

---

# The problem with posteriors

Relatively simple problems, like estimating the mean `\(\mu\)` and standard deviation `\(\sigma\)` of a Normal distribution with conjugate priors, have a closed-form solution:

`$$\color{purple}{p(\mu, \sigma | x)} = \frac{\prod_{i} \color{orangered}{\mathrm{Normal}(x_{i}|\mu, \sigma)}\color{steelblue}{\mathrm{Normal}(\mu|0,10)\mathrm{Uniform}(\sigma|0,50)}} {\color{green}{\int \int \prod_{i} \mathrm{Normal}(x_{i}|\mu,\sigma)\mathrm{Normal}(\mu|0,10)\mathrm{Uniform}(\sigma|0,50)d\mu d\sigma}}$$`

`$$\color{purple}{p(\mu, \sigma | x)} \propto \prod_{i} \color{orangered}{\mathrm{Normal}(x_{i}|\mu, \sigma)}\color{steelblue}{\mathrm{Normal}(\mu|0,10)\mathrm{Uniform}(\sigma|0,50)}$$`

--

However, for models containing more parameters or featuring non-conjugate priors, the integral (the *marginalising constant* in green) may be very difficult or impossible to compute.

---

# Sampling the unknowns

Instead of computing the marginalising constant, we will **approximate** the shape of the posterior distribution. Different methods exist...

--

* Grid sampling: computing the exact value of the posterior probability at a finite number of points (can be computationally very expensive)

--

* Sampling from the distribution using Markov Chain Monte Carlo algorithms (much more efficient than grid approximation)...
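Grid sampling can be sketched in a few lines of R, here approximating the posterior for the proportion of blue marbles after observing 2 blues out of 3 draws (a sketch, assuming a flat prior):

```r
p_grid <- seq(from = 0, to = 1, length.out = 1000)  # grid of candidate values for p
prior <- rep(1, 1000)                               # flat prior
likelihood <- dbinom(2, size = 3, prob = p_grid)    # 2 blue marbles out of 3 draws
unstd_posterior <- likelihood * prior
posterior <- unstd_posterior / sum(unstd_posterior) # standardise so it sums to 1
p_grid[which.max(posterior)]                        # posterior mode, close to 2/3
```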
---

class: middle, center

<embed src="https://chi-feng.github.io/mcmc-demo/app.html?algorithm=RandomWalkMH&target=standard" width=1000px height=600px />

---

class: middle, center

<embed src="https://chi-feng.github.io/mcmc-demo/app.html?algorithm=HamiltonianMC&target=standard" width=1000px height=600px />

---

class: middle, center

# Theoretical background - Multilevel linear models

---

# Linear models

What follows is an excerpt from this [blogpost](https://www.barelysignificant.com/post/glm/): "*Using R to make sense of the generalised linear model*".

> "[Statistical] models are devices that connect theories to data. A model is an instantiation of a theory as a set of probabilistic statements" ([Rouder, Morey, & Wagenmakers, 2016](https://www.collabra.org/articles/10.1525/collabra.28/)).

--

The usual linear model is of the following form:

$$
`\begin{aligned}
y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma) \\
\mu_{i} &= \alpha + \beta \cdot x_{i} \\
\end{aligned}`
$$

Where, in Bayesian terms, the first line of the model corresponds to the *likelihood* of the model, which is the assumption made about the data generation process. We make the assumption that the outcomes `\(y_{i}\)` are normally distributed around a mean `\(\mu_{i}\)` with some error `\(\sigma\)`. This is equivalent to saying that the errors are normally distributed around `\(0\)`.

---

# Linear models

Below, we illustrate a simple Gaussian linear model using the `Howell1` dataset from the `rethinking` package ([McElreath, 2016](https://github.com/rmcelreath/rethinking)), which contains data about 544 individuals, including height (centimetres), weight (kilograms), age (years) and gender (0 indicating female and 1 indicating male).

<img src="RGUG2019_files/figure-html/howell-1.svg" width="66%" style="display: block; margin: auto;" />

---

# Linear models

A quick visual exploration of the dataset reveals a positive relationship between height and weight.
The previously plotted regression line corresponds to the following model, where we assume a Normal likelihood:

$$
`\begin{aligned}
\text{height}_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma) \\
\mu_{i} &= \alpha + \beta \cdot \text{weight}_{i} \\
\end{aligned}`
$$

--

This model can be fitted easily in `R` with the following syntax.

```r
(mod1 <- lm(height ~ weight, data = d) )
```

```
## 
## Call:
## lm(formula = height ~ weight, data = d)
## 
## Coefficients:
## (Intercept)       weight  
##     113.879        0.905
```

---

# Generalised linear model

Of course, the distributional assumption is not restricted to be Gaussian, and can be adapted to whatever distribution makes sense for the data at hand. Generalising to other distributions, the generalised linear model can be rewritten as:

$$
`\begin{aligned}
y_{i} &\sim \mathrm{D}(f(\eta_{i}), \theta) \\
\eta &= \mathbf{X} \beta \\
\end{aligned}`
$$

Where the response `\(y_{i}\)` is **predicted** through the linear combination `\(\eta\)` of predictors transformed by the inverse link function `\(f\)`, assuming a certain distribution `\(D\)` for `\(y\)` (also called the *family*), and family-specific parameters `\(\theta\)` ([Bürkner, 2017](https://www.jstatsoft.org/article/view/v080i01)).

---

# Multilevel modelling

Multilevel models generalise linear models by allowing some parameters (e.g., intercepts) to vary by group.

$$
`\begin{align}
\text{Level 1}: y_{i} &\sim \mathrm{Normal}(\mu_{i},\sigma) \\
\mu_{i} &= \alpha_{\text{group}[i]} + \beta \cdot x_{i} \\
\text{Level 2}: \alpha_{\text{group}} &\sim \mathrm{Normal}(\alpha,\sigma_{\text{group}}) \\
\end{align}`
$$

The "prior" on the group intercepts `\(\alpha_{\text{group}}\)` is a function of two parameters, `\(\alpha\)` and `\(\sigma_{\text{group}}\)`. These parameters are called **hyper-parameters**, as they are parameters for parameters. There are two levels in the model...
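As a sketch of what these two levels describe, we can simulate grouped data from this model and recover the parameters with `lme4` (all variable names and parameter values below are hypothetical):

```r
library(lme4)

set.seed(666)

# Level 2: ten group intercepts drawn from Normal(alpha = 50, sigma_group = 5)
alpha_group <- rnorm(10, mean = 50, sd = 5)

# Level 1: observations drawn around their group intercept, with slope beta = 2
d <- data.frame(group = rep(1:10, each = 20), x = rnorm(200) )
d$y <- alpha_group[d$group] + 2 * d$x + rnorm(200)

# Varying-intercept model, in lme4 syntax (mirrored by brms)
mod <- lmer(y ~ x + (1 | group), data = d)
fixef(mod)  # estimates should be close to alpha = 50 and beta = 2
```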
---

class: middle, center

<embed src="http://mfviz.com/hierarchical-models/" width=1000px height=600px />

---

# Shrinkage magic (Efron & Morris, 1977)

<img src="stein1.png" width="66%" style="display: block; margin: auto;" />

The James-Stein estimator is defined as `\(z = \bar{y} + c(y - \bar{y})\)`, where `\(\bar{y}\)` is the grand average, `\(y\)` an individual estimate, and `\(c\)` a constant, the "shrinking factor" ([Efron & Morris, 1977](http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf)).

---

# Shrinkage magic (Efron & Morris, 1977)

The James-Stein estimator is determined both by the variability of the measure (e.g., its standard deviation, influencing the shrinking factor `\(c\)`) and by its distance from the average estimate (i.e., `\(y - \bar{y}\)`). In other words, extreme observations are trusted less. Shrinkage therefore acts as **a safeguard against overfitting** in multilevel models.

<img src="stein2.png" width="75%" style="display: block; margin: auto;" />

---

class: middle, center

# Theoretical background - Introducing brms

---

# Under the hood: Stan

`Stan` is a probabilistic programming language written in `C++`. It implements MCMC algorithms such as HMC and NUTS, as well as optimisation algorithms (e.g., L-BFGS) and variational inference.

```stan
data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // s.e. of effect estimates
}
parameters {
  real mu;
  real<lower=0> tau;
  real eta[J];
}
transformed parameters {
  real theta[J];
  for (j in 1:J) theta[j] = mu + tau * eta[j];
}
model {
  target += normal_lpdf(eta | 0, 1);
  target += normal_lpdf(y | theta, sigma);
}
```

---

# Bayesian Regression Models using Stan

The `brms` package ([Bürkner, 2017](https://www.jstatsoft.org/article/view/v080i01)) allows fitting complex Bayesian multilevel linear and non-linear models in `Stan` using an intuitive high-level syntax (similar to the syntax of `lme4`).
For instance, the following model:

$$
`\begin{align}
y_{i} &\sim \mathrm{Normal}(\mu_{i}, \sigma) \\
\mu_{i} &= \alpha + \alpha_{subject[i]} + \alpha_{item[i]} + \beta \cdot x_{i} \\
\end{align}`
$$

can be fitted using `brms` (as with `lme4`) as follows:

```r
brm(y ~ x + (1|subject) + (1|item), data = d, family = gaussian() )
```

---

# Example: Analysing the sleepstudy dataset

The `sleepstudy` dataset contains data from the study described in [Belenky et al. (2003)](https://www.ncbi.nlm.nih.gov/pubmed/12603781). It describes the average reaction time per day for subjects in a sleep deprivation study. On day 0 the subjects had their normal amount of sleep. Starting that night, they were restricted to 3 hours of sleep per night. The observations represent the average reaction time on a series of tests given each day to each subject.

```r
library(lme4)
data(sleepstudy)
head(sleepstudy, 10)
```

```
##    Reaction Days Subject
## 1  249.5600    0     308
## 2  258.7047    1     308
## 3  250.8006    2     308
## 4  321.4398    3     308
## 5  356.8519    4     308
## 6  414.6901    5     308
## 7  382.2038    6     308
## 8  290.1486    7     308
## 9  430.5853    8     308
## 10 466.3535    9     308
```

---

# Example: Analysing the sleepstudy dataset

```r
sleepstudy %>%
  ggplot(aes(x = Days, y = Reaction) ) +
  geom_smooth(method = "lm", colour = "black") +
  geom_point() +
  facet_wrap(~Subject, nrow = 2) +
  theme_bw(base_size = 10)
```

<img src="RGUG2019_files/figure-html/plotsleep-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Example: Analysing the sleepstudy dataset

The `get_prior()` function returns a list of all priors to be specified given some `formula` and some data, as well as the default priors (if any) used by `brms`.
```r
library(brms)
get_prior(Reaction ~ Days + (1 + Days|Subject), sleepstudy)
```

```
##                   prior     class      coef   group resp dpar nlpar bound
## 1                               b                                        
## 2                               b      Days                              
## 3                lkj(1)       cor                                        
## 4                             cor           Subject                      
## 5  student_t(3, 289, 59) Intercept                                      
## 6    student_t(3, 0, 59)        sd                                      
## 7                              sd           Subject                      
## 8                              sd      Days Subject                      
## 9                              sd Intercept Subject                      
## 10   student_t(3, 0, 59)     sigma
```

---

# Example: Analysing the sleepstudy dataset

Priors can be specified in several ways, as explained in the documentation (accessible by typing `?prior`).

```r
prior1 <- c(
  prior(normal(200, 10), class = Intercept),
  prior(normal(0, 10), class = b, coef = Days),
  prior(cauchy(0, 10), class = sigma),
  prior(lkj(2), class = cor)
  )
```

```r
mod1 <- brm(
  Reaction ~ Days + (1 + Days | Subject),
  data = sleepstudy,
  family = gaussian(),
  prior = prior1,
  warmup = 2000, iter = 1e4,
  chains = 2
  )
```

---

# Example: Analysing the sleepstudy dataset

```r
posterior_summary(mod1, pars = c("^b_", "^sd_", "sigma"), probs = c(0.025, 0.975) )
```

```
##                          Estimate Est.Error       Q2.5      Q97.5
## b_Intercept           216.0746641 15.535217 183.859586 245.620803
## b_Days                  0.7233977  3.510254  -6.507006   7.198048
## sd_Subject__Intercept  45.6950094 15.437435  22.305404  81.634903
## sd_Subject__Days       12.2214325  3.974647   6.270876  21.588890
## sigma                  25.6498007  1.514462  22.878514  28.826137
```

<br>

--

<img src="easy.gif" width="33%" style="display: block; margin: auto;" />

---

# Some elements of syntax

The `brms` package features the same syntax as base `R` functions such as `lm()`, or as functions from the `lme4` package.

```r
Reaction ~ Days + (1 + Days | Subject)
```

--

The left-hand side of the formula defines the *outcome* (i.e., what is predicted). The `brms` package also allows fitting multivariate models (i.e., models with several outcomes) by combining these outcomes with `mvbind()`:

```r
mvbind(Reaction, Memory) ~ Days + (1 + Days | Subject)
```

--

The right-hand side of the formula defines the *predictors* (i.e., what is used to predict the outcome(s)).
Note that an intercept is implicitly assumed by default in `R`, so that the two notations below correspond to the same model.

```r
mvbind(Reaction, Memory) ~ Days + (1 + Days | Subject)
mvbind(Reaction, Memory) ~ 1 + Days + (1 + Days | Subject)
```

---

# Some elements of syntax

To fit a model without an intercept (because we can), we need to specify it explicitly as follows.

```r
mvbind(Reaction, Memory) ~ 0 + Days + (1 + Days | Subject)
```

--

The first part of the right-hand side of the formula corresponds to *constant effects* (aka *fixed effects*), whereas the second part (inside parentheses) corresponds to *varying effects* (aka *random effects*).

```r
mvbind(Reaction, Memory) ~ Days + (1 | Subject)
mvbind(Reaction, Memory) ~ Days + (Days | Subject)
```

The first model above contains an intercept varying by `Subject`. The second model additionally contains a slope varying by `Subject`.

---

# Some elements of syntax

When we include several varying effects (e.g., both a varying intercept and a varying slope), `brms` estimates the correlation between each pair of varying effects by default. If we are not interested in estimating this correlation, we can remove this parameter from the model (which is equivalent to fixing its value to 0) by using `||`.

```r
mvbind(Reaction, Memory) ~ Days + (1 + Days || Subject)
```

--

The previous models assumed a Gaussian response distribution. We can change this assumption by using the `family` argument.

```r
brm(Reaction ~ 1 + Days + (1 + Days | Subject), family = lognormal() )
```

---

# Helpful brms functions

```r
# Specify the model using R formulas:
brmsformula(formula, ...)

# Generate the Stan code:
make_stancode(formula, ...)
stancode(fit)

# Generate the data passed to Stan:
make_standata(formula, ...)
standata(fit)

# Handle priors:
get_prior(formula, ...)
set_prior(prior, ...)

# Generate expected values and predictions:
fitted(fit, ...)
predict(fit, ...)
marginal_effects(fit, ...)
# Model comparison:
loo(fit1, fit2, ...)
bayes_factor(fit1, fit2, ...)
model_weights(fit1, fit2, ...)

# Hypothesis testing:
hypothesis(fit, hypothesis, ...)
```

---

class: middle, center

# Worked example #1 - Experimental absenteeism

---

# Worked example #1 - Experimental absenteeism

Working with human subjects requires a minimum amount of reciprocal cooperation. But that requirement is not always fulfilled. A non-negligible proportion of students who register for a Psychology experiment in our lab do not show up for the experiment... We wanted to estimate the probability that a registered participant would come to the lab, depending on whether they received a reminder by email (this example is presented in more depth in two blogposts, available [here](http://www.barelysignificant.com/post/absenteeism/) and [there](http://www.barelysignificant.com/post/absenteeism2/)).

```r
library(tidyverse)
data <- read.csv("absenteeism.csv")
data %>% sample_frac %>% head(10)
```

```
##    reminder researcher presence absence total
## 1         1          3       87       9    96
## 2        -1          5       34      49    83
## 3        -1         10       23      58    81
## 4        -1          8       38      54    92
## 5         1          5       65      18    83
## 6        -1          1       16      86   102
## 7        -1          4       61      34    95
## 8         1          8       81      11    92
## 9         1          1       96       6   102
## 10       -1          2       53      59   112
```

---

# Worked example #1 - Experimental absenteeism

In other words, we want to estimate the probability `\(p_{i}\)` that a participant is present. We assume `\(y_{i}\)` (i.e., the number of participants being present, out of `\(n_{i}\)` participants) to be binomially distributed with probability of being present `\(p_{i}\)`.

`$$y_{i} \sim \mathrm{Binomial}(n_{i}, p_{i})$$`

--

The probability `\(p_{i}\)` is predicted via a linear combination of some predictors:

`$$logit(p_{i}) = log \Big(\frac{p_{i}}{1-p_{i}}\Big) = \alpha + \beta \times \text{reminder}_{i}$$`

Where `\(logit(\cdot)\)` is a **link function**, relating the bounded scale of the parameter (i.e., `\(0 \leq p_{i} \leq 1\)`) to the unbounded scale of the predictors.
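In base R, the link function and its inverse are available as `qlogis()` and `plogis()` (a quick sketch):

```r
p <- 0.75
(log_odds <- qlogis(p) )  # logit: log(0.75 / 0.25), approximately 1.1
plogis(log_odds)          # the inverse logit (logistic) maps it back to 0.75
```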
---

# Worked example #1 - Experimental absenteeism

We can extend this model on several levels, allowing the average probability of presence (i.e., the intercept) and the effect of the reminder (i.e., the slope) to vary by researcher.

<img src="logit.png" width="50%" style="display: block; margin: auto;" />

<!--
$$
`\begin{aligned}
y_{i} &\sim \mathrm{Binomial}(n_{i}, p_{i}) \\
logit(p_{i}) &= \alpha_{researcher_{[i]}} + \beta_{researcher_{[i]}} \times \text{reminder}_{i} \\
`\begin{bmatrix} \alpha_{\text{researcher}} \\ \beta_{\text{researcher}} \\ \end{bmatrix}` &\sim \mathrm{MVNormal}\bigg(\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \textbf{S}\bigg) \\
\textbf{S} &= `\begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \\ \end{pmatrix}` \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \\ \end{pmatrix} \\
\alpha &\sim \mathrm{Normal}(0, 10) \\
\beta &\sim \mathrm{Normal}(0, 10) \\
(\sigma_{\alpha}, \sigma_{\beta}) &\sim \mathrm{HalfCauchy}(0, 10) \\
\textbf{R} &\sim \mathrm{LKJcorr}(2) \\
\end{aligned}`
$$
-->

---

# Worked example #1 - Experimental absenteeism

```r
prior2 <- c(
  prior(normal(0, 10), class = Intercept, coef = ""),
  prior(cauchy(0, 10), class = sd),
  prior(normal(0, 10), class = b),
  prior(lkj(2), class = cor)
  )
```

```r
mod2 <- brm(
  presence | trials(total) ~ 1 + reminder + (1 + reminder|researcher),
  family = binomial(link = "logit"),
  prior = prior2,
  data = data,
  sample_prior = TRUE,
  warmup = 2000, iter = 1e4,
  cores = parallel::detectCores(),
  control = list(adapt_delta = 0.95)
  )
```

---

# Worked example #1 - Experimental absenteeism

```r
mod2 %>%
  plot(
    combo = c("hist", "trace"), widths = c(1, 1.5),
    theme = theme_bw(base_size = 16)
    )
```

<img src="RGUG2019_files/figure-html/unnamed-chunk-28-1.svg" width="66%" style="display: block; margin: auto;" />

---

# Worked example #1 - Experimental absenteeism

```r
posterior_summary(mod2, pars = c("^b_", "^sd_"), probs = c(0.025, 0.975) )
```

```
##                           Estimate Est.Error      Q2.5     Q97.5
##
## b_Intercept              0.8403057 0.2960996 0.2602550 1.4320422
## b_reminder               1.6052251 0.1762257 1.2564517 1.9665889
## sd_researcher__Intercept 0.8702792 0.2855384 0.4812413 1.5746246
## sd_researcher__reminder  0.4777322 0.1742964 0.2295891 0.8997469
```

--

These estimates are on the log-odds scale... To interpret them more easily, we need to apply the inverse link function (i.e., the inverse logit). For instance, the average probability of presence (i.e., regardless of researcher or reminder status) is given by `\(p = \exp(\alpha) / (1 + \exp(\alpha) )\)`.

--

```r
# retrieving the intercept
a <- fixef(mod2)[1]

# transforming it back to the probability scale (equivalent to plogis(a))
exp(a) / (1 + exp(a) )
```

```
## [1] 0.6985296
```

---

# Worked example #1 - Experimental absenteeism

We then assess the effect of the reminder email. Here again, it is difficult to interpret the raw estimates... but we know that `\(\text{exp}(\beta)\)` gives an [odds ratio](https://en.wikipedia.org/wiki/Odds_ratio).

```r
fixef(mod2)[2, c(1, 3, 4)] %>% exp
```

```
## Estimate     Q2.5    Q97.5 
## 4.978980 3.512934 7.146258
```

We can interpret this ratio (OR = 4.99, 95% CrI [3.51, 7.15]) by saying that the odds of a participant being present are about 5 times higher when a reminder email was sent...

---

# Plotting the predictions of the model

An elegant way of representing the predictions of the model and their uncertainty is to directly plot some samples (i.e., some regression lines) from the posterior distribution. We call this kind of plot a *spaghetti plot*.
<img src="RGUG2019_files/figure-html/unnamed-chunk-32-1.svg" width="66%" style="display: block; margin: auto;" />

---

# Plotting the predictions of the model

```r
library(modelr)    # for data_grid() and seq_range()
library(tidybayes) # for add_fitted_samples()

data %>%
  group_by(researcher, total) %>%
  data_grid(reminder = seq_range(reminder, n = 1e2) ) %>%
  add_fitted_samples(mod2, newdata = ., n = 100, scale = "linear") %>%
  mutate(estimate = plogis(estimate) ) %>%
  ggplot(aes(x = reminder, y = estimate, group = .iteration) ) +
  geom_hline(yintercept = 0.5, lty = 2) +
  geom_line(aes(y = estimate, group = .iteration), size = 0.5, alpha = 0.1) +
  facet_wrap(~researcher, nrow = 2) +
  theme_bw(base_size = 20) +
  labs(x = "Reminder", y = "Estimate")
```

<img src="RGUG2019_files/figure-html/unnamed-chunk-33-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Hypothesis test - 1

The `brms` package offers several ways of testing hypotheses. The `hypothesis()` function computes an *evidence ratio* (equivalent to a Bayes factor for point hypotheses, such as `\(\theta = 0\)`). When testing point hypotheses, this evidence ratio is computed using the **Savage-Dickey** method.

```r
(hyp1 <- hypothesis(mod2, "reminder = 0") )
```

```
## Hypothesis Tests for class b:
##       Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (reminder) = 0     1.61      0.18     1.26     1.97          0         0    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
``` ```r 1 / hyp1$hypothesis$Evid.Ratio ``` ``` ## [1] -7.810388e+15 ``` --- # Hypothesis test - 1 ```r plot(hyp1, theme = theme_bw(base_size = 20) ) ``` <img src="RGUG2019_files/figure-html/unnamed-chunk-35-1.svg" width="50%" style="display: block; margin: auto;" /> --- # Comparing the posterior to the prior ```r data.frame(prior = hyp1$prior_samples$H1, posterior = hyp1$samples$H1) %>% gather(type, value) %>% ggplot(aes(x = value) ) + geom_histogram(bins = 50, alpha = 0.8) + geom_vline(xintercept = 0, lty = 2, size = 1) + facet_wrap(~type, scales = "free") + xlab(expression(beta[reminder]) ) + theme_bw(base_size = 20) ``` <img src="RGUG2019_files/figure-html/unnamed-chunk-36-1.svg" width="75%" style="display: block; margin: auto;" /> --- # Hypothesis test - 2 We can also follow the **model comparison** approach. Testing the hypothesis of `\(\theta = 0\)` is equivalent to comparing two models: a model including the effect of interest and a model without this effect. ```r prior3 <- c( prior(normal(0, 10), class = Intercept, coef = ""), prior(cauchy(0, 10), class = sd), prior(lkj(2), class = cor) ) mod2 <- brm(presence | trials(total) ~ 1 + reminder + (1 + reminder|researcher), family = binomial(link = "logit"), prior = prior2, data = data, # this line is important for bridgesampling save_all_pars = TRUE, warmup = 2000, iter = 1e4, cores = parallel::detectCores(), control = list(adapt_delta = 0.95) ) mod3 <- brm(presence | trials(total) ~ 1 + (1 + reminder|researcher), family = binomial(link = "logit"), prior = prior3, data = data, save_all_pars = TRUE, warmup = 2000, iter = 1e4, cores = parallel::detectCores(), control = list(adapt_delta = 0.95) ) ``` --- # Hypothesis test - 2 We can then compare the marginal likelihood of these two models (i.e., compute a Bayes factor). 
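To build intuition for what "comparing marginal likelihoods" means, here is a toy beta-binomial illustration (with made-up numbers, unrelated to `mod2` and `mod3`) in which both marginal likelihoods happen to have closed forms:

```r
# Bayes factor as a ratio of marginal likelihoods (toy example)
y <- 7   # hypothetical number of successes
n <- 10  # hypothetical number of trials

# M0: theta fixed at 0.5, so the marginal likelihood is just the likelihood
ml0 <- dbinom(y, n, prob = 0.5)

# M1: theta ~ Beta(1, 1); averaging the likelihood over this prior gives
# a closed-form marginal likelihood (which reduces to 1 / (n + 1) here)
ml1 <- choose(n, y) * beta(y + 1, n - y + 1)

(bf01 <- ml0 / ml1)  # BF in favour of M0 over M1, about 1.29
```

For models like `mod2` and `mod3`, these integrals have no closed form, which is precisely why `brms` resorts to bridge sampling to approximate them.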
The `brms` package offers the `bayes_factor()` method, which relies on an approximation of the marginal likelihood through `bridgesampling` ([Gronau & Singmann, 2017](https://arxiv.org/abs/1710.08162)). ```r bayes_factor(mod2, mod3) ``` ``` ## Iteration: 1 ## Iteration: 2 ## Iteration: 3 ## Iteration: 4 ## Iteration: 5 ## Iteration: 6 ## Iteration: 1 ## Iteration: 2 ## Iteration: 3 ## Iteration: 4 ## Iteration: 5 ## Iteration: 6 ## Iteration: 7 ## Iteration: 8 ``` ``` ## Estimated Bayes factor in favor of bridge1 over bridge2: 3015.96437 ``` --- # Posterior predictive checking Another important tool to assess the predictive abilities of the model is known as *posterior predictive checking* (PPC). The idea is simple: we compare the observed data to data generated from the posterior distribution. Once we have a posterior distribution on `\(\theta\)`, we can generate data from the *posterior predictive distribution*: `$$p(\widetilde{y} | y) = \int p(\widetilde{y} | \theta) p(\theta | y) d \theta$$` If a model is a good model, it should be able to generate data that looks like the observed data (e.g., [Gabry, Simpson, Vehtari, Betancourt, & Gelman, 2017](https://arxiv.org/abs/1709.01449)). In some sense, posterior predictive checking can be seen as a generalisation of the NHST logic. --- # Posterior predictive checking ```r data %>% ggplot(aes(x = presence / total) ) + geom_density(fill = "grey20") + theme_bw(base_size = 20) ``` <img src="RGUG2019_files/figure-html/unnamed-chunk-39-1.svg" width="75%" style="display: block; margin: auto;" /> --- # Posterior predictive checking This procedure is implemented in `brms` via the `pp_check()` method, which allows various kinds of checks. For instance, we compare below the posterior predictions of the model (n = 100) to the observed data.
```r pp_check(mod2, nsamples = 1e2) + theme_bw(base_size = 20) ``` <img src="RGUG2019_files/figure-html/unnamed-chunk-40-1.svg" width="75%" style="display: block; margin: auto;" /> --- # Posterior predictive checking ```r pp_check(mod2, nsamples = 1e3, type = "stat_2d") + theme_bw(base_size = 20) ``` <img src="RGUG2019_files/figure-html/unnamed-chunk-41-1.svg" width="66%" style="display: block; margin: auto;" /> --- # Adjusting Stan's behaviour When fitting more complex models, you will probably have to face warning messages such as "`There were x divergent transitions after warmup`". You can adjust `Stan`'s behaviour inside the `brm()` call via the `control` argument. ```r mod2 <- brm( presence | trials(total) ~ 1 + reminder + (1 + reminder|researcher), family = binomial(link = "logit"), prior = prior2, data = data, warmup = 2000, iter = 1e4, cores = parallel::detectCores(), # using all available cores control = list(adapt_delta = 0.95) # increasing the target acceptance rate ) ``` For instance, we can increase the target average acceptance rate via `adapt_delta` (set to 0.8 by default), which forces the sampler to take smaller steps. This will probably slow down the sampling but will also improve the "quality" of the obtained samples. More generally, pay attention to warnings and messages from `brms` as they are usually quite informative and useful. --- class: middle, center # Worked example #2 - Meta-analysis --- # Worked example #2 - Meta-analysis A meta-analysis is an analysis of analyses. It is (or can be) a linear model similar to those discussed previously, except that observations are (usually) already summarised by an effect size. We can treat these effect sizes as observations with a known variance (i.e., estimated and reported in the original articles included in the meta-analysis).
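To make the "known variance" idea concrete, here is a minimal (non-Bayesian) sketch of the classical inverse-variance pooling of effect sizes, using made-up numbers (the `yi`/`vi` names simply follow the usual meta-analytic convention):

```r
# Classical constant-effect estimate: each effect size is weighted
# by the inverse of its known sampling variance
yi <- c(0.74, 0.93, -0.12, 0.35)  # hypothetical effect sizes
vi <- c(0.30, 0.05, 0.18, 0.06)   # their (known) sampling variances

w   <- 1 / vi                # inverse-variance weights
est <- sum(w * yi) / sum(w)  # pooled effect size estimate
se  <- sqrt(1 / sum(w))      # standard error of the pooled estimate
c(estimate = est, se = se)
```

More precise studies (smaller `vi`) thus pull the pooled estimate more strongly towards their own effect size.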
There are (at least) two types of meta-analytic models: - **Constant-effect (aka fixed-effect) meta-analysis**: we consider that the effect size estimated in each study refers to the same population effect size -- - **Varying-effect (aka random-effect) meta-analysis**: we model the variability between studies and we exploit the similarities between existing groups (e.g., experiments from the same study or team) to improve the estimation of the overall effect (cf. the shrinkage magic). Note that a "fixed-effect" model can be considered as a "random-effect" model with a fixed `\(\tau = 0\)`. --- # Worked example #2 - Meta-analysis This dataset contains the results of 32 experiments aiming to assess the role of biomechanical constraints in the visual estimation of distances ([Molto, Nalborczyk, Palluel-Germain & Morgado, 2019](https://psyarxiv.com/2zutq/)). ```r d <- read.csv("meta.csv") head(d, 15) ``` ``` ## study experiment yi vi ## 1 Costello (2015) 1 0.735209366 0.29846545 ## 2 Costello (2015) 2 0.925134052 0.04796738 ## 3 Durgin & Russell (2008) 1 -0.124624069 0.17928177 ## 4 Hutchison & Loomis (2006) 1 -0.128328394 0.10001576 ## 5 Hutchison & Loomis (2006) 2 0.350487089 0.05963093 ## 6 Kirsch & Kunde (2012) 1 0.609278390 0.31866670 ## 7 Kirsch & Kunde (2013a) 1 0.511022255 0.16231426 ## 8 Kirsch & Kunde (2013a) 2 0.407867633 0.04216585 ## 9 Kirsch & Kunde (2013b) 1 0.229159882 0.32082216 ## 10 Kirsch & Kunde (2013b) 2 -0.223592130 0.08876431 ## 11 Kirsch & Kunde (2013b) 3 0.145865089 0.10038189 ## 12 Lessard, Linkenauger & P. (2009) 1 0.172318408 0.09078810 ## 13 Morgado & al. (2013) 1 0.002778791 0.13837508 ## 14 Osiurak & Morgado (2012) 1 0.460865217 0.22233808 ## 15 Proffitt, S, B & Epstein (2003) 1 0.351381281 0.10386001 ``` --- # Worked example #2 - Meta-analysis We can write a first model as follows.
$$ `\begin{aligned} y_{j} &\sim \mathrm{Normal}(\mu_{j}, \sigma_{j}) \\ \mu_{j} &= \alpha + \alpha_{study[j]} \\ \alpha_{study[j]} &\sim \mathrm{Normal}(0, \tau_{s}) \\ \alpha &\sim \mathrm{Normal}(0, 1) \\ \tau_{s} &\sim \mathrm{HalfCauchy}(0, 1) \\ \end{aligned}` $$ where `\(\sigma_{j}^2\)` is the variance of the effect in the `\(j\text{th}\)` study and `\(\alpha\)` is the population effect size. The varying effect `\(\alpha_{study[j]}\)` represents the deviation of the `\(j\text{th}\)` study from the population effect size (so that `\(\alpha + \alpha_{study[j]}\)` is the average effect size in the `\(j\text{th}\)` study). In addition to the sample variance `\(\sigma_{j}^2\)` (which is known), we also estimate the variability of the effect between studies `\(\text{Var}(\alpha_{study}) = \tau_{s}^{2}\)` (level 2). --- # Worked example #2 - Meta-analysis The previous model can be extended to more than two levels to take into account the dependency structures existing in the data. Indeed, each individual study (i.e., each published paper) may contain several experiments. We could expect experiments from the same study to be more similar than experiments from different studies... <img src="meta_structure.png" width="75%" style="display: block; margin: auto;" /> --- # Worked example #2 - Meta-analysis We can write this model on three levels, as follows: $$ `\begin{aligned} y_{ij} &\sim \mathrm{Normal}(\mu_{ij}, \sigma_{ij}) \\ \mu_{ij} &= \alpha + \alpha_{study[j]} + \alpha_{experiment[ij]} \\ \alpha_{study[j]} &\sim \mathrm{Normal}(0, \tau_{s}) \\ \alpha_{experiment[ij]} &\sim \mathrm{Normal}(0, \tau_{e}) \\ \alpha &\sim \mathrm{Normal}(0, 1) \\ \tau_{e}, \tau_{s} &\sim \mathrm{HalfCauchy}(0, 1) \\ \end{aligned}` $$ In addition to the residual variability `\(\sigma_{ij}\)`, we also estimate two other sources of variation: the variation of effects between different experiments from the same study `\(\text{Var}(\alpha_{experiment}) = \tau_{e}^{2}\)` (level 2) and the variation of the effect between different studies `\(\text{Var}(\alpha_{study}) = \tau_{s}^{2}\)` (level 3).
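One way to build intuition for this generative structure is to simulate fake data from it; below is a sketch using arbitrary "true" values for `\(\alpha\)`, `\(\tau_{s}\)` and `\(\tau_{e}\)` (all numbers and names are made up):

```r
set.seed(1234)              # for reproducibility
n_study <- 10; n_exp <- 3   # 10 hypothetical studies, 3 experiments each
alpha <- 0.4                # 'true' population effect size
tau_s <- 0.2; tau_e <- 0.3  # between-study and between-experiment SDs

study   <- rep(1:n_study, each = n_exp)
a_study <- rnorm(n_study, 0, tau_s)[study]    # study-level deviations
a_exp   <- rnorm(n_study * n_exp, 0, tau_e)   # experiment-level deviations
vi      <- runif(n_study * n_exp, 0.02, 0.3)  # known sampling variances
yi      <- rnorm(n_study * n_exp, alpha + a_study + a_exp, sqrt(vi))

sim <- data.frame(study, experiment = rep(1:n_exp, n_study), yi, vi)
```

Fitting the meta-analytic model to such simulated data, and checking that `\(\alpha\)`, `\(\tau_{s}\)` and `\(\tau_{e}\)` are recovered, is a useful sanity check before analysing the real dataset.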
--- # Worked example #2 - Meta-analysis This model is easily fitted with `brms`. ```r prior4 <- c( prior(normal(0, 1), coef = intercept), prior(cauchy(0, 1), class = sd) ) mod4 <- brm( yi | se(sqrt(vi) ) ~ 0 + intercept + (1|study) + (1|experiment), data = d, prior = prior4, save_all_pars = TRUE, warmup = 2000, iter = 1e4, cores = parallel::detectCores(), control = list(adapt_delta = .99) ) ``` --- # Worked example #2 - Meta-analysis ```r posterior_summary(mod4, pars = c("^b_", "^sd_"), probs = c(0.025, 0.975) ) ``` ``` ## Estimate Est.Error Q2.5 Q97.5 ## b_intercept 0.4155609 0.22557013 -0.03298233 0.8714997 ## sd_experiment__Intercept 0.3551237 0.26368540 0.05669669 1.0527606 ## sd_study__Intercept 0.2206812 0.09877247 0.03555515 0.4310317 ``` --- # Worked example #2 - Meta-analysis ```r mod4 %>% plot(combo = c("hist", "trace"), theme = theme_bw(base_size = 18) ) ``` <img src="RGUG2019_files/figure-html/unnamed-chunk-47-1.svg" width="66%" style="display: block; margin: auto;" /> --- # Worked example #2 - Meta-analysis <img src="forest.png" width="50%" style="display: block; margin: auto;" /> --- # Further resources A list of blogposts about `brms`: https://paul-buerkner.github.io/blog/brms-blogposts/ The article introducing `brms` ([Bürkner, 2017](https://www.jstatsoft.org/article/view/v080i01)) as well as the "advanced" version ([Bürkner, 2018](https://journal.r-project.org/archive/2018/RJ-2018-017/index.html)). A tutorial on ordinal logistic regression models using `brms` ([Bürkner & Vuorre, 2018](https://journals.sagepub.com/doi/10.1177/2515245918823199)). Our tutorial paper on Bayesian multilevel models using `brms` ([Nalborczyk et al., 2019](https://pubs.asha.org/doi/10.1044/2018_JSLHR-S-18-0006), [preprint](https://psyarxiv.com/guhsa/)) and another one ([Vasishth et al., 2018](https://www.sciencedirect.com/science/article/pii/S0095447017302310)). 
The materials of our 2019 doctoral course "Introduction à la modélisation statistique bayésienne" (in French, [slides + code](https://github.com/lnalborczyk/IMSB2019)). --- # Take-home messages * **Bayesian inference**: A general statistical inference framework relying on the use of Bayes' theorem (i.e., basic probability rules). Can be seen as an extension of logic, where conclusions are derived from premises (prior information) and evidence (information contained in the data). * **Multilevel models**: Hierarchical (or not) extensions of the linear model taking into account complex data structures to provide more precise estimations via partial pooling (cf. the concept of *shrinkage*). * **The brms package**: Bayesian equivalent of `lme4` (but way more flexible), providing an interface to the probabilistic language `Stan`. Allows fitting complex non-linear multilevel models. <br>
[<font>lnalborczyk</font>](https://twitter.com/lnalborczyk)
[<font>lnalborczyk</font>](https://github.com/lnalborczyk)
[<font>https://osf.io/ba8xt</font>](https://osf.io/profile/)
[<font>www.barelysignificant.com</font>](http://www.barelysignificant.com)