## Model comparison

Statisticians love coins. And there is a reason for that. These toy examples are extremely useful to illustrate difficult concepts while still being expressible in a convenient mathematical framework. As an example, let’s say we want to estimate the bias \(\theta\) of a coin (i.e., its probability of landing heads up).

We can compare two models of the bias \(\theta\) that differ only by their prior on the value of \(\theta\). For instance, I could bet with a friend that the coin is probably biased toward tails (i.e., that it would, on average, land tails up more often than heads up), while my friend thinks instead that the coin is biased toward heads. However, while we both think that the coin is probably slightly biased, neither of us is completely certain of the exact value of this bias. Fortunately, probability theory gives us a way to express our predictions (as well as our uncertainty) in the form of probability distributions.

For convenience, we can write our predictions as two beta-binomial models:

\[ \begin{align} \mathcal{M_{1}} : y_{i} &\sim \mathrm{Binomial}(1, \theta) \\ \theta &\sim \mathrm{Beta}(6, 10) \\ \end{align} \]

\[ \begin{align} \mathcal{M_{2}} : y_{i} &\sim \mathrm{Binomial}(1, \theta) \\ \theta &\sim \mathrm{Beta}(20, 12) \\ \end{align} \]

Where \(\mathcal{M_{1}}\) represents my predictions about \(\theta\) while \(\mathcal{M_{2}}\) represents the predictions of my friend. These two priors are better understood visually.

My prior (in blue) is centered on \(\frac{\alpha}{\alpha + \beta} = 0.375\), while my friend's prior (in red) is centered on \(\frac{\alpha}{\alpha + \beta} = 0.625\). Note that my friend seems a bit more certain of their prediction than I am, as the red prior is a bit narrower than the blue one.
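These summaries follow from the closed-form mean and standard deviation of the Beta distribution. A quick sketch (in Python rather than the R used for the original analyses):

```python
from math import sqrt

def beta_summary(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

m1, s1 = beta_summary(6, 10)   # my prior
m2, s2 = beta_summary(20, 12)  # my friend's prior

print(m1, m2)   # 0.375 and 0.625
print(s1 > s2)  # True: my friend's prior is narrower than mine
```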

Now we can collect some data to test our respective hypotheses. We flip the coin 100 times and gather the following data:

```
## [1] 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 1
## [36] 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 1
## [71] 1 1 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1
```
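For bookkeeping, we can tally this sequence (here transcribed from the printed output into a Python list, since the marginal likelihoods below depend on the data only through the number of heads and the number of flips):

```python
# Outcomes transcribed from the printed output above (1 = heads, 0 = tails)
y = [0,0,1,1,1,1,1,1,1,1,0,0,0,1,0,1,1,0,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0,0,1,
     1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,1,0,1,1,1,1,0,1,1,0,1,0,1,1,0,1,0,1,
     1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,1]

n, heads = len(y), sum(y)
print(n, heads, n - heads)  # 100 flips: 48 heads, 52 tails
```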

Then, we would like to know under which hypothesis (i.e., under which prior assumption) these data are more probable. To answer this question, we can compare the *marginal likelihoods* of the two models above.

## Bayes Factors

Formally, the Bayes Factor is defined as *a ratio of marginal likelihoods* and is given by:

\[\text{BF}_{12} = \dfrac{p(D|H_{1})}{p(D|H_{2})} = \dfrac{\int p(\theta|H_{1}) p(D|\theta,H_{1}) \text{d}\theta}{\int p(\theta|H_{2}) p(D|\theta,H_{2}) \text{d}\theta} = \dfrac{\int \prod_{i} \mathrm{Binomial}(y_{i}|1, \theta) \, \mathrm{Beta}(\theta|6, 10) \, \text{d}\theta}{\int \prod_{i} \mathrm{Binomial}(y_{i}|1, \theta) \, \mathrm{Beta}(\theta|20, 12) \, \text{d}\theta}\]
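For beta-binomial models this integral actually has a closed form: with \(k\) heads in \(n\) flips and a \(\mathrm{Beta}(\alpha, \beta)\) prior, the marginal likelihood of the observed sequence is \(B(\alpha + k, \beta + n - k) / B(\alpha, \beta)\), where \(B\) is the beta function. A Python sketch (using scipy; function names are mine), which also checks the closed form against direct numerical integration:

```python
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad
from scipy.stats import beta as beta_dist

def marginal_likelihood(k, n, a, b):
    """Closed-form p(D | H) for a Bernoulli sequence with k heads in n flips
    under a Beta(a, b) prior: B(a + k, b + n - k) / B(a, b)."""
    return np.exp(betaln(a + k, b + n - k) - betaln(a, b))

def marginal_likelihood_numeric(k, n, a, b):
    """Same quantity obtained by integrating likelihood x prior over theta."""
    integrand = lambda t: t**k * (1 - t)**(n - k) * beta_dist.pdf(t, a, b)
    value, _ = quad(integrand, 0, 1)
    return value

def bayes_factor_12(k, n):
    """BF comparing the Beta(6, 10) prior (M1) to the Beta(20, 12) prior (M2)."""
    return marginal_likelihood(k, n, 6, 10) / marginal_likelihood(k, n, 20, 12)
```

Note that the Bayes Factor depends on the data only through the counts \(k\) and \(n\); tails-heavy data will push it above 1 (favouring my model), heads-heavy data below 1.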

In simple words, computing a Bayes Factor amounts to multiplying the prior by the *likelihood* (the information contained in the data). The frightening integral means that we multiply the *likelihood* by the prior for each possible value of \(\theta\), and that we sum all of the obtained values^{1}. This process is illustrated below.

We can read this animation starting from the lower left panel and going clockwise until the lower right panel. In the first panel (the lower left one), the black curve represents the likelihood function (which is the same for both models) and gives the probability of the observed data for each possible value of \(\theta\). The two coloured curves represent our priors: mine in blue, my friend's in red.

The moving dots represent the density of either the priors or the likelihood at each possible value of \(\theta\)^{2}. The *marginal likelihood* (see the upper left panel) is constructed by multiplying the probability assigned by the *prior* (the “height” of the blue/red dot) by the probability assigned by the *likelihood* function (the “height” of the black dot) for each possible value of \(\theta\).

We can then “unfold” this *marginal likelihood* by computing the *cumulative marginal likelihood* (see the upper right panel). This curve simply represents, for each value of \(\theta\), the value of the *marginal likelihood* at that point plus the values of the *marginal likelihood* at all previous values of \(\theta\) (a running sum). For instance, the value of the *cumulative marginal likelihood* at \(\theta = 0.14\) is equal to the value of the *marginal likelihood* at \(\theta = 0.14\), plus the value of the *marginal likelihood* at all the previous values of \(\theta\) between 0 and 0.14.

The Bayes Factor (see the lower right panel) is then simply computed as *the ratio of the cumulative marginal likelihoods at the upper limit of the interval of integration*.

Let’s say it again: the Bayes Factor is the ratio of the cumulative marginal likelihoods at the upper limit of the interval of integration. In our case, it corresponds to the ratio of the heights of the two *cumulative marginal likelihood* curves at \(\theta = 1\). The resulting Bayes Factor of 2.19 means that the data are 2.19 times more likely under my prior than under my friend's prior.
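This grid-based construction can be mimicked directly in code: multiply the prior density by the likelihood pointwise, accumulate the products, and take the ratio of the two curves at \(\theta = 1\). A Python sketch (the grid resolution is arbitrary, and the counts are read off the printed data above):

```python
import numpy as np
from scipy.stats import beta

theta = np.linspace(0, 1, 10_001)
dtheta = theta[1] - theta[0]
k, n = 48, 100  # heads and flips, tallied from the data above

# Likelihood of the observed Bernoulli sequence at each grid value of theta
likelihood = theta**k * (1 - theta)**(n - k)

def cumulative_marginal(a, b):
    """Running sum of prior density x likelihood: the cumulative marginal likelihood."""
    return np.cumsum(beta.pdf(theta, a, b) * likelihood * dtheta)

cml1 = cumulative_marginal(6, 10)   # my model
cml2 = cumulative_marginal(20, 12)  # my friend's model

# The Bayes Factor is the ratio of the two cumulative curves at theta = 1
bf_12 = cml1[-1] / cml2[-1]
```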

## They are not posterior odds

Be careful not to interpret Bayes Factors as *posterior odds*. Bayes Factors indicate how much we should update our *prior odds* in the light of new incoming data. They *do not tell us which hypothesis is the most probable*, given the data (unless the prior odds are 1:1).

Let’s take another example and compare two hypotheses:

- \(H_{0}\): there is no such thing as precognition
- \(H_{1}\): precognition does exist

We run an experiment and observe a \(BF_{10} = 27\). What are the posterior odds in favour of \(H_{1}\)? We can compute them by multiplying the Bayes Factor by the prior odds of the two hypotheses. How to define the prior odds is a tricky question, and this choice should be open to scrutiny and criticism. Here I use an arbitrary skeptical prior of 1:1000 against \(H_{1}\).

\[\underbrace{\dfrac{p(H_{1}|D)}{p(H_{0}|D)}}_{posterior\ odds} = \underbrace{\dfrac{27}{1}}_{Bayes\ factor} \times \underbrace{\dfrac{1}{1000}}_{prior\ odds} = \dfrac{27}{1000} = 0.027\]
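The arithmetic is a one-liner, and the resulting odds can be converted to a posterior probability via \(p = \text{odds} / (1 + \text{odds})\):

```python
bf_10 = 27             # observed Bayes Factor in favour of H1
prior_odds = 1 / 1000  # skeptical prior odds for H1 over H0

posterior_odds = bf_10 * prior_odds  # 27/1000 = 0.027

# Converting odds to a probability: p = odds / (1 + odds)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)  # ~0.026: H1 remains highly improbable
```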

This result demonstrates that a relatively high Bayes Factor (representing relatively strong evidence) should always be considered in the light of what was known before the data. Claims that are highly improbable a priori can remain highly improbable, even after having been corroborated by the data.

## Conclusions

Bayes Factors are increasingly used in psychology. Unfortunately, this increased use has not been accompanied by an increased understanding of what they are. I wrote this blogpost to illustrate and summarise what Bayes Factors are and how they can be interpreted as updating factors.

Curious readers might be interested in this shiny application, which reproduces the animation presented in this post while letting you play with the parameters of the priors and the likelihood. The R code to reproduce the animation is available here.

This post is greatly inspired by a series of blogposts from Alexander Etz, Jeff Rouder and Richard Morey (amongst others) on Bayes Factors (see references below).

## References


Etz, A. (2015, August 9). Understanding Bayes: Visualization of the Bayes Factor [Blog post]. Retrieved from https://alexanderetz.com/2015/08/09/understanding-bayes-visualization-of-bf/

Etz, A. (2015, April 15). Understanding Bayes: A Look at the Likelihood [Blog post]. Retrieved from https://alexanderetz.com/2015/04/15/understanding-bayes-a-look-at-the-likelihood/

Rouder, J. (2016, January 24). Roll Your Own: How to Compute Bayes Factors For Your Priors [Blog post]. Retrieved from http://jeffrouder.blogspot.be/2016/01/what-priors-should-i-use-part-i.html