Sequential Testing with Information Criteria and Evidence Ratios

Ladislas Nalborczyk
Univ. Grenoble Alpes, CNRS, LPNC (France) • Ghent University (Belgium)

The power problem

How many participants/observations should I collect for my next experiment in order to show the effect I want to show?

Usual algorithm (a priori NHST power analysis, NHST-PA): guess the population effect size, plug the numbers (effect size and requested power) into the magic hat (the power function), and plan for a specific sample size.
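In R, this usually boils down to a single call to a power function; the guessed effect size (d = 0.4) and the target power (.90) below are illustrative assumptions:

power.t.test(delta = 0.4, sd = 1, sig.level = .05, power = .90, type = "two.sample")
# returns the required sample size per group for a two-sample t-test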


Sequential testing

But if I already knew the effect size I am looking for, I would not bother running an experiment, right?

Another approach consists of recruiting participants until we reach a predefined level of evidence. This is called sequential testing.

But this requires that we have a way to quantify evidence.

Sequential Bayes Factor

Example of the Sequential Bayes Factor procedure as presented in Schönbrodt, Wagenmakers, Zehetleitner, & Perugini (2017) and Schönbrodt & Wagenmakers (2017).
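As an illustration, here is a minimal sketch of such a procedure for a one-sample t-test, using the BayesFactor package; the effect size, the minimum and maximum sample sizes, and the stopping boundary are arbitrary assumptions:

library(BayesFactor)

set.seed(666)
nmin <- 20; nmax <- 200; boundary <- 10
x <- rnorm(nmin, mean = 0.2, sd = 1) # initial batch of observations

repeat {
    bf <- extractBF(ttestBF(x, mu = 0))$bf # current Bayes factor (H1 vs. H0)
    if (bf > boundary || bf < 1 / boundary || length(x) >= nmax) break
    x <- c(x, rnorm(1, mean = 0.2, sd = 1)) # otherwise, recruit one more observation
}

c(n = length(x), bf = bf) # sample size at stopping and final Bayes factor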


Statistical toolbox

Quantifying the relative evidence for a hypothesis/model

↳ Bayes Factors

Making decisions while controlling error rates in the long-run

↳ NHST & p-values (à la Neyman-Pearson)

Comparing the (out-of-sample) predictive abilities of models

↳ Information Criteria (e.g., AIC, WAIC)

AIC and deviance

Hirotugu Akaike noticed that twice the negative log-likelihood of a model (its in-sample deviance) plus twice its number of parameters is approximately equal to the out-of-sample deviance of that model…

\[ \text{AIC} = \underbrace{-2\log(\mathcal{L}(\hat{\theta} \mid \text{data}))}_{\text{in-sample deviance}} + 2K \]

In-sample deviance: how badly a model explains the current dataset (the dataset that we used to fit the model).

Out-of-sample deviance: how badly a model would explain a future dataset drawn from the same data-generating process (the same population).

This is crazy, no?
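As a sanity check, here is a minimal sketch (with simulated data) that computes the AIC "by hand" from the log-likelihood and compares it to R's built-in AIC():

set.seed(666)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(100)
mod <- lm(y ~ 1 + x, data = d)

k <- attr(logLik(mod), "df") # number of estimated parameters (including sigma)
-2 * as.numeric(logLik(mod)) + 2 * k # AIC computed by hand...
AIC(mod) # ...matches the built-in function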

Akaike weights and evidence ratios

The individual AIC values are not interpretable in absolute terms, as they contain arbitrary constants. We usually rescale them by subtracting from the AIC of each model \( i \) the AIC of the model with the minimum AIC:

\[ \Delta_{i} = \text{AIC}_{i} - \text{AIC}_{\min} \]

It is convenient to normalise the relative model likelihoods, \( \exp(-\Delta_{i}/2) \), so that they sum to 1 and can be treated as probabilities:

\[ w_{i} = \dfrac{\exp(-\Delta_{i}/2)}{\sum_{r=1}^{R} \exp(-\Delta_{r}/2)} \]

Akaike weights and evidence ratios

The \( w_{i} \), called Akaike weights, can be interpreted as the weight of evidence in favour of model \( g_{i}(\cdot \mid \theta) \) being the actual Kullback-Leibler best model (the K-L best model) in the set.

Evidence can be judged by taking the ratio of two Akaike weights, \( w_{i}/w_{j} \). Such ratios are called evidence ratios (ERs) and quantify which of two fitted models is better in a Kullback-Leibler information sense (i.e., which one has the lowest expected out-of-sample deviance).
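As an illustration, here is a minimal sketch (with simulated data and hypothetical candidate models m1 to m3) showing how Akaike weights and an evidence ratio can be computed:

set.seed(666)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 + rnorm(100)

mods <- list(
    m1 = lm(y ~ 1, data = d),
    m2 = lm(y ~ 1 + x1, data = d),
    m3 = lm(y ~ 1 + x1 + x2, data = d)
    )

aic <- sapply(mods, AIC)
delta <- aic - min(aic) # AIC differences
w <- exp(-delta / 2) / sum(exp(-delta / 2)) # Akaike weights (sum to 1)
w["m2"] / w["m3"] # evidence ratio in favour of m2 over m3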

Consistency versus efficiency

BFs and AIC-based ERs have different asymptotic properties (because they answer different questions).

The (default t-test) Bayes Factor is said to be consistent for model identification (i.e., for identifying the true model). The AIC is not consistent for model identification, but it is an efficient estimator of the expected K-L information loss.

What is the goal? True model versus K-L best model

Simulation studies usually involve comparing two nested models, one of them being the true model (the model from which we generated the data).

mod1 <- lm(y ~ 1, data) # "true" model when effect size = 0
mod2 <- lm(y ~ 1 + x, data) # "true" model when effect size != 0

This situation is nicely addressed by using Bayes Factors (or, asymptotically equivalently, the BIC), which aim to identify the true model within the set of candidate models.
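As a minimal sketch of that (asymptotic) equivalence, an approximate Bayes factor can be obtained from the difference in BIC between two models, using \( \text{BF}_{21} \approx \exp[(\text{BIC}_{1} - \text{BIC}_{2}) / 2] \); the simulated data below (with a null effect of x) is just for illustration:

set.seed(666)
data <- data.frame(x = rnorm(50))
data$y <- rnorm(50) # data generated with a null effect of x
mod1 <- lm(y ~ 1, data)
mod2 <- lm(y ~ 1 + x, data)
exp((BIC(mod1) - BIC(mod2)) / 2) # approximate BF in favour of mod2 over mod1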

However, the goal of identifying the true model assumes that a true model is actually in the set of candidate models… Alternatively, one could instead aim at identifying the K-L best model (the model closest to full reality).

True model versus K-L best model

As an example, we are going to simulate data from the following model, which assumes tapering effect sizes. Let \( x_{1}, \dots, x_{8} \) be independent \( \mathrm{Normal}(0, 1) \) random variables. Given the \( x_{i} \), let

\[ y = 100 + 10 x_{1} + 5 x_{2} + 1 x_{3} + 0.5 x_{4} + 0.1 x_{5} + 0.05 x_{6} + 0.025 x_{7} + 0.01 x_{8} + \epsilon, \quad \epsilon \sim \mathrm{Normal}(0, 1) \]

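Below is a minimal sketch of this data-generating process (the sample size n = 100 is an arbitrary choice), with variable names matching the models compared next:

set.seed(666)
n <- 100 # arbitrary sample size
betas <- c(10, 5, 1, 0.5, 0.1, 0.05, 0.025, 0.01) # tapering effect sizes
X <- matrix(rnorm(n * 8), nrow = n, dimnames = list(NULL, paste0("x", 1:8)))
data <- data.frame(X)
data$y <- 100 + as.vector(X %*% betas) + rnorm(n, 0, 1)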

Then, we can compare two nested models (neither of them being the true model).

mod1 <- lm(y ~ 1 + x1 + x2 + x3 + x4, data)
mod2 <- lm(y ~ 1 + x1 + x2 + x3 + x4 + x5, data)

True model versus K-L best model

Simulation results for both sequential procedures (BF-based and ER-based) across different stopping boundaries, showing the proportions of runs stopping at the lower or upper boundary, or remaining inconclusive:

type  boundary  ASN (avg. sample number)  Lower hit  Upper hit  Inconclusive
BF    3         50.41                     98.8%      1.2%       0%
BF    6         53.19                     98.9%      1.1%       0%
BF    10        59.69                     98%        1.8%       0.2%
ER    3         107.62                    60.9%      38.1%      1%
ER    6         432.66                    0%         88.9%      11.1%
ER    10        503.02                    0%         82.9%      17.1%

Final thoughts

In many situations, the SBF and the AIC-based procedures will select the same model.

However, the properties of model selection based on Bayes Factors or BIC are usually evaluated using simulations with the goal of identifying the true model.

One could question whether it is sensible to assume a true model (an oxymoron) in real life, especially in social sciences (e.g., Burnham & Anderson, 2002; 2004). More realistic simulation studies should evaluate the properties of these procedures when the true model is not in the set of models.

As Finley (1985) notes: “[…] consistency can be an undesirable property in the context of selecting a model”. In other words, the goal of identifying the true model works fine… when there is a true model to be identified.

Final thoughts

An illustrative example, comparing two brms models using WAIC.

library(sticer)   # provides seqER() for sequential evidence ratios
library(lmisc)
library(brms)
library(ggplot2)  # for theme_bw() and geom_hline()
library(magrittr) # for the %>% pipe

data(brinley_data)

# intercept-only model versus a model including the effect of session
mod1 <- brm(outcome ~ 1 + (1|participant), data = brinley_data)
mod2 <- brm(outcome ~ 1 + session + (1|participant), data = brinley_data)

# sequential WAIC-based evidence ratio, computed from nmin = 20 participants onwards
sticer::seqER(ic = WAIC, mod1, mod2, nmin = 20, id = "participant") %>%
    plot() + theme_bw(base_size = 18) + geom_hline(yintercept = 1, linetype = 2)


Thanks

I am deeply indebted to Brice Beffara, Paul Bürkner, & Thierry Phénix for many fascinating conversations about model comparison and the way we think about information criteria.

lnalborczyk • osf.io/ba8xt • www.barelysignificant.com