This post is a “blog version” of the vignette of my first R package, which is itself greatly inspired from the first post of this blog.

# Installation

You can either install the last release of `ESTER`

on the CRAN

`install.packages("ESTER")`

Or the development version from Github

```
if(!require(devtools)){install.packages("devtools")}
devtools::install_github("lnalborczyk/ESTER")
```

# Theoretic background

## Model comparison

We can outline three general principles guiding model-based inference in science:

**Simplicity and parsimony**(Occam’s razor). Model selection is a classic bias-variance trade-off.**Multiple working hypotheses**(Chamberlin, 1890). At any point in time there must be several hypotheses (models) under consideration, but number of alternatives should be kept small.**Strength of evidence**. Providing quantitative information to judge the*strength of evidence*is central to science (e.g., see Royall, 1997).

Steps of the model selection approach consist in establishing a set of \(R\) relevant models, then, based on data, ranking models (and attributing them weights), then, choosing the best model among the set, and finally making inferences from this best model. To rank models, we need tools that account for the basic principles (presented above) that makes a model a good model.

## Information criteria

### Akaike Information Criterion

In 1951, Kullback & Leibler published a now-famous paper that quantified the meaning of *information* as related to Fisher’s concept of sufficient statistics (Kullback & Leibler, 1951). They developped the Kullback-Leibler divergence (or *K-L information*) that measures the information that is lost when approximating a distribution with another distribution. K-L information can also be used to estimate the *distance* between the *full reality* and a specific model (Burnham & Anderson, 2004).

Two decades later, Akaike (1973) showed that this distance can be estimated by finding the parameters values that maximizes the probability of the data given the model. He used this relationship to derive a criterion known as the Akaike information criterion (AIC):

\[\text{AIC} = -2\log(\mathcal{L}(\hat{\theta}|\text{data}))+2K\]

where \(\theta\) is the set of model parameters, \(\mathcal{L}(\hat{\theta}|\text{data})\) is the likelihood of the candidate model given the data when evaluated at the MLE of \(\theta\), and \(K\) is the number of estimated parameters in the candidate model. The first component, \(-2log(\mathcal{L}(\hat{\theta}|\text{data}))\) is known as the **deviance** of the candidate model.

Basically, we can see that the AIC accounts for the goodness-of-fit of a model (i.e., the strength of evidence for this model), but penalizes it for having too much parameters, that is for not being parcimonious. Clearly, the smaller AIC, the better.

It is important to acknowledge that when \(n/K\) is superior to about 40, the “small sample AIC” (second-order bias correction), called AICc, should be used (Burnham & Anderson, 2004)^{1}.

\[\text{AICc} = \text{AIC}+\dfrac{2K(K+1)}{n-K-1}\]

### Bayesian Information Criterion

The Bayesian Information Criterion (BIC), was introduced by Schwarz (1978) as a competitor to the AIC. Schwarz derived the BIC to serve as an asymptotic approximation to a transformation of the Bayesian posterior probability of a candidate model. The computation of BIC is based on the empirical log-likelihood and does not require the specification of priors.

\[\text{BIC}=-2\log(\mathcal{L}(\hat{\theta}|\text{data}))+K\log(n)\]

We can see that the AIC and the BIC measure the same goodness-of-fit but the penalty term of the BIC is more stringent than the penalty term of the AIC (for \(n>8\), \(k\cdot\log\) exceeds \(2k\)). Consequently, the BIC tends to favor smaller models than the AIC.

## Toward Akaike weights and evidence ratios

The individual AIC, AICc, or BIC values are not interpretable in absolute terms, as they contain arbitrary constants and are much affected by sample size. Thus, it is imperative to rescale these criteria. Usually, this is done by substracting to the IC value of each model the IC of the model with the minimum IC value:

\[\Delta_{IC}=IC_{i}-IC_{min}\]

where \(IC_{min}\) is the minimum of the \(R\) different \(IC_{i}\) values. This transformation forces the best model to have \(\Delta_{IC}=0\), while the rest of the models have positive values.

For the AIC, the simple transformation \(exp(−\Delta_{i}/2)\), for \(i = 1, 2, ..., R\), provides the likelihood of the model given the data. This is a likelihood function over the model set in the sense that \(\mathcal{L}(\theta|data, g_i)\) is the likelihood over the parameter space (for model \(g_i\)) of the parameter \(\theta\), given the data (\(x\)) and the model (\(g_i\)).

It is convenient to normalise the model likelihoods such that they sum to 1 and treat them as probabilities. Hence, we use:

\[w_{i}=\dfrac{exp(-\Delta_{i}/2)}{\sum_{r=1}^{R}exp(-\Delta_{r}/2)}.\]

The \(w_i\), called the **Akaike weights**, are useful as the *weight of evidence* in favor of model \(g_i(\cdot |\theta)\), as being the actual Kullback-Leibler best model in the set. The ratios \(w_i/w_j\) are identical to the original likelihood ratios, \(\mathcal{L}(g_i|data)/\mathcal{L}(g_j|data)\), and so they are invariant to the model set, but the \(w_i\) values depend on the full model set (because they sum to 1).

Evidence can be judged by the relative likelihood of model pairs as \(\mathcal{L}(g_i|x)/\mathcal{L}(g_j|x)\) or, equivalently, the ratio of Akaike weights \(w_i/w_j\). Such ratios are called **evidence ratios** and represent the evidence about fitted models as to which is better in a Kullback-Leibler information sense.

# Sequential testing

The term *sequential analysis* simply refers to the process by which one collects data until to reach a predefined level of evidence. Implementations of this idea are known as the *Probability Ratio Test* (Wald & Wolfowitz, 1948), or the *group sequential designs* in the NHST framework (for an overview, see Lakens, 2014). More recently, Schönbrodt et al. (2017) and Schönbrodt & Wagenmakers (2017) have (re)introduced sequential testing procedures using Bayes Factors, in order to avoid the pitfalls of sequential testing in the NHST framework.

# Sequential testing with evidence ratios

`ESTER`

aims to address at least two different questions. On the first hand, it allows the exploration of the characteristics of the evidence ratios computed from the Akaike weights, when these are used in a sequential way. To this end, the `simER`

function allows exploration of these characteristics for evidence ratios computed from the AIC (or AICc) and the BIC.

On the other hand, `ESTER`

is also designed to run sequential analyses on observed data, which is implemented in the `seqER`

function. To sum, two main questions are adressed.

**Simulation**. Given an effect size and a sample size, what should I reasonnably expect ?**Observed data**. When to stop recruiting participants ?

## Simulation

This `simER`

function runs a simulated study in which we compare two independant groups, for various effect size (`cohensd`

) and sample size (`nmax`

). The `nmin`

argument serves to specify from which participant we want to start doing sequential testing (we usually recommend to avoid `nmin`

< 10). We can define a `boundary`

at which we would like to stop the sequential testing, as well as how many simulations we want to evaluate (`nsims`

).

```
library(ESTER)
ER <- simER(cohensd = 0.8, nmin = 20, nmax = 100, boundary = 10, nsims = 100, ic = bic)
```

```
## [1] "Simulation started at 2017-11-12 18:52:18"
## [1] "Simulation finished at 2017-11-12 18:52:44"
## Duration: Time difference of 26.47106 secs
```

`plot(ER, log = TRUE, hist = TRUE)`

Under the hood, `simER`

uses the `ictab`

function, which allows to compare a set of models fitted to the same data. This function returns either the AIC or the BIC of each model, along with the number of parameters \(k\) and the Akaike weights^{2}.

```
data(mtcars)
mod1 <- lm(mpg ~ cyl, mtcars)
mod2 <- lm(mpg ~ cyl + disp, mtcars)
mod3 <- lm(mpg ~ cyl * disp, mtcars)
mods <- list(m1 = mod1, m2 = mod2, m3 = mod3)
ictab(mods, bic)
```

```
## modnames ic k delta_ic ic_wt
## 3 m3 166.4781 5 0.0000 0.9388
## 2 m2 173.0086 4 6.5305 0.0359
## 1 m1 173.7036 3 7.2256 0.0253
```

Note that in these three functions, the information criterion has to be passed in lowercase, as `aic`

and `bic`

refer to functions from `ESTER`

^{3}.

## Observed data

On the other hand (and perhaps more interestingly), `ESTER`

can be used to do sequential testing on your own data. You can study the evolution of sequential ERs using the `seqER`

function.

```
data(mtcars)
mod1 <- lm(mpg ~ cyl, mtcars)
mod2 <- lm(mpg ~ cyl + disp, mtcars)
plot(seqER(ic = bic, mod1, mod2, nmin = 10) )
```

In addition, `seqER`

allows you to study the behavior of sequential ERs computed on your own data, along with sequential ERs computed on permutation samples. This feature might be useful –for instance– to study to what extent the evolution of evidence ratios you observed on the original sample was dependant to the order of the observations.

`plot(seqER(ic = bic, mod1, mod2, nmin = 10, nsims = 10) )`

When data involve repeated measures (and so multiple lines by subject), it is important to help `ESTER`

to figure out how the data is structured by indicating in which column are stored subject or observations numbers. For instance, below we use data from `lme4`

, and indicate that the subject number is stored in the `"Subject"`

column of the dataframe, by passing it to the `id`

argument of `seqER`

.

```
library(lme4)
data(sleepstudy)
mod1 <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
mod2 <- lmer(Reaction ~ Days + I(Days^2) + (1|Subject), sleepstudy)
plot(seqER(ic = bic, mod1, mod2, nmin = 10, id = "Subject", nsims = 10) )
```

If no information is passed to the `id`

argument, `seqER`

and `seqER`

will suppose that there is only one observation (i.e., one line) per subject.

# References

## Click to expand

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Caski (Eds.), Proceedings of the Second International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.

Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed). Ecological Modelling.

Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. *Sociological Methods & Research, 33*(2), 261–304.

Burnham, K. P., Anderson, D. R., & Huyvaert, K. P. (2011). AIC model selection and multimodel inference in behavioral ecology: Some background, observations, and comparisons. *Behavioral Ecology and Sociobiology, 65*(1), 23–35.

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. *European Journal of Social Psychology, 44*, 701–710.

Royall, R. (1997) Statistical Evidence: A likelihood paradigm, Chapman and Hall, CRC Press.

Schönbrodt, F. D., & Wagenmakers, E.-J. (2017). Bayes Factor Design Analysis: Planning for compelling evidence. *Psychonomic Bulletin & Review*.

Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. *Psychological Methods, 22*, 322–339.

Schwarz, G. (1978). Estimating the dimension of a model. *Annals of Statistics, 6*, 461-464.

Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. *Psychonomic Bulletin & Review, 11*(1), 192–196.

Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test. *The Annals of Mathematical Statistics, 19*, 326– 339.