Checking model assumptions? Look at the residuals, not the raw data

Just a quick blogpost to explain this tweet:

If you are in the habit of checking assumptions by looking at raw data, and didn’t know you should be looking at the model residuals instead, don’t feel bad. You are already doing two things that are good: (1) looking at your raw data; (2) caring about modelling assumptions.

Before going further, I would like to mention two additional relevant blogposts that were called to my attention. My tweet was about multilevel data (i.e., the case where you have multiple observations per subject) but, as a few people pointed out, the same holds for single-level data (one observation per subject). Here is a blogpost that explains why. That post also suggests that looking at raw distributions for single-level data might not be so bad, but only for the depressing reason that your effects are probably so tiny that they can only be revealed by special elite researchers high in a particular scientific quality known as “flair”. In any event, this take-away message is not applicable to multilevel data, which is probably like 90% of all data in psychology anyway.

The other blogpost, the brilliantly titled Checking model assumptions without getting paranoid, takes for granted the premise I will argue for here — namely, that you should be looking at your residuals rather than the raw data. If it’s already obvious to you why you should be doing that, skip this post and go read Jan’s. It is very clear and informative and will cure you of your paranoia that your data always sucks. If it’s not obvious, play around with my sparkly new shiny app below and discover for yourself why you shouldn’t be messing around with raw data.

An example

I want to keep this as simple and as brief as possible, so we will look at a very basic psychological example involving simple reaction time (RT). Let’s imagine you ran an experiment where you merely asked participants to press a button as quickly as possible in response to a dot appearing on the screen. Each of five participants did this task 500 times, and you measured their response times (in milliseconds). So you have a dataset with 2500 observations, and this dataset is multilevel in the sense that each participant contributes 500 observations toward the larger set.

People vary, so each of your five different participants will have different mean response times: some will be fast, others will be slow. So if you look at the raw data, you are looking at a mixture of five different distributions. The chances of this overall distribution looking anything like a normal distribution are very slim. You will see spooky things like multimodality and skewness. And if you apply the Shapiro-Wilk test of normality, it will inevitably confirm — with the ring of scientific truth only obtainable by p<.05 — that your data most definitely suck. Probably, your stats program will lack a sufficient number of decimal places in its output to express how truly sucky it is.
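To see this concretely, here is a small Python simulation. The subject means and trial-to-trial SD below are invented for illustration; they are not from any real study.

```python
# Illustrative simulation (numbers are assumed): five subjects with
# different mean RTs, each contributing 500 normally distributed trials.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
subject_means = [320, 380, 450, 520, 600]   # ms; made up for illustration
within_sd = 40                              # trial-to-trial sd, also assumed

# Pool all 2500 observations: a mixture of five different normals
rts = np.concatenate(
    [rng.normal(mu, within_sd, size=500) for mu in subject_means]
)

stat, p = shapiro(rts)
print(f"Shapiro-Wilk on raw pooled data: W = {stat:.3f}, p = {p:.2e}")
# With subject means this far apart, the test rejects normality decisively,
# even though every individual subject's trials are perfectly normal.
```

Plotting a histogram of `rts` would show the lumpy, multimodal shape described above; the point is that none of this lumpiness says anything about whether the model's assumptions hold.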

But this is all needless panic. The only distribution that you need to check is that of the model residuals. (But first you need a model!)

A multilevel model for these data is very simple:

$Y_{ij} = \mu + S_{i} + e_{ij}$

which says, “the $j$th response for subject $i$ equals a grand mean $\mu$ plus a random offset for subject $i$ plus residual error.”

In lme4 formula syntax, that would be:

Y ~ (1 | subject)

i.e., the simplest possible multilevel model, with just a fixed intercept (grand mean) and a by-subject random intercept.
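For readers working in Python rather than R, a roughly equivalent intercept-only mixed model can be fit with statsmodels. This is a sketch on simulated data; the subject means, variable names, and sample sizes are assumed for illustration, not taken from the post.

```python
# Python analogue of the lme4 model Y ~ (1 | subject), via statsmodels MixedLM.
# The data are simulated for illustration (all numbers are assumed).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
subject_means = [320, 380, 450, 520, 600]   # assumed mean RTs in ms
df = pd.DataFrame({
    "subject": np.repeat([f"s{i}" for i in range(5)], 500),
    "rt": np.concatenate([rng.normal(mu, 40, 500) for mu in subject_means]),
})

# Fixed intercept (grand mean) plus a by-subject random intercept
model = smf.mixedlm("rt ~ 1", df, groups=df["subject"])
result = model.fit()
print(result.summary())

# These residuals, not the raw rt values, are what carry the
# normality assumption discussed below.
resid = result.resid
```

The fitted fixed intercept should land near the grand mean of the data, and `result.resid` gives the within-subject residuals you would inspect for normality.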

There are two key distributional assumptions for this model:

1. $S_i \sim N(0, \tau^2)$
2. $e_{ij} \sim N(0, \sigma^2)$.

In plain English, the first assumption is that the by-subject offsets (random intercepts) are drawn from a normal distribution with a mean of zero and variance $\tau^2$. You can think of a random intercept as the difference between a subject’s true mean RT and the population’s true mean RT.

In plain English, the second assumption is that the residuals (what’s left of each observation after subtracting the grand mean and the subject’s random intercept) are drawn from a normal distribution with mean zero and variance $\sigma^2$.
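Here is a quick Python sketch of the residual check on simulated data (the numbers are invented for illustration). For a balanced, intercept-only design like this one, subtracting each subject's own mean from their observations is a reasonable stand-in for $\mu + S_i$; the residuals from a fitted mixed model would differ slightly because random intercepts are shrunken toward the grand mean.

```python
# Sketch of the residual check. Per-subject demeaning approximates
# subtracting (mu + S_i); exact model residuals would differ a little
# because of shrinkage. All numbers below are assumed for illustration.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
subject_means = [320, 380, 450, 520, 600]   # ms; made up
rts_by_subject = [rng.normal(mu, 40, size=500) for mu in subject_means]

# Residuals: what's left after removing each subject's own mean
resid = np.concatenate([x - x.mean() for x in rts_by_subject])

stat, p = shapiro(resid)
print(f"Shapiro-Wilk on residuals: W = {stat:.3f}, p = {p:.3f}")
# The residual noise was generated from a single normal distribution, so
# the multimodality of the pooled raw data disappears here, and the test
# should typically fail to reject.
```

Contrast this with running the same test on the pooled raw data, which rejects normality wildly even though every model assumption is satisfied.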

Note that there is no assumption that the $Y_{ij}$s themselves are normally distributed, i.e., nothing requires $Y_{ij} \sim N(\mu, \omega^2)$ for some variance $\omega^2$!

The app below focuses on the second assumption. [It is probably a good idea to check the first assumption as well, but a normality check on the random intercepts doesn’t make much sense when you have only 5 participants, as in this simplified example.]

Shiny web app

OK, math is fun, but now it’s time to play around with some simulated data. The data in the app were generated according to the above multilevel model, so the residuals are always drawn from a normal distribution. You can vary the subject means independently by moving the sliders, and see what happens to the two distributions. Below each plot you can see the results of a Shapiro-Wilk test. To see how the raw data reflect a mixture of five different distributions, click “Reveal subject distributions”. Slide the means around to try to make the raw data look multimodal or skewed. See what weird patterns you can create. Note that in every case, the residuals remain normally distributed: the model assumptions are fully satisfied.

Have fun, and when you’re finished, my take-home message is below.

https://dalejbarr.shinyapps.io/raw_vs_resids/

Take-home message

You can’t really assess whether your distributional assumptions are satisfied by looking at the raw data; instead, you should be looking at the model residuals. This is especially true for multilevel data. The raw data distribution will be a mixture distribution. It can look lumpy and weird, while the residuals are perfectly normal. Or, it can even look relatively normal while the residuals look weird (although this is far less likely).

To be clear, I’m not saying you shouldn’t look at your raw data. Look at your data early and often. Look at the raw data. Look at the residuals. Look at the model predictions. Just understand what the data can and cannot tell you at each stage. Raw data doesn’t provide much more than a rough-and-ready sanity check on data quality. More likely than not, it won’t look normally distributed, especially if it is multilevel. So, don’t panic!