For multilevel data, don’t assess distributional assumptions by plotting raw data. Sure, look at it, but keep in mind that the it can look wonky only because it’s a mixture distribution. What you really should be looking at are the *model residuals* #ThingsTheyDontTeachInStat101

— Dale Barr (@dalejbarr) August 2, 2018

If you are in the habit of checking assumptions by looking at raw data, and didn’t know you should be looking at the model residuals instead, don’t feel bad. You are already doing two things that are good: (1) looking at your raw data; (2) caring about modelling assumptions.

Before going further, I would like to mention two additional relevant blogposts that were called to my attention. My tweet was about multilevel data (i.e., the case where you have multiple observations per subject) but, as a few people pointed out, the same holds for single-level data (one observation per subject). Here is a blogpost that explains why. This post also suggests that looking at raw distributions for single-level data might not be so bad, but only for the depressing reason that your effects are probably so tiny that they can only be revealed by special elite researchers high in a particular scientific quality known as “flair”. In any event, this take-away message is not applicable to multilevel data, which is probably like 90% of all data in psychology anyway.

The other blogpost, the brilliantly titled Checking model assumptions without getting paranoid, takes for granted the premise I will argue for here — namely, that you should be looking at your residuals rather than the raw data. If it’s already obvious to you why you should be doing that, skip this post and go read Jan’s. It is very clear and informative and will cure you of your paranoia that your data always sucks. If it’s not obvious, play around with my sparkly new shiny app below and discover for yourself why you shouldn’t be messing around with raw data.

### An example

I want to keep this as simple and as brief as possible, so we will look at a very basic psychological example involving simple reaction time (RT). Let’s imagine you ran an experiment where you merely asked participants to press a button as quickly as possible in response to a dot appearing on the screen. Each of five participants did this task 500 times, and you measured their response times (in milliseconds). So you have a dataset with 2500 observations, and this dataset is multilevel in the sense that each participant contributes 500 observations toward the larger set.

People vary, so each of your five different participants will have different mean response times: some will be fast, others will be slow. So if you look at the raw data, you are looking at a mixture of five different distributions. The chances of this overall distribution looking anything like a normal distribution are very slim. You will see spooky things like multimodality and skewness. And if you apply the Shapiro-Wilk test of normality, it will inevitably confirm — with the ring of scientific truth only obtainable by p<.05 — that your data most definitely suck. Probably, your stats program will lack a sufficient number of decimal places in its output to express how truly sucky it is.
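To see the mixture effect concretely, here is a minimal simulation sketch using only the Python standard library. The five subject means, the residual SD of 40 ms, and the helper names `simulate_rts` and `skewness` are assumptions made up for illustration; they are not taken from the app or from any real dataset.

```python
import random
import statistics


def simulate_rts(subject_means, n_trials=500, resid_sd=40.0, seed=1):
    """Return {subject index: list of simulated RTs in ms}, normal within subject."""
    rng = random.Random(seed)
    return {
        i: [rng.gauss(mu, resid_sd) for _ in range(n_trials)]
        for i, mu in enumerate(subject_means)
    }


def skewness(xs):
    """Sample skewness: mean of cubed z-scores (0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / sd) ** 3 for x in xs)


# Hypothetical true mean RTs (ms): four quick participants and one slow one.
subject_means = [350, 400, 450, 500, 800]
data = simulate_rts(subject_means)

# Pooling ignores who produced each observation: this is the "raw data" view.
pooled = [rt for rts in data.values() for rt in rts]

print("pooled: SD", round(statistics.pstdev(pooled), 1),
      "skewness", round(skewness(pooled), 2))
for i, rts in data.items():
    print(f"subject {i}: SD {statistics.pstdev(rts):.1f}, "
          f"skewness {skewness(rts):.2f}")
```

Each participant's own distribution is symmetric with the same spread, yet the pooled data come out much more spread out and clearly skewed, purely because the participants differ in their means.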

But this is all needless panic. The only distribution that you need to check is that of the **model residuals**. (But first you need a model!)

A multilevel model for these data is very simple:

$$Y_{ij} = \beta_0 + S_{0i} + e_{ij}$$

which says, “the $j$th response for subject $i$ equals a grand mean $\beta_0$ plus a random offset $S_{0i}$ for subject $i$ plus residual error $e_{ij}$.”

In `lme4` formula syntax, that would be:

`Y ~ (1 | subject)`

i.e., the simplest possible multilevel model, with just a fixed intercept (grand mean) and a by-subject random intercept.

There are two key distributional assumptions for this model:

- $S_{0i} \sim N(0, \tau^2)$
- $e_{ij} \sim N(0, \sigma^2)$

In plain English, the first assumption is that the by-subject offsets (random intercepts) $S_{0i}$ are drawn from a normal distribution with a mean of zero and variance $\tau^2$. You can think of a random intercept as the difference between a subject’s true mean RT and the population’s true mean RT.

In plain English, the second assumption is that the residuals $e_{ij}$ (what’s left from each observation after subtracting the grand mean and the random intercept) are drawn from a normal distribution with mean zero and variance $\sigma^2$.

Note that there is **no** assumption of the $Y_{ij}$s being normally distributed, i.e. that $Y_{ij} \sim N(\mu, \sigma_Y^2)$!

The app below focuses on the second assumption. [It is probably a good idea to check the other assumption as well, but this doesn’t make sense when you have only 5 participants as we do in this simplified example.]
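Here is a rough sketch of where the residuals come from. It simulates data from a random-intercept model and then estimates the pieces by hand, using the classical ANOVA (method-of-moments) estimators for a balanced design. This is not what `lme4` does internally, although for balanced data like these the estimates should agree closely. The parameter values and all variable names are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(2018)

# Hypothetical true parameters (ms): grand mean, SD of the by-subject
# offsets, and residual SD. These values are invented for the example.
beta0, tau, sigma = 450.0, 60.0, 40.0
n_subj, n_obs = 5, 500

# Simulate: response = grand mean + subject offset + residual error.
offsets = [random.gauss(0, tau) for _ in range(n_subj)]
y = [[beta0 + s + random.gauss(0, sigma) for _ in range(n_obs)]
     for s in offsets]

# Estimates for the balanced case.
subj_means = [statistics.fmean(ys) for ys in y]
grand_mean = statistics.fmean(subj_means)
# Mean within-subject variance estimates the residual variance.
ms_within = statistics.fmean(statistics.variance(ys) for ys in y)
# Between-subject mean square estimates sigma^2 + n_obs * tau^2.
ms_between = n_obs * statistics.variance(subj_means)
tau2_hat = max(0.0, (ms_between - ms_within) / n_obs)

# Estimated random intercepts: subject deviations shrunk toward zero.
shrink = tau2_hat / (tau2_hat + ms_within / n_obs)
offset_hat = [shrink * (m - grand_mean) for m in subj_means]

# Residual = observation - grand mean - estimated random intercept.
resid = [obs - grand_mean - o
         for ys, o in zip(y, offset_hat)
         for obs in ys]

print("grand mean estimate:", round(grand_mean, 1))
print("residual SD estimate:", round(ms_within ** 0.5, 1))
print("offset SD estimate:", round(tau2_hat ** 0.5, 1))
print("shrinkage factor:", round(shrink, 4))
```

With 500 observations per subject the shrinkage factor sits very close to 1, so the residuals are essentially each observation minus that subject's own mean. With only five subjects, the estimate of the offset SD is based on five numbers and will be very noisy.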

### Shiny web app

OK, math is fun, but now it’s time to play around with some simulated data. The data in the app were generated according to the above multilevel model, so the residuals are always drawn from a normal distribution. You can vary the subject means independently by moving the sliders, and see what happens to the two distributions. Below each plot you can see the results of a Shapiro-Wilk test. To see how the raw data reflects a mixture of five different distributions, click “Reveal subject distributions”. Slide the means around to try to make the raw data look multimodal or skewed. See what weird patterns you can create. Note that in every case, the residuals remain normally distributed: the model assumptions are fully satisfied.
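If you would rather poke at this in code, the sketch below mimics the idea of the app in plain Python. It is not the app's code. Two substitutions are made so that it runs with the standard library alone: the Jarque-Bera test stands in for Shapiro-Wilk, and residuals are taken as deviations from each subject's sample mean, which is a close approximation when every subject has 500 trials. The function names and the three "slider settings" are hypothetical.

```python
import math
import random
import statistics


def jarque_bera(xs):
    """Jarque-Bera normality test; returns (statistic, p-value)."""
    n = len(xs)
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    z = [(x - m) / sd for x in xs]
    skew = statistics.fmean(v ** 3 for v in z)
    excess_kurt = statistics.fmean(v ** 4 for v in z) - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Under normality JB is asymptotically chi-square with 2 df, whose
    # upper-tail probability is exactly exp(-jb / 2).
    return jb, math.exp(-jb / 2.0)


def raw_vs_residuals(subject_means, n_trials=500, resid_sd=40.0, seed=42):
    """Simulate one dataset; test the raw data and the residuals separately."""
    rng = random.Random(seed)
    raw, resid = [], []
    for mu in subject_means:
        rts = [rng.gauss(mu, resid_sd) for _ in range(n_trials)]
        subj_mean = statistics.fmean(rts)
        raw.extend(rts)
        # Approximation: subject sample mean stands in for the fitted
        # grand mean plus random intercept.
        resid.extend(rt - subj_mean for rt in rts)
    return jarque_bera(raw), jarque_bera(resid)


# Three hypothetical slider settings: identical means, one slow outlier
# subject, and two clusters of subjects.
for means in ([450] * 5, [350, 400, 450, 500, 800], [300, 300, 300, 700, 700]):
    (jb_raw, p_raw), (jb_res, p_res) = raw_vs_residuals(means)
    print(means, f"raw p = {p_raw:.3g}", f"residual p = {p_res:.3g}")
```

Only when all five means coincide do the raw data have a chance of passing. In the other two settings the raw data fail spectacularly, while the residuals are built from normal draws in every case.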

Have fun, and when you’re finished, my take-home message is below.

### Take home message

You can’t really assess whether your distributional assumptions are satisfied by looking at the raw data; instead, you should be looking at the model residuals. This is especially true for multilevel data. The raw data distribution will be a mixture distribution. It can look lumpy and weird, while the residuals are perfectly normal. Or, it can even look relatively normal while the residuals look weird (although this is far less likely).

To be clear, I’m not saying you shouldn’t look at your raw data. Look at your data early and often. Look at the raw data. Look at the residuals. Look at the model predictions. Just understand what the data can and cannot tell you at each stage. Raw data doesn’t provide much more than a rough-and-ready sanity check on data quality. More likely than not, it won’t look normally distributed, especially if it is multilevel. So, **don’t panic!**

Thanks for the post – I’ll have to digest it but it is really interesting.

Currently, I am trying to think of how to model the data that is reported in psychology journals. This is so that I can try to get a feel for whether two (or more) groups differ in a meaningful (rather than significant) way. Often, there is no plot, or plots, of data and the most commonly reported statistics are means and standard deviations (before going onto inferential statistics). I understand the standard deviation is informative for how “spread out” the data are around the mean. But, when I try to picture this in my head, I get a little stuck.

So, I had the idea of taking reported means and standard deviations in a paper, and using a normal distribution function to draw the probability density function (PDF) for each group. Plotting these PDFs on the same plot (I hope) shows what the two groups’ data would look like if they were approximately normally distributed. I know this would be a very rough guide, with many assumptions.

For example, group A has a mean of 4.25 (SD=0.50) and group B has a mean of 4.71 (SD=0.60). These data are significantly different. But the mean difference is less than 0.5 of a point (on a 5-point Likert scale) and the groups are roughly equivalent in dispersion. Looking at the means and standard deviations, I would expect the data to overlap to a high degree, and this is what I see in a PDF plot for both groups (based on a normal distribution).

Do you think this is an acceptable way to try and get a better ‘eyeball’ for data? Or if not, could you recommend another way, or ways?

Thanks.

This seems fine. There are also some apps where you can put in an effect size like Cohen’s d and it will plot the overlap in the distributions (like this one: http://rpsychologist.com/d3/cohend/). But for really understanding the data, I am a big believer in simulation. So, I would write code to generate data from a Likert scale with the desired means and SDs. This reverse engineering of raw data from SDs and means is something that James Heathers and Nick Brown have been working on with their SPRITE algorithm, so you might want to have a look at their work.
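As a rough illustration of both ideas (overlap and simulation), here is a small Python sketch using the means and SDs quoted in the question. It is not the SPRITE algorithm. The equal group sizes, the sample size of 200, and the round-and-clamp step in the hypothetical helper `simulate_likert` are all assumptions, since none of these details were reported.

```python
import random
from statistics import NormalDist, fmean, stdev

# Summary statistics quoted in the question above.
group_a = NormalDist(mu=4.25, sigma=0.50)
group_b = NormalDist(mu=4.71, sigma=0.60)

# Overlapping coefficient: shared area under the two normal curves (0 to 1).
overlap = group_a.overlap(group_b)

# Cohen's d with a pooled SD, ASSUMING equal group sizes (not reported).
pooled_sd = ((group_a.variance + group_b.variance) / 2) ** 0.5
cohens_d = (group_b.mean - group_a.mean) / pooled_sd


def simulate_likert(dist, n, rng):
    """Draw from the normal, round to whole points, clamp to the 1-5 scale."""
    return [min(5, max(1, round(rng.gauss(dist.mean, dist.stdev))))
            for _ in range(n)]


rng = random.Random(7)
sim_a = simulate_likert(group_a, 200, rng)  # n = 200 is an arbitrary choice
sim_b = simulate_likert(group_b, 200, rng)

print(f"overlap = {overlap:.2f}, Cohen's d = {cohens_d:.2f}")
print(f"simulated A: mean {fmean(sim_a):.2f}, SD {stdev(sim_a):.2f}")
print(f"simulated B: mean {fmean(sim_b):.2f}, SD {stdev(sim_b):.2f}")
```

The simulated means and SDs will not match the reported ones exactly, because rounding and the ceiling at 5 distort the normal draws. That mismatch is itself a hint that a normal curve is only a rough picture of data on a bounded scale.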

Thanks for the speedy reply!

Thanks for the link – that is a really handy resource. I will definitely check out the SPRITE algorithm; I was wondering how you could simulate some datasets.

I appreciate your help.

Great! Thanks a lot! Really very helpful! I have been confused about these issues for a long time