DATAHOWLER is a believer in post-publication peer review, and so in this blog post we will dive into the controversy surrounding a 2013 PNAS study by Barbara Fredrickson and colleagues that showed an association between a particular type of psychological well-being (“eudaimonic well-being”) and gene expression profiles. In short, the study presented evidence that people “striving toward meaning and a noble purpose” in life showed genetic profiles more strongly associated with good physical health. Recently, Nick Brown and colleagues published a critique of this study, documenting a variety of apparent methodological and statistical flaws. The original authors responded in kind, cataloguing perceived errors in the Brown et al. analysis (see this response letter and PDF of presentation slides). Neuroskeptic has also jumped into the fray, bolstering Brown et al.’s case through additional simulations, and Nick Brown has offered further defense of his and his colleagues’ critique on his blog.

The criticisms and simulations of Brown et al. and Neuroskeptic are persuasive, but I think an even stronger case can be made against the original finding by examining the dataset itself and by walking though the steps in the original analysis. One of my main goals here is to cut through the maelstrom of technical discussion and lay bare the simple but critical flaw in Fredrickson et al.’s analysis: namely, *the **inappropriate use of a t-test*. Also, I will introduce a more formal test for statistical independence, and show that Fredrickson et al.’s gene expression data dramatically fails this test.

But merely showing that assumptions have been violated is not enough. Even if a particular analysis has a high false positive rate, this doesn’t imply that any given positive finding using that analysis is itself false. It is important to see whether the original finding stands or falls when a more defensible analysis is used. As we will see below, analyzing the data in the right way gets rid of the evidence for any association between eudaimonic well-being and genomics.

If you want to follow along or reproduce my analyses, all of the R functions are freely available in the package `RR53` on github. The script for the particular analyses presented here are included at the bottom of this blog post. (In passing, I would like to thank Hadley Wickham and his fabulous `devtools,` which made the bundling and sharing of my code extremely easy. Stefan Milton Bache also gets a shout-out for introducing the brilliant pipe operator `%>%`

into R, which you’ll see a lot of in my scripts.) To install the `RR53` package, use these commands:

`library(devtools)`

install_github("RR53","dalejbarr")

Note that in addition to `devtools`

, you will also need to have the `dplyr`

and `tidyr`

packages installed. I will be working with Brown et al.’s version of Fredrickson et al.’s dataset, which is available as a CSV file on the PNAS website.

## Structure of the Fredrickson et al. dataset

The dataset, schematized above, involved two matrices of data: a matrix of gene expression data and a matrix of psychometric data (and control variables). Each row in each matrix represents a separate subject (also indicated by color, for reasons that will become clearer below); each column is a separate measurement.

From the blood samples of 80 adult subjects the researchers obtained measurements of the transcription profiles of 53 different genes. Each gene is a column in the left matrix, and is represented by a different symbol (e.g., box, triangle, diamond).

The main hypothesis was that people higher in *eudaimonic well-being* would show gene transcription profiles associated with better physical health. Fredrickson et al. characterize eudaimonic well-being as “striving toward meaning and a noble purpose beyond simple self-gratification,” which is contrasted with *hedonic well-being*, “the sum of an individual’s positive experiences” (p. 13684). To measure these two constructs, the same subjects filled out a 14-item questionnaire, which were then broken down into two factors, each represented by a single variable in the analysis (*E* and *H* in the figure). The authors also took 15 other demographic and psychological measurements (age, sex, depression, etc.) which would serve as control variables in the analysis (variables *I* through *W *in the figure).

## The “RR53” analysis procedure

Fredrickson et al. perfomed an analysis to test for an association between eudaimonic well-being (variable *E*) and participants’ gene expression profiles. The more statistically-minded readers will recognize this as a problem demanding some kind of multivariate analysis: Unlike the usual case in which you have just one single response variable, you have 53 of them, and you would like to regress this 53-element vector on a set of 17 predictor variables. Fredrickson et al. opted to treat this multivariate problem as a set of 53 independent univariate problems. I will follow Brown et al.’s lead in calling this univariate strategy RR53, for “Repeated Regression x53.” The procedure is schematized below:

The authors individually regressed the expression data for each of the 53 genes (each column in the left matrix) on the two positivity factors plus a set of control variables (the right matrix). The 53 regression formulae are shown in the middle panel of the above figure. The regressions estimated the critical coefficients β1 and β2, associated with the eudaimonic and hedonic factors respectively. For simplicity, we will only focus on the eudaimonic coefficients (the β1s); but keep in mind that the same procedure was done with the hedonic coefficients (the β2s).

The 53 regressions yielded 53 different estimates of the association between eudaimonic well-being and gene activity, which were then combined into a mean value (bottom panel of the figure; note that the β values were contrast coded before being combined, which is not represented in the figure). The authors then used a one-sample t-test to test the null hypothesis that the mean of β1 was non-zero (lower panel of the above figure). The standard errors going into the denominator of the t-statistic were obtained by bootstrap resampling. They compared the observed statistic to a t-distribution with 52 degrees of freedom, and rejected the null with p=.0045.

## What is the problem with this analysis?

A t-test is a univariate parametric test, and parametric tests depend on the assumption that the individual observations are “i.i.d.”: *independently and identically distributed.* But the individual βs are only independent if the analyses used to estimate them are also all independent. If there is any level of correlation among the activity levels of the 53 genes then the βs will also be correlated, because we are regressing them on the exact same predictor values in each of the 53 analyses. (Using bootstrap resampling to estimate the standard errors for the t-test doesn’t solve the problem, because the values being resampled are themselves not independent; the non-independence arises a step before this, in the 53 individual regressions.)

How can we find out whether the activity levels of the genes are correlated? We can derive a correlation matrix that captures all of the pairwise correlations in the activity profiles of our 53 genes (the `R`

function `cor()`

does this). As we are interested in the *strength* of these correlations and not in their direction, we take their absolute values and then calculate a mean. Let’s call this statistic the Mean Absolute Correlation (MAC). The MAC value for the Fredrickson et al. data was .2511. But is this high enough to worry about? How do we know whether it’s higher than what we would expect for independent data?

One appealing way to test this is to use a *resampling* technique in which you transform the data to derive a distribution under the null hypothesis that the data are independent. The gene expression data form a matrix with 80 rows (one for each subject) and 53 columns (one for each gene expression). Thus, a given subject’s gene expression profile is a 53-element vector, with the order of the elements corresponding to the “original” labeling of the genes. If we want to get a sense for the magnitude of correlation we would obtain by chance just from data such as these, we can simply shuffle the elements of each of the row vectors. Our resampling analysis will use the following Monte Carlo procedure (expressed in pseudocode):

repeat many times { foreach subject_row { randomly rearrange the columns } re-calculate the MAC value and store it }

The shuffling of the labels breaks the correlations, and allows us to estimate the distribution of MAC values under the null hypothesis of independence. To the extent the null hypothesis is false, Fredrickson et al.’s data should appear as an outlier.

Fredrickson et al.’s genetic data fail this test in a dramatic way (see above figure). Our MAC value of .2511 for the original data was *never exceeded* in 10000 Monte Carlo simulations, providing extremely strong evidence against the null hypothesis of independence (p<.0001).

## How should the authors have analyzed the data instead?

The presence of the correlations mean that it is not possible to break the multivariate problem down into 53 independent univariate problems. The simulations by Brown et al. and by Neuroskeptic show very clearly that RR53’s violation of statistical independence lead to an alarmingly high false positive rate. What would be a better way to analyze the data?

One possibility is to collapse all of the 53 genetic observations for each subject together into a single mean, and then use simple univariate regression. But this analysis might disadvantage the researchers’ hypothesis, because information could get lost in the aggregation.

There is another technique called *multivariate multiple regression* for dealing with more than one response variable, but I’ve never used it myself, and I get the sense that few other people actually know about it or use it either (the Wikipedia entry on the topic is currently a blank page). A multivariate multiple regression would probably be impractical anyway because it would require the estimation of a very large number of parameters whose dependency structure is unknown.

Fortunately, situations in which it is difficult or impractical to apply a parametric approach are usually the same situations in which nonparametric approaches shine. We can analyze the data using a permutation test, by which we construct a null hypothesis distribution for our test statistic under random transformations to the data itself. Permutation tests, unlike parametric tests, do not require observations to be independent; they make the less stringent assumption that the units being relabeled are *exchangeable* under the null. Essentially this means that the joint probability distribution of the variables is invariant under all different possible re-labelings of the data (see Wikipedia for a more technical definition).

Under the null hypothesis that eudaimonic/hedonic well-being has no relationship to gene expression profiles, the connection between each subject’s psychometric data and gene expression data is arbitrary; any test statistic we observe is just noise. We can get an estimate for the distribution of this noise by re-calculating the test statistic a large number of times, each time randomly pairing rows from the gene matrix with rows from the psychometric matrix (see figure below, where each subject is a single row and is represented by a single color.) Note that the dependency structure of the gene data is fully controlled for in this analysis, because each row of genetic data remains intact regardless of what row it gets connected to in the psychometric matrix.

An advantage to this analysis is that we can do the exact same RR53 analysis that Fredrickson et al. did, up to the point where they derived their t test statistic. Thus, we will be using as our test statistic the very same t-value from the original RR53 analysis. But instead of comparing it to the t distribution with 52 degrees of freedom, we will generate a new distribution for this test statistic by re-calculating the statistic many times after randomly reconnecting the rows across the matrices.

But before we can do this analysis, there is one minor problem we have to address—what do we with the 15 control variables? Randomizing the connections between the rows of the matrices requires these connections being exchangeable under the null. Although the connections between *E* and *H* and the gene data are exchangeable, it is possible that relationships exist between the gene data and the control variables (which include things such as age, sex, smoking, illness, etc.), thus violating exchangeability.

We can solve this problem by residualizing the genetic data on the control variables: that is we regress the data for each gene on the control variables and replace the raw genetic data with the residuals. The links between the residuals (the genetic data *after controlling for variables I–W)* and the psychometric predictor variables *E* and *H* are exchangeable under the null. We re-pair the rows from the two matrices a large number of times (10,000 in the current case), re-computing the t-value after each, which gives us a stable estimate of its null-hypothesis distribution.

To get a p-value, we need to compare our observed t-value of -3.242 to a null-hypothesis distribution (NHD) for the statistic. Fredrickson et al.’s approach, which wrongly assume independence of the 53 analyses, uses as its NHD the *t* distribution with 52 degrees of freedom (blue distribution in the figure). Comparing the observed test statistic to this distribution gives us a p-value of .003. In contrast, if we use as our NHD the one we derived by permuting the data (red distribution), which *does not* *assume statistical independence*, we obtain a p-value of .390. So, using an appropriate statistical analysis, the evidence for the link between eudaimonic well-being and gene expression disappears.

## Why did this happen, and what should be done?

Fredrickson et al. have thus far not responded to Brown et al.’s criticism that their analysis violated statistical independence assumptions. My analysis further strengthens Brown et al.’s case, showing clear, direct evidence for dependency in the genetic data. Moreover, my reanalysis goes beyond this to suggest that the original evidence for a relationship between eudaimonic well-being and gene expression is almost certainly a statistical artifact.

Fredrickson et al. noted in their reply to Brown et al. that they have replicated their findings with a larger sample. But without addressing the flaws in their original analysis, the only thing this could possibly show is the authors’ willingness to repeat their mistakes on an even larger data set.

It should have been clear to anyone with a modicum of statistical training that a t-test is an entirely inappropriate strategy for analyzing multivariate dependent data. This alone should have prevented the paper from being published. How did the editor and reviewers fail to recognize this? Clues can be found by reading the authors’ description of their RR53 procedure and results in the main text (page 13685):

General linear model analyses quantified the association between expression of each of the 53 CTRA contrast genes and levels of hedonic and eudaimonic well-being [each well-being dimension treated as a continuous measure and adjusted for correlation with the other dimension of well-being and for age, sex, race/ethnicity, body mass index (BMI), smoking, alcohol consumption, recent minor illness symptoms, and leukocyte subset prevalence; SI Methods]. Contrast coefficient-weighted association statistics were averaged to summarize the magnitude of association over the entire CTRA gene set. CTRA gene expression varied significantly as a function of eudaimonic and hedonic well-being (Fig. 2A). As expected based on the inverse association of eudaimonic well-being with depressive symptoms, eudaimonic well-being was associated with down-regulated CTRA gene expression (contrast, P = 0.0045). In contrast, CTRA gene expression was significantly up-regulated in association with increasing levels of hedonic well-being (P = 0.0047).

What we see here is a **lack of transparency in the reporting of methods and results.** The critical information that a one-sample t-test was performed is never mentioned in the main text; instead, it is buried in the supplementary materials*. *Departing from convention, the p-values are reported without the observed test statistics nor the degrees of freedom for the test. A reviewer who saw “t(52)=X.XX, p=.0045” instead of just “(contrast, p=.0045)”, would have known that a t-test was performed, that 52 degrees of freedom had been assumed, and therefore would be far more likely to realize that the analysis was inappropriate.

The controversy is not over yet, and it seems doubtful that Fredrickson and colleagues will be able to mount a credible defense against the criticism of statistical dependence. Time will tell whether the scientific record gets corrected. In the meantime, it is illustrative to compare the reaction of Fredrickson’s et al. to that of a different group authors whose PNAS paper also came under criticism for statistical flaws. As discussed on Retraction Watch, once these latter authors were made aware of problems with their analyses, they immediately retracted their article, stating:

Our reanalyses suggest that if adjustments for this confound are applied, the results for our top finding no longer reach experiment-wide significance. Therefore, we feel that the presented findings are not currently sufficiently robust to provide definitive support for the conclusions of our paper, and that an extensive reanalysis of the data is required. The authors have therefore unanimously decided to retract this paper at this time.

Reblogged this on Mark Baxter.

Great work Dale. This just further reinforces my view that we should abandon non-scripted statistical packages (e.g., point and click SPSS) for those that promote (or rather force) transparent data analyses.

Thank you! I agree with your view. Problems like this would also be much rarer and easier to detect if authors had to submit the code and data for their analyses along with every manuscript. It seems to me that our publishing practices are a holdover from times when datasets and analyses were much simpler and could be adequately described in a Method section. Those days are long gone!

It seems to me that in many cases of complex modelling, the reviewing process is mostly about determining whether the author has used (or programmed, eek!) the software correctly. For analyses done in SPSS (etc) then as a minimum the output .SPV file should be supplied to show the reviewer what steps were made. For an R program, the code and dataset are necessary, unless the reviewer is a genius at sight-read debugging.

And of course, a nice side-benefit of documenting your code and data well enough that the reviewer can reproduce what you did, is that anyone else can reproduce it too. Storage costs about $0.05 per gigabyte. Why are we even publishing in journals that don’t make it mandatory to provide data and code?

Actually, one might argue that this case doesn’t represent particularly complex modelling, but until the reviewer has worked out what the methods section says (cf. the extract you quoted from p. 13685), she may well assume that it is.

Very nice analysis showing the theory behind what we were only able to show via brute force enumeration, due to my limited experience of statistics. You have made an old man very happy. 🙂

Thanks!