The Innuendo Machine: Turning scientific lead into journalistic gold

There is an important distinction between saying that something could be true and saying that it is true. This distinction is so important that it is usually marked in a language’s grammar, in a feature known as grammatical mood. Consequently, it is nearly impossible to speak about anything without taking a stance on its existence. To speak in the realis mood is to make an assertion about the state of the world. When I say “pigs fly” (realis mood), I assert that the conditions that must hold to make this statement true actually do hold. To speak in the irrealis mood is to talk about how things could or might be, as when I say, “pigs could fly.” These irrealis statements have far less impact, because they commit me only to the possibility that something is true. It’s the realis statements that grab your attention.

There is a magical social process that takes carefully hedged irrealis assertions from the scientific realm and transmutes them into unqualified realis assertions in the journalistic realm, a transmuting of scientific lead into journalistic gold.  I refer to this process as the Innuendo Machine, because for the reader at the end of the chain, the realis assertion is only justified by innuendo: The effect is real because it was published in a top journal by scientists at top institutions; therefore the evidence must have met conventional standards of scientific rigor.  The media hype surrounding a recent paper gives us revealing glimpses into how this machine operates.

[Image: screenshot of the article’s title and abstract]

This paper, which suggests that ideology can be communicated to potential sexual partners through body odor, appeared in the American Journal of Political Science, widely recognized as one of the top journals in the field.

Note the odd irrealis mood in the title, committing only to the possibility that people screen ideological compatibility through body odor. But should top journals be in the business of reporting on things that could be possible? It could be possible that people suss out ideologically compatible mates based on the lengths of their eyelashes, or telepathically. Isn’t it the job of academic journals to tell us how things actually are in the world as we know it, and to reduce the uncertainty that stands between “might” and “is”?

In the abstract, the authors were careful to point out that they only “explored the possibility” of such a mechanism, and that they “observed that individuals are more attracted to their ideological concomitants [sic]”. Nowhere does one encounter any realis claim that people are attracted to mates of similar ideologies through body odor.

What is all this hedging about?  As some bloggers have already noted (here and here), the statistical evidence for the effect is very weak.  In the article itself, the authors do not report any p-values (except in the supplementary materials), only test statistics. And the main statistics for their findings from three analyses of the same data are t-values of 1.69, 1.48, and 1.45. The authors note that these t-values are in the right direction and are “consistent with their theoretical expectations.” These t-values correspond to two-tailed p-values of approximately .09, .14, and .15, or, to be extra generous, one-tailed values of .045, .070, and .075. So, the evidence for any relationship is extremely weak; but apparently, not too weak for the American Journal of Political Science.
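For anyone who wants to check the conversion, here it is in R. The degrees of freedom are not reported in the main text, so df = 100 is an assumption, purely for illustration; with any largish sample the p-values change very little:

tvals <- c(1.69, 1.48, 1.45)
# df = 100 is assumed for illustration; the article does not report the df.
2 * pt(tvals, df = 100, lower.tail = FALSE)  # two-tailed p-values
pt(tvals, df = 100, lower.tail = FALSE)      # one-tailed p-values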

So, how was this all reported in the media?  Here’s the Washington Post:

[Image: screenshot of the Washington Post headline]

Pure realis mood. All hedging and qualification expunged.

Welcome to the Innuendo Machine

Let’s retrace some of the steps by which the Innuendo Machine operates.

An article’s Discussion section is a useful incubator for realis assertions.  In McDermott et al., this is where we see the realis mood in its embryonic form (p. 5):

First, individuals find the smell of those who are more ideologically similar to themselves more attractive than those endorsing opposing ideologies…

Note that by saying “individuals find” instead of “individuals found,” the authors have stealthily set the stage for false generalization, implying that their effect holds beyond the individuals in the study to a potentially larger population.  The Discussion section is also a useful place to get your audience comfortable with the reality of the effect, by avoiding discussion of alternative explanations and possible weaknesses.  It is far more effective to engage your reader with language that presupposes the effect’s existence.  One strategy is to plant a few media-friendly, cherry-picked anecdotes (p. 6):

In one particularly illustrative case, a participant asked the experimenter if she could take one of the vials home with her because she thought it was “the best perfume I ever smelled”; the vial was from a male who shared an ideology similar to the evaluator. She was preceded by another respondent with an ideology opposite to the person who provided the exact same sample; this participant reported that that vial had “gone rancid” and suggested it needed to be replaced. In this way, different participants experienced the exact same stimulus in radically different ways only moments apart.

It is also useful to discuss how the presumed mechanism might operate in a complex world, which not only further presupposes the existence of the phenomenon, but also gives you the opportunity to cite prominent researchers in your field.

As editors and reviewers of such a paper, you might want to set aside any nitpicky qualms about extraordinary claims requiring extraordinary evidence, because by engaging in such skepticism you are probably discouraging creativity and squelching innovation.  No one likes a boring spoilsport!  Evaluate the paper on the basis of how potentially mind-blowing the idea would be (assuming it’s true), the reputation of the authors, and how quickly you would run down the hall to tell your colleagues about the finding (assuming it’s true).

Is the paper accepted yet?  It is?

Next stage: Press release!

The press release is where the rubber really hits the realis.  It is the key cog in the Innuendo Machine.  For publications in top journals, the press release is usually handled by the publisher (in this case, Wiley), which is typically more than willing to sex up the claims to a sufficient level of virality.  For instance, they might say that your finding suggests that

…one of the reasons why so many spouses share similar political views is because they were initially and subconsciously attracted to each other’s body odor.

Ah, “because”, beautiful because!  Ah, the allure of the subconscious mind!  Ah, the thousand domestic mini-dramas that will ensue among spouses with incompatible politics or incompatible body odors!

In contrast to the usual narrative, in which a scientist’s cautious claims are exaggerated by the media, here we see the exaggeration come about through the collusion of:

  • the authors (overgeneralizing claims in the article and in the press release)
  • the editor (giving the paper the journal’s seal of approval despite shaky evidence)
  • the publisher (sexing up the findings for public consumption)

But the Media Puts the Icing on the Cake

As noted above, none of the effects in the article reached statistical significance, at least as conventionally defined.  Gelman even praises the editor for not relying on conventional significance levels (though in fairness, he does think that the article is bunk).  But apparently, the media hasn’t gotten the memo about The New Statistics.  Here’s the Washington Post:

[Image: Washington Post excerpt describing the effect as “small but significant”]

I suppose that is correct, if by “significant” you mean p<.1, one-tailed.  Innuendo in action.


Anatomy of a statistical artifact: Eudaimonic well-being and genomics

DATAHOWLER is a believer in post-publication peer review, and so in this blog post we will dive into the controversy surrounding a 2013 PNAS study by Barbara Fredrickson and colleagues that showed an association between a particular type of psychological well-being (“eudaimonic well-being”) and gene expression profiles. In short, the study presented evidence that people “striving toward meaning and a noble purpose” in life showed genetic profiles more strongly associated with good physical health.  Recently, Nick Brown and colleagues published a critique of this study, documenting a variety of apparent methodological and statistical flaws. The original authors responded in kind, cataloguing perceived errors in the Brown et al. analysis (see this response letter and PDF of presentation slides). Neuroskeptic has also jumped into the fray, bolstering Brown et al.’s case through additional simulations, and Nick Brown has offered further defense of his and his colleagues’ critique on his blog.

The criticisms and simulations of Brown et al. and Neuroskeptic are persuasive, but I think an even stronger case can be made against the original finding by examining the dataset itself and by walking through the steps in the original analysis.  One of my main goals here is to cut through the maelstrom of technical discussion and lay bare the simple but critical flaw in Fredrickson et al.’s analysis: namely, the inappropriate use of a t-test.  Also, I will introduce a more formal test for statistical independence, and show that Fredrickson et al.’s gene expression data dramatically fail this test.

But merely showing that assumptions have been violated is not enough. Even if a particular analysis has a high false positive rate, this doesn’t imply that any given positive finding using that analysis is itself false.  It is important to see whether the original finding stands or falls when a more defensible analysis is used. As we will see below, analyzing the data in the right way gets rid of the evidence for any association between eudaimonic well-being and genomics.

If you want to follow along or reproduce my analyses, all of the R functions are freely available in the package RR53 on github.  The script for the particular analyses presented here is included at the bottom of this blog post.  (In passing, I would like to thank Hadley Wickham and his fabulous devtools, which made the bundling and sharing of my code extremely easy. Stefan Milton Bache also gets a shout-out for introducing the brilliant pipe operator %>% into R, which you’ll see a lot of in my scripts.)   To install the RR53 package, use these commands:

library(devtools)
install_github("dalejbarr/RR53")  # install the RR53 package from GitHub

Note that in addition to devtools, you will also need to have the dplyr and tidyr packages installed.  I will be working with Brown et al.’s version of Fredrickson et al.’s dataset, which is available as a CSV file on the PNAS website.
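To load the packages and the data up front, something like the following should work; the CSV filename here is a placeholder, so substitute the name of the file you downloaded from the PNAS website:

library(dplyr)
library(tidyr)
library(RR53)
# Placeholder filename: use the actual name of the CSV from PNAS.
fred <- read.csv("fredrickson_data.csv")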

Structure of the Fredrickson et al. dataset

[Figure: schematic of the dataset, showing a matrix of gene expression data (one column per gene) alongside a matrix of psychometric and control variables, with one row per subject]

The dataset, schematized above, involved two matrices of data: a matrix of gene expression data and a matrix of psychometric data (and control variables). Each row in each matrix represents a separate subject (also indicated by color, for reasons that will become clearer below); each column is a separate measurement.

From the blood samples of 80 adult subjects, the researchers obtained measurements of the transcription profiles of 53 different genes. Each gene is a column in the left matrix and is represented by a different symbol (e.g., box, triangle, diamond).

The main hypothesis was that people higher in eudaimonic well-being would show gene transcription profiles associated with better physical health.  Fredrickson et al. characterize eudaimonic well-being as “striving toward meaning and a noble purpose beyond simple self-gratification,” which is contrasted with hedonic well-being, “the sum of an individual’s positive experiences” (p. 13684).  To measure these two constructs, the same subjects filled out a 14-item questionnaire, the items of which were then broken down into two factors, each represented by a single variable in the analysis (E and H in the figure).  The authors also took 15 other demographic and psychological measurements (age, sex, depression, etc.), which would serve as control variables in the analysis (variables I through W in the figure).

The “RR53” analysis procedure

Fredrickson et al. performed an analysis to test for an association between eudaimonic well-being (variable E) and participants’ gene expression profiles. The more statistically minded readers will recognize this as a problem demanding some kind of multivariate analysis: unlike the usual case in which you have a single response variable, here you have 53 of them, and you would like to regress this 53-element vector on a set of 17 predictor variables. Fredrickson et al. opted to treat this multivariate problem as a set of 53 independent univariate problems.  I will follow Brown et al.’s lead in calling this univariate strategy RR53, for “Repeated Regression x53.”  The procedure is schematized below:

[Figure: schematic of the RR53 procedure: 53 separate regressions (middle panel), whose eudaimonic coefficients are averaged and submitted to a one-sample t-test (bottom panel)]

The authors individually regressed the expression data for each of the 53 genes (each column in the left matrix) on the two positivity factors plus a set of control variables (the right matrix). The 53 regression formulae are shown in the middle panel of the above figure.  The regressions estimated the critical coefficients β1 and β2, associated with the eudaimonic and hedonic factors respectively. For simplicity, we will only focus on the eudaimonic coefficients (the β1s); but keep in mind that the same procedure was done with the hedonic coefficients (the β2s).

The 53 regressions yielded 53 different estimates of the association between eudaimonic well-being and gene activity, which were then combined into a mean value (bottom panel of the figure; note that the β values were contrast-coded before being combined, which is not represented in the figure).  The authors then used a one-sample t-test to test the null hypothesis that the mean of the β1s was zero.  The standard errors going into the denominator of the t-statistic were obtained by bootstrap resampling.  They compared the observed statistic to a t-distribution with 52 degrees of freedom, and rejected the null with p=.0045.
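To make the procedure concrete, here is a minimal sketch of the RR53 logic in R. The names gene_mat (the 80 × 53 expression matrix) and preds (a data frame holding E, H, and the 15 control variables) are my own placeholders, and the contrast-coding step is omitted:

# One regression per gene, collecting the eudaimonic coefficient (beta1).
b1 <- apply(gene_mat, 2, function(g) {
  coef(lm(g ~ ., data = preds))["E"]
})
t.test(b1)  # one-sample t-test of the 53 coefficients against zero (df = 52)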

What is the problem with this analysis?

A t-test is a univariate parametric test, and parametric tests depend on the assumption that the individual observations are “i.i.d.”: independently and identically distributed. But the individual βs are independent only if the data going into the 53 regressions are themselves independent. If there is any correlation among the activity levels of the 53 genes, then the βs will also be correlated, because we are regressing them on the exact same predictor values in each of the 53 analyses.  (Using bootstrap resampling to estimate the standard errors for the t-test doesn’t solve the problem, because the values being resampled are themselves not independent; the non-independence arises a step earlier, in the 53 individual regressions.)

How can we find out whether the activity levels of the genes are correlated?  We can derive a correlation matrix that captures all of the pairwise correlations in the activity profiles of our 53 genes (the R function cor() does this). As we are interested in the strength of these correlations and not in their direction, we take their absolute values and then calculate a mean.  Let’s call this statistic the Mean Absolute Correlation (MAC). The MAC value for the Fredrickson et al. data was .2511. But is this high enough to worry about? How do we know whether it’s higher than what we would expect for independent data?
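Computing the MAC takes only a few lines of R; as before, gene_mat stands in for the 80 × 53 expression matrix:

# Mean Absolute Correlation (MAC): average absolute pairwise correlation
# among the 53 gene expression columns.
mac <- function(m) {
  cmat <- cor(m)                    # 53 x 53 correlation matrix
  mean(abs(cmat[lower.tri(cmat)]))  # each pair counted once; diagonal dropped
}
mac(gene_mat)  # .2511 for the Fredrickson et al. data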

One appealing way to test this is to use a resampling technique in which you transform the data to derive a distribution under the null hypothesis that the data are independent. The gene expression data form a matrix with 80 rows (one for each subject) and 53 columns (one for each gene). Thus, a given subject’s gene expression profile is a 53-element vector, with the order of the elements corresponding to the “original” labeling of the genes. If we want to get a sense of the magnitude of correlation we would obtain by chance from data such as these, we can simply shuffle the elements of each of the row vectors.  Our resampling analysis uses the following Monte Carlo procedure, sketched here in R using the mac() function defined above:

n_mc <- 10000
null_mac <- replicate(n_mc, {
  # shuffle each subject's row independently, breaking cross-gene correlations
  shuffled <- t(apply(gene_mat, 1, sample))
  mac(shuffled)  # recompute and store the MAC for the shuffled data
})

The shuffling of the labels breaks the correlations, and allows us to estimate the distribution of MAC values under the null hypothesis of independence.  To the extent the null hypothesis is false, Fredrickson et al.’s data should appear as an outlier.


[Figure: histogram of MAC values from 10,000 Monte Carlo runs under the null hypothesis of independence; the observed MAC of .2511 lies far outside the distribution]

Fredrickson et al.’s genetic data fail this test in a dramatic way (see above figure).  Our MAC value of .2511 for the original data was never exceeded in 10,000 Monte Carlo simulations, providing extremely strong evidence against the null hypothesis of independence (p<.0001).

How should the authors have analyzed the data instead?

The presence of these correlations means that it is not possible to break the multivariate problem down into 53 independent univariate problems.  The simulations by Brown et al. and by Neuroskeptic show very clearly that RR53’s violation of statistical independence leads to an alarmingly high false positive rate. What would be a better way to analyze the data?

One possibility is to collapse all of the 53 genetic observations for each subject together into a single mean, and then use simple univariate regression. But this analysis might disadvantage the researchers’ hypothesis, because information could get lost in the aggregation.
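As a quick sketch of that alternative, reusing the placeholder objects from above:

# Collapse each subject's 53 expression values into a single mean,
# then run one ordinary regression with the full predictor set.
gene_mean <- rowMeans(gene_mat)
summary(lm(gene_mean ~ ., data = preds))  # inspect the coefficient for E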

There is another technique called multivariate multiple regression for dealing with more than one response variable, but I’ve never used it myself, and I get the sense that few other people actually know about it or use it either (the Wikipedia entry on the topic is currently a blank page). A multivariate multiple regression would probably be impractical anyway because it would require the estimation of a very large number of parameters whose dependency structure is unknown.

Fortunately, situations in which it is difficult or impractical to apply a parametric approach are usually the same situations in which nonparametric approaches shine. We can analyze the data using a permutation test, in which we construct a null-hypothesis distribution for our test statistic under random transformations of the data itself. Permutation tests, unlike parametric tests, do not require observations to be independent; they make the less stringent assumption that the units being relabeled are exchangeable under the null. Essentially, this means that the joint probability distribution of the variables is invariant under all possible re-labelings of the data (see Wikipedia for a more technical definition).

Under the null hypothesis that eudaimonic/hedonic well-being has no relationship to gene expression profiles, the connection between each subject’s psychometric data and gene expression data is arbitrary; any test statistic we observe is just noise. We can get an estimate of the distribution of this noise by re-calculating the test statistic a large number of times, each time randomly pairing rows from the gene matrix with rows from the psychometric matrix (see the figure below, where each subject is a single row, represented by a single color).  Note that the dependency structure of the gene data is fully controlled for in this analysis, because each row of genetic data remains intact regardless of what row it gets connected to in the psychometric matrix.

[Figure: schematic of the permutation scheme: rows of the gene matrix randomly re-paired with rows of the psychometric matrix (colors indicate subjects)]

An advantage of this analysis is that we can run the exact same RR53 analysis that Fredrickson et al. did, up to the point where they derived their t statistic. Thus, we will use as our test statistic the very same t-value from the original RR53 analysis. But instead of comparing it to the t distribution with 52 degrees of freedom, we will generate a new distribution for this test statistic by re-calculating it many times after randomly reconnecting the rows across the matrices.

But before we can do this analysis, there is one minor problem we have to address: what do we do with the 15 control variables?  Randomizing the connections between the rows of the matrices requires that these connections be exchangeable under the null.  Although the connections between E and H and the gene data are exchangeable, it is possible that relationships exist between the gene data and the control variables (which include things such as age, sex, smoking, and illness), thus violating exchangeability.

We can solve this problem by residualizing the genetic data on the control variables: that is, we regress the data for each gene on the control variables and replace the raw genetic data with the residuals. The links between the residuals (the genetic data after controlling for variables I–W) and the psychometric predictor variables E and H are exchangeable under the null. We re-pair the rows from the two matrices a large number of times (10,000 in the current case), re-computing the t-value after each re-pairing, which gives us a stable estimate of its null-hypothesis distribution.
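Here is a rough sketch of the whole procedure, continuing with placeholder names: controls holds the 15 control variables, psych holds E and H, and rr53_t() is a hypothetical stand-in for whatever function re-runs the RR53 analysis and returns its t-statistic (see the appendix for the actual code):

# Residualize each gene on the 15 control variables.
resid_mat <- apply(gene_mat, 2, function(g) resid(lm(g ~ ., data = controls)))

# Permutation test: randomly re-pair rows of the residualized gene matrix
# with rows of the psychometric matrix, re-computing the t-statistic each time.
perm_t <- replicate(10000, {
  rr53_t(resid_mat[sample(nrow(resid_mat)), ], psych)
})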


[Figure: the observed t-value of -3.242 compared against two null-hypothesis distributions: the t distribution with 52 degrees of freedom (blue) and the permutation-derived distribution (red)]

To get a p-value, we need to compare our observed t-value of -3.242 to a null-hypothesis distribution (NHD) for the statistic.  Fredrickson et al.’s approach, which wrongly assumes independence of the 53 analyses, uses as its NHD the t distribution with 52 degrees of freedom (blue distribution in the figure).  Comparing the observed test statistic to this distribution gives us a p-value of .003.  In contrast, if we use as our NHD the one we derived by permuting the data (red distribution), which does not assume statistical independence, we obtain a p-value of .390.  So, using an appropriate statistical analysis, the evidence for the link between eudaimonic well-being and gene expression disappears.
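In code, the permutation p-value is simply the proportion of the 10,000 resampled statistics that are at least as extreme as the observed one (two-tailed), using perm_t from the sketch above:

t_obs <- -3.242
mean(abs(perm_t) >= abs(t_obs))  # ~ .390, versus .003 under the t(52) reference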

Why did this happen, and what should be done?

Fredrickson et al. have thus far not responded to Brown et al.’s criticism that their analysis violated statistical independence assumptions. My analysis further strengthens Brown et al.’s case, showing clear, direct evidence for dependency in the genetic data.  Moreover, my reanalysis goes beyond this to suggest that the original evidence for a relationship between eudaimonic well-being and gene expression is almost certainly a statistical artifact.

Fredrickson et al. noted in their reply to Brown et al. that they have replicated their findings with a larger sample.  But without addressing the flaws in their original analysis, the only thing this could possibly show is the authors’ willingness to repeat their mistakes on an even larger data set.

It should have been clear to anyone with a modicum of statistical training that a t-test is an entirely inappropriate strategy for analyzing multivariate dependent data. This alone should have prevented the paper from being published.   How did the editor and reviewers fail to recognize this?  Clues can be found by reading the authors’ description of their RR53 procedure and results in the main text (page 13685):

General linear model analyses quantified the association between expression of each of the 53 CTRA contrast genes and levels of hedonic and eudaimonic well-being [each well-being dimension treated as a continuous measure and adjusted for correlation with the other dimension of well-being and for age, sex, race/ethnicity, body mass index (BMI), smoking, alcohol consumption, recent minor illness symptoms, and leukocyte subset prevalence; SI Methods]. Contrast coefficient-weighted association statistics were averaged to summarize the magnitude of association over the entire CTRA gene set. CTRA gene expression varied significantly as a function of eudaimonic and hedonic well-being (Fig. 2A). As expected based on the inverse association of eudaimonic well-being with depressive symptoms, eudaimonic well-being was associated with down-regulated CTRA gene expression (contrast, P = 0.0045).  In contrast, CTRA gene expression was significantly up-regulated in association with increasing levels of hedonic well-being (P = 0.0047).

What we see here is a lack of transparency in the reporting of methods and results.  The critical information that a one-sample t-test was performed is never mentioned in the main text; instead, it is buried in the supplementary materials.  Departing from convention, the p-values are reported without either the observed test statistics or the degrees of freedom for the test.  A reviewer who saw “t(52)=X.XX, p=.0045” instead of just “(contrast, p=.0045)” would have known that a t-test was performed and that 52 degrees of freedom had been assumed, and would therefore have been far more likely to realize that the analysis was inappropriate.

The controversy is not over yet, and it seems doubtful that Fredrickson and colleagues will be able to mount a credible defense against the criticism of statistical dependence.  Time will tell whether the scientific record gets corrected.  In the meantime, it is instructive to compare the reaction of Fredrickson et al. to that of a different group of authors whose PNAS paper also came under criticism for statistical flaws.  As discussed on Retraction Watch, once these latter authors were made aware of problems with their analyses, they immediately retracted their article, stating:

Our reanalyses suggest that if adjustments for this confound are applied, the results for our top finding no longer reach experiment-wide significance. Therefore, we feel that the presented findings are not currently sufficiently robust to provide definitive support for the conclusions of our paper, and that an extensive reanalysis of the data is required. The authors have therefore unanimously decided to retract this paper at this time.

Appendix: Code for the analysis