No more excuses: R is better than SPSS for psychology undergrads, and students agree

Starting this academic year (2016/7), all of our stats teaching in Psychology at the University of Glasgow will use R and RStudio instead of Excel and SPSS.

When we proposed the transition, some staff worried that the scripting nature of R would be too challenging for incoming students, many of whom would be starting the program with little or no programming experience. Colleagues leading similar transitions at other universities have told me that they faced the same pessimism from their teaching staff.

I did not share these doubts, because for six years I have given students the choice between R and SPSS in my level 3 course, and those who chose R and RStudio were mostly able to get up and running on their own, with little guidance from me beyond links to online tutorials. To be sure, students found it challenging at first, but these were students who had just spent two years of the program working in Excel and SPSS and had become accustomed to the point-and-click nature of those programs. The truth is that you aren’t really exposed to the quirks of R until you start using it for data wrangling, which generally isn’t part of the psychology stats curriculum anyway (but should be!). If all you are doing is plugging in pre-formatted, pre-cleaned, canned datasets and cranking out a t-test or an ANOVA and maybe a bar graph, the software you use does not make a huge difference. But if that is all you are teaching, your students will be ill-prepared when they first encounter their own messy datasets.

 

This kind of teaching should be a relic of the past, and this is one of the many reasons we decided to move to R: we thought that R’s more interactive, transparent, and reproducible approach to data analysis would be better for learning, and we also wanted to incorporate more data science skills into our curriculum.

We recently piloted some of our new R materials on a group of undergraduates who already had some exposure to SPSS in previous years, which allowed them to compare the two platforms.  Bear in mind that these students had no previous experience with R before the piloting day, and had not yet been subject to my annual brainwashing lecture on the benefits of R at the start of year three.

One of the questions we asked at the end of the session was: “Did you understand how to use the software?” 12/13 students said “yes”, with the remaining student saying “yes and no; with some practice, it should be OK.”

The students’ comments are very illuminating, and I will let them speak for themselves. They have not been edited or cherry-picked.

I hope they encourage other departments to make the transition to R.

R generally is fascinating, and I can absolutely see the benefits of it. I am sure it will make students more critically aware about what they’re doing instead of just following SPSS drop-down menus.

 

R is very easy to use. Typing in code helps with understanding.

 

Already find R much better than SPSS because I feel much more in control and enjoy having a clear oversight over what I’ve been doing, while SPSS just felt like I was clicking random buttons and looking for numbers to write down.

 

I find it more engaging than SPSS.

 

You are able to see exactly what is happening to the stats, you can change the graphs around and if you add a sort of interactive feature, e.g., change the error bars, colour the graph, etc., you can start to see how the code works.

 

You can edit/manipulate the data much easier than with SPSS.  It forced you to engage with it.

 

Different things are on the screen at the same time so you get a complete overview of your stats.  It would help if most relevant results were written out in bold; volume of text can be a bit overwhelming, but I guess that’s how the software is.

 

If you know what to look for, it can really help with solidifying mechanics behind stats.

 

I found the experience very exciting and going through it with the teaching assistant was extremely helpful. I personally prefer R to SPSS and can’t wait to learn it and be able to run it. R looks more timesaving and less confusing than SPSS.

 

Better than pointing and clicking. Coding is much more flexible. But the results of the tests look a bit more confusing than with SPSS.


What the world needs now: Even more R/RStudio instructional videos

Backstory

You might be interested in the backstory for these two R/RStudio instructional videos I created, which gives some insight into our department’s recent transition to teaching stats using R/RStudio instead of Excel/SPSS (if not, the TL;DR for this section is that these two videos were born of frustration, last-minute panic, and pedagogical role-modeling by my son).

Starting in the 2016/7 academic year, all our statistics instruction in psychology is leaving Excel/SPSS behind and moving on to R/RStudio.  We chose R because we want our students to learn how to make their data analyses reproducible.  The first year of our program will now be devoted to developing basic data science skills: loading in different kinds of datasets, tidying them, merging them with others, visualizing distributions, calculating basic descriptive statistics, and generating reports in RMarkdown.  We needed some kind of tutorial on interacting with R/RStudio that incoming students could work through at their own pace.

We also needed some additional training materials for our teaching staff.  We set aside a day before the start of term on which we would pilot our new lab materials with students, almost none of whom had encountered R/RStudio before. The students would therefore need lots of support to get through the exercises.  But in spite of our best efforts to get teaching staff sufficiently trained up, as the pilot day approached, many were still anxious about having to help students use software they themselves were still coming to grips with.

I had developed a series of web-based, step-by-step walkthrough documents for our incoming students, which I sent out to some of the staff to try out. Although staff politely expressed gratitude for my efforts, I think they found the documents overwhelming. Some complained that they took far too much time to get through (and, by the way, were also missing important information). So clearly the format was not working.

While I had been working on these materials, my son had been spending his last days of summer vacation producing videogame walkthroughs (BTW, if you’re looking for cool Minecraft videos, this 11-year-old has got your back).  So 24 hours before piloting day, with staff panic approaching meltdown levels, I realized through my son’s example that the pedagogical medium I needed all along was video (DUH, Dad!).

 

I needed a video, and I needed it quickly.  I did not have hours to spend trawling YouTube for the exact video that would suit my needs, and I quickly realized that it would probably take me less time to make my own than to review the many hours of instructional videos already available, if I limited myself to one take.  After all, wasn’t Sister Ray cut in a single take?  So I thought up an analysis task, launched the video capture software, and hoped for the best.  Judge accordingly.

The staff and students were happy with the result, so I decided to make the videos public.  Hopefully others will find these introductory videos useful, especially those just starting out in R.

The videos

The videos provide a demo of R/RStudio in the context of an analysis of Scottish babynames.  I had three goals:

  • Dazzle students with some R black magic so that they get excited about its possibilities, while still giving them the basics;
  • Provide a model of how to interact with R/RStudio in the context of a well-defined analysis task;
  • Choose an analysis project that would be fun and personally relevant to our incoming students.  I had been playing around with the babynames package for one of our homework assignments, and in the process had discovered that the National Records of Scotland has a similar database in CSV format.

I did the analysis twice, once as an R script and once as part of an RMarkdown report, and decided to split the video into two parts.

Video 1: Basic interaction with RStudio, developing an R script

Yes, around a minute of the first video involves me awkwardly watching the readr package installation process, wishing I knew some good R jokes or had some background music to make the time pass more quickly.

Yes, in the video I accidentally reveal that I had recently been using R to make chickens talk.

Yes, color is probably not the best way to differentiate the names in the graphs.

Yes, at no point in the video do I appear to realize that the trends I am looking at in the videos largely reflect the statistical phenomenon of regression toward the mean, in spite of having published on this very topic.  But at least it did remind me after the fact that we need to discuss this phenomenon somewhere in our curriculum!

The R script for this analysis:
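(The embedded script itself is not reproduced here; the sketch below only gives the flavor of the analysis shown in the video. The file name, the columns FirstForename, yr, and number, and the plotted names are all hypothetical, and the wrangling and plotting use dplyr and ggplot2.)

library(readr)
library(dplyr)
library(ggplot2)

# read the National Records of Scotland babynames data (hypothetical file/column names)
babynames <- read_csv("babies_first_names_all_years.csv")

# plot the trend over time for a few illustrative names
babynames %>%
  filter(FirstForename %in% c("Jack", "Olivia", "Emily")) %>%
  ggplot(aes(yr, number, colour = FirstForename)) +
  geom_line()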

Video 2: RMarkdown and knitting an HTML report

This second part of the video reproduces the analysis in an RMarkdown document and shows how to compile an HTML report.

And the RMarkdown document (click “View Raw” at the bottom right of the window for the RMarkdown source)

The Innuendo Machine: Turning scientific lead into journalistic gold

There is an important distinction between saying that something could be true and saying that it is true. This distinction is so important that it is usually marked in a language’s grammar, in a feature known as grammatical mood. Consequently, it is nearly impossible to speak about anything without taking a stance on its existence. To speak in the realis mood is to make an assertion about the state of the world. When I say, “pigs fly” (realis mood), I assert that the conditions that must hold to make this statement true actually do hold.  To speak in the irrealis mood is to talk about how things could or might be, as when I say, “pigs could fly.”  These latter irrealis statements have far less impact, because they only commit me to the possibility that something is true.  It’s the realis statements that grab your attention.

There is a magical social process that takes carefully hedged irrealis assertions from the scientific realm and transmutes them into unqualified realis assertions in the journalistic realm, a transmuting of scientific lead into journalistic gold.  I refer to this process as the Innuendo Machine, because for the reader at the end of the chain, the realis assertion is only justified by innuendo: The effect is real because it was published in a top journal by scientists at top institutions; therefore the evidence must have met conventional standards of scientific rigor.  The media hype surrounding a recent paper gives us revealing glimpses into how this machine operates.

[Screenshot of the article’s title and abstract]

This paper, which suggests that ideology can be communicated to potential sexual partners through body odor, appeared in the American Journal of Political Science, widely recognized as one of the top journals in the field.

Note the odd irrealis mood in the title, committing only to the possibility that people screen ideological compatibility through body odor.  But should top journals be in the business of reporting on things that could be possible?  It could be possible that people suss out ideologically compatible mates based on the lengths of their eyelashes, or telepathically.  Isn’t it the job of academic journals not merely to report on what might be possible in the world as we know it, but to reduce the uncertainty that stands between “might” and “is”?

In the abstract, the authors were careful to point out that they only “explored the possibility” for such a mechanism, and that they “observed that individuals are more attracted to their ideological concomitants [sic]”. Nowhere does one encounter any realis claim that people are attracted to mates of similar ideologies through body odor.

What is all this hedging about?  As some bloggers have already noted (here and here), the statistical evidence for the effect is very weak.  In the article itself, the authors do not report any p-values (except in the supplementary materials), only test statistics. And the main statistics for their findings from three analyses of the same data are t-values of 1.69, 1.48, and 1.45. The authors note that these t-values are in the right direction and are “consistent with their theoretical expectations.” These t-values correspond to two-tailed p-values of approximately .09, .14, and .15, or, to be extra generous, one-tailed values of .045, .070, and .075. So, the evidence for any relationship is extremely weak; but apparently, not too weak for the American Journal of Political Science.
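For reference, here is a quick back-of-the-envelope check of those conversions in R, using a large-sample normal approximation since the degrees of freedom are not reported in the main text:

t_vals <- c(1.69, 1.48, 1.45)
2 * pnorm(-t_vals)  # two-tailed: roughly .09, .14, and .15
pnorm(-t_vals)      # one-tailed: roughly .045, .07, and .075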

So, how was this all reported in the media?  Here’s the Washington Post:

[Screenshot of the Washington Post headline]

Pure realis mood. All hedging and qualification expunged.

Welcome to the Innuendo Machine

Let’s retrace some of the steps by which the Innuendo Machine operates.

An article’s Discussion section is a useful incubator for realis assertions.  In McDermott et al., this is where we see realis in its embryonic form (p. 5):

First, individuals find the smell of those who are more ideologically similar to themselves more attractive than those endorsing opposing ideologies…

Note that by saying “individuals find” instead of “individuals found,” the authors have stealthily set the stage for false generalization, implying that their effect holds beyond the individuals in the study to a potentially larger population.  The Discussion section is also a useful place to get your audience comfortable with the reality of the effect, by avoiding discussion of alternative explanations and possible weaknesses.  It is far more effective to engage your reader with language that presupposes the effect’s existence.  One strategy is to plant a few media-friendly, cherry-picked anecdotes (p. 6):

In one particularly illustrative case, a participant asked the experimenter if she could take one of the vials home with her because she thought it was “the best perfume I ever smelled”; the vial was from a male who shared an ideology similar to the evaluator. She was preceded by another respondent with an ideology opposite to the person who provided the exact same sample; this participant reported that that vial had “gone rancid” and suggested it needed to be replaced. In this way, different participants experienced the exact same stimulus in radically different ways only moments apart.

It is also useful to discuss how the presumed mechanism might operate in a complex world, which not only further presupposes the existence of the phenomenon, but also gives you the opportunity to cite prominent researchers in your field.

As editors and reviewers of such a paper, you might want to set aside any nitpicky qualms about extraordinary claims requiring extraordinary evidence, because by engaging in such skepticism you are probably discouraging creativity and squelching innovation.  No one likes a boring spoilsport!  Evaluate the paper on the basis of how potentially mind-blowing the idea would be (assuming it’s true), the reputation of the authors, and how quickly you would run down the hall to tell your colleagues about the finding (assuming it’s true).

Is the paper accepted yet?  It is?

Next stage: Press release!

The press release is where the rubber really hits the realis.  It is the key cog in the Innuendo Machine.  For publications in top journals, the press release is usually taken care of by the publisher (in this case, Wiley), who are typically more than willing to sex-up the claims to a sufficient level of virality.  For instance, they might say that your finding suggests that

…one of the reasons why so many spouses share similar political views is because they were initially and subconsciously attracted to each other’s body odor.

Ah, “because”, beautiful because!  Ah, the allure of the subconscious mind!  Ah, the thousand domestic mini-dramas that will ensue among spouses with incompatible politics or incompatible body odors!

In contrast to the usual narrative in which a scientist’s cautious claims are exaggerated by the media, here we see it come about through the collusion of:

  • the authors (overgeneralizing claims in the article and in the press release)
  • the editor (giving the paper the journal’s seal of approval despite shaky evidence)
  • the publisher (sexing up the findings for public consumption)

But the Media Puts the Icing on the Cake

As noted above, none of the effects in the article reached statistical significance, at least as conventionally defined.  Gelman even praises the editor for not relying on conventional significance levels (though in fairness, he does think that the article is bunk).  But apparently, the media hasn’t gotten the memo about The New Statistics.  Here’s the Washington Post:

[Screenshot of the Washington Post describing the effect as “small but significant”]

I suppose that is correct, if by “significant” you mean p<.1, one-tailed.  Innuendo in action.

Anatomy of a statistical artifact: Eudaimonic well-being and genomics

DATAHOWLER is a believer in post-publication peer review, and so in this blog post we will dive into the controversy surrounding a 2013 PNAS study by Barbara Fredrickson and colleagues that showed an association between a particular type of psychological well-being (“eudaimonic well-being”) and gene expression profiles. In short, the study presented evidence that people “striving toward meaning and a noble purpose” in life showed genetic profiles more strongly associated with good physical health.  Recently, Nick Brown and colleagues published a critique of this study, documenting a variety of apparent methodological and statistical flaws. The original authors responded in kind, cataloguing perceived errors in the Brown et al. analysis (see this response letter and PDF of presentation slides). Neuroskeptic has also jumped into the fray, bolstering Brown et al.’s case through additional simulations, and Nick Brown has offered further defense of his and his colleagues’ critique on his blog.

The criticisms and simulations of Brown et al. and Neuroskeptic are persuasive, but I think an even stronger case can be made against the original finding by examining the dataset itself and by walking through the steps in the original analysis.  One of my main goals here is to cut through the maelstrom of technical discussion and lay bare the simple but critical flaw in Fredrickson et al.’s analysis: namely, the inappropriate use of a t-test.  I will also introduce a more formal test for statistical independence, and show that Fredrickson et al.’s gene expression data dramatically fail this test.

But merely showing that assumptions have been violated is not enough. Even if a particular analysis has a high false positive rate, this doesn’t imply that any given positive finding using that analysis is itself false.  It is important to see whether the original finding stands or falls when a more defensible analysis is used. As we will see below, analyzing the data in the right way gets rid of the evidence for any association between eudaimonic well-being and genomics.

If you want to follow along or reproduce my analyses, all of the R functions are freely available in the package RR53 on github.  The script for the particular analyses presented here is included at the bottom of this blog post.  (In passing, I would like to thank Hadley Wickham and his fabulous devtools, which made the bundling and sharing of my code extremely easy. Stefan Milton Bache also gets a shout-out for introducing the brilliant pipe operator %>% into R, which you’ll see a lot of in my scripts.)   To install the RR53 package, use these commands:

library(devtools)
install_github("dalejbarr/RR53")

Note that in addition to devtools, you will also need to have the dplyr and tidyr packages installed.  I will be working with Brown et al.’s version of Fredrickson et al.’s dataset, which is available as a CSV file on the PNAS website.

Structure of the Fredrickson et al. dataset

[Figure: schematic of the dataset, with a matrix of gene expression data and a matrix of psychometric and control variables]

The dataset, schematized above, involved two matrices of data: a matrix of gene expression data and a matrix of psychometric data (and control variables). Each row in each matrix represents a separate subject (also indicated by color, for reasons that will become clearer below); each column is a separate measurement.

From the blood samples of 80 adult subjects the researchers obtained measurements of the transcription profiles of 53 different genes. Each gene is a column in the left matrix, and is represented by a different symbol (e.g., box, triangle, diamond).

The main hypothesis was that people higher in eudaimonic well-being would show gene transcription profiles associated with better physical health.  Fredrickson et al. characterize eudaimonic well-being as “striving toward meaning and a noble purpose beyond simple self-gratification,” which is contrasted with hedonic well-being, “the sum of an individual’s positive experiences” (p. 13684).  To measure these two constructs, the same subjects filled out a 14-item questionnaire, the items of which were broken down into two factors, each represented by a single variable in the analysis (E and H in the figure).  The authors also took 15 other demographic and psychological measurements (age, sex, depression, etc.), which would serve as control variables in the analysis (variables I through W in the figure).

The “RR53” analysis procedure

Fredrickson et al. performed an analysis to test for an association between eudaimonic well-being (variable E) and participants’ gene expression profiles. The more statistically minded readers will recognize this as a problem demanding some kind of multivariate analysis: unlike the usual case in which you have just one response variable, you have 53 of them, and you would like to regress this 53-element vector on a set of 17 predictor variables. Fredrickson et al. opted to treat this multivariate problem as a set of 53 independent univariate problems.  I will follow Brown et al.’s lead in calling this univariate strategy RR53, for “Repeated Regression x53.”  The procedure is schematized below:

[Figure: schematic of the RR53 analysis procedure]

The authors individually regressed the expression data for each of the 53 genes (each column in the left matrix) on the two positivity factors plus a set of control variables (the right matrix). The 53 regression formulae are shown in the middle panel of the above figure.  The regressions estimated the critical coefficients β1 and β2, associated with the eudaimonic and hedonic factors respectively. For simplicity, we will only focus on the eudaimonic coefficients (the β1s); but keep in mind that the same procedure was done with the hedonic coefficients (the β2s).

The 53 regressions yielded 53 different estimates of the association between eudaimonic well-being and gene activity, which were then combined into a mean value (bottom panel of the figure; note that the β values were contrast coded before being combined, which is not represented in the figure).  The authors then used a one-sample t-test to test the null hypothesis that the mean of β1 was zero (lower panel of the above figure).  The standard errors going into the denominator of the t-statistic were obtained by bootstrap resampling.  They compared the observed statistic to a t-distribution with 52 degrees of freedom, and rejected the null with p=.0045.
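To make the procedure concrete, here is a hedged sketch of the RR53 logic in R. This is not the authors’ code: the data frame dat, the character vectors gene_cols and control_cols, and the columns E and H are hypothetical stand-ins, and I use an ordinary one-sample t-test where the original used contrast-coded coefficients and bootstrapped standard errors.

# regress each gene on E, H, and the controls; collect the 53 eudaimonia coefficients
beta_E <- sapply(gene_cols, function(g) {
  f <- reformulate(c("E", "H", control_cols), response = g)
  coef(lm(f, data = dat))[["E"]]
})
# test whether the mean of the 53 coefficients differs from zero
t.test(beta_E, mu = 0)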

What is the problem with this analysis?

A t-test is a univariate parametric test, and parametric tests depend on the assumption that the individual observations are “i.i.d.”: independently and identically distributed. But the individual βs are only independent if the analyses used to estimate them are all independent. If there is any correlation among the activity levels of the 53 genes, then the βs will also be correlated, because each gene is regressed on the exact same predictor values in each of the 53 analyses.  (Using bootstrap resampling to estimate the standard errors for the t-test doesn’t solve the problem, because the values being resampled are themselves not independent; the non-independence arises a step earlier, in the 53 individual regressions.)

How can we find out whether the activity levels of the genes are correlated?  We can derive a correlation matrix that captures all of the pairwise correlations in the activity profiles of our 53 genes (the R function cor() does this). As we are interested in the strength of these correlations and not in their direction, we take their absolute values and then calculate a mean.  Let’s call this statistic the Mean Absolute Correlation (MAC). The MAC value for the Fredrickson et al. data was .2511. But is this high enough to worry about? How do we know whether it’s higher than what we would expect for independent data?
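As a minimal sketch, assuming gene_data is the 80 x 53 matrix of expression values (a hypothetical name; the packaged version lives in RR53), and excluding the diagonal of the correlation matrix:

# mean absolute pairwise correlation among the 53 genes
mac <- function(m) {
  r <- cor(m)
  mean(abs(r[upper.tri(r)]))
}
mac(gene_data)  # .2511 for the Fredrickson et al. data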

One appealing way to test this is to use a resampling technique in which you transform the data to derive a distribution under the null hypothesis that the data are independent. The gene expression data form a matrix with 80 rows (one for each subject) and 53 columns (one for each gene expression). Thus, a given subject’s gene expression profile is a 53-element vector, with the order of the elements corresponding to the “original” labeling of the genes. If we want to get a sense for the magnitude of correlation we would obtain by chance just from data such as these, we can simply shuffle the elements of each of the row vectors.  Our resampling analysis will use the following Monte Carlo procedure (expressed in pseudocode):

repeat many times {
   foreach subject_row {
      randomly rearrange the columns
   }
   re-calculate the MAC value and store it
}

The shuffling of the labels breaks the correlations, and allows us to estimate the distribution of MAC values under the null hypothesis of independence.  To the extent the null hypothesis is false, Fredrickson et al.’s data should appear as an outlier.
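A rough R translation of this pseudocode, reusing the mac() sketch from above (again with the hypothetical gene_data; the working implementation is in the RR53 package):

null_macs <- replicate(10000, {
  shuffled <- t(apply(gene_data, 1, sample))  # shuffle the columns within each subject's row
  mac(shuffled)
})
# proportion of shuffled datasets with a MAC at least as large as the observed one
mean(null_macs >= mac(gene_data))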

 

 

[Figure: null distribution of MAC values from the Monte Carlo simulation]

Fredrickson et al.’s genetic data fail this test in a dramatic way (see above figure).  Our MAC value of .2511 for the original data was never exceeded in 10,000 Monte Carlo simulations, providing extremely strong evidence against the null hypothesis of independence (p<.0001).

How should the authors have analyzed the data instead?

The presence of these correlations means that it is not possible to break the multivariate problem down into 53 independent univariate problems.  The simulations by Brown et al. and by Neuroskeptic show very clearly that RR53’s violation of statistical independence leads to an alarmingly high false positive rate. What would be a better way to analyze the data?

One possibility is to collapse all of the 53 genetic observations for each subject together into a single mean, and then use simple univariate regression. But this analysis might disadvantage the researchers’ hypothesis, because information could get lost in the aggregation.
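As a one-line sketch of that alternative (psych_and_controls is a hypothetical data frame holding E, H, and the control variables, and gene_data is as above):

# collapse the 53 genes into one mean per subject, then fit a single regression
summary(lm(rowMeans(gene_data) ~ ., data = psych_and_controls))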

There is another technique called multivariate multiple regression for dealing with more than one response variable, but I’ve never used it myself, and I get the sense that few other people actually know about it or use it either (the Wikipedia entry on the topic is currently a blank page). A multivariate multiple regression would probably be impractical anyway because it would require the estimation of a very large number of parameters whose dependency structure is unknown.

Fortunately, situations in which it is difficult or impractical to apply a parametric approach are usually the same situations in which nonparametric approaches shine. We can analyze the data using a permutation test, by which we construct a null hypothesis distribution for our test statistic under random transformations to the data itself. Permutation tests, unlike parametric tests, do not require observations to be independent; they make the less stringent assumption that the units being relabeled are exchangeable under the null. Essentially this means that the joint probability distribution of the variables is invariant under all different possible re-labelings of the data (see Wikipedia for a more technical definition).

Under the null hypothesis that eudaimonic/hedonic well-being has no relationship to gene expression profiles, the connection between each subject’s psychometric data and gene expression data is arbitrary; any test statistic we observe is just noise. We can get an estimate for the distribution of this noise by re-calculating the test statistic a large number of times, each time randomly pairing rows from the gene matrix with rows from the psychometric matrix (see figure below, where each subject is a single row and is represented by a single color.)  Note that the dependency structure of the gene data is fully controlled for in this analysis, because each row of genetic data remains intact regardless of what row it gets connected to in the psychometric matrix.

[Figure: schematic of the permutation procedure, randomly re-pairing rows of the gene matrix with rows of the psychometric matrix]

An advantage to this analysis is that we can do the exact same RR53 analysis that Fredrickson et al. did, up to the point where they derived their t test statistic. Thus, we will be using as our test statistic the very same t-value from the original RR53 analysis. But instead of comparing it to the t distribution with 52 degrees of freedom, we will generate a new distribution for this test statistic by re-calculating the statistic many times after randomly reconnecting the rows across the matrices.

But before we can do this analysis, there is one minor problem we have to address: what do we do with the 15 control variables?  Randomizing the connections between the rows of the matrices requires that these connections be exchangeable under the null.  Although the connections between E and H and the gene data are exchangeable, it is possible that relationships exist between the gene data and the control variables (which include things such as age, sex, smoking, illness, etc.), thus violating exchangeability.

We can solve this problem by residualizing the genetic data on the control variables: that is, we regress the data for each gene on the control variables and replace the raw genetic data with the residuals. The links between the residuals (the genetic data after controlling for variables I–W) and the psychometric predictor variables E and H are exchangeable under the null. We re-pair the rows from the two matrices a large number of times (10,000 in the current case), re-computing the t-value after each re-pairing, which gives us a stable estimate of its null-hypothesis distribution.
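Here is a hedged sketch of that scheme (hypothetical object names: gene_data as above, a data frame controls holding variables I–W, and a data frame psych holding E and H; the working implementation is in the RR53 package):

# residualize each gene on the control variables
resid_genes <- apply(gene_data, 2, function(g) resid(lm(g ~ ., data = controls)))

# RR53-style test statistic: one-sample t on the 53 eudaimonia coefficients
rr53_t <- function(genes, psych) {
  betas <- apply(genes, 2, function(g) coef(lm(g ~ E + H, data = psych))[["E"]])
  t.test(betas, mu = 0)$statistic
}

t_obs  <- rr53_t(resid_genes, psych)
# re-pair the rows of the gene matrix with the psychometric rows 10,000 times
t_null <- replicate(10000, rr53_t(resid_genes[sample(nrow(resid_genes)), ], psych))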

 

[Figure: null-hypothesis distributions for the test statistic (t distribution with 52 df in blue; permutation distribution in red)]

To get a p-value, we need to compare our observed t-value of -3.242 to a null-hypothesis distribution (NHD) for the statistic.  Fredrickson et al.’s approach, which wrongly assumes independence of the 53 analyses, uses as its NHD the t distribution with 52 degrees of freedom (blue distribution in the figure).  Comparing the observed test statistic to this distribution gives us a p-value of .003.  In contrast, if we use as our NHD the one we derived by permuting the data (red distribution), which does not assume statistical independence, we obtain a p-value of .390.  So, using an appropriate statistical analysis, the evidence for the link between eudaimonic well-being and gene expression disappears.
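In code, the two p-values come from the same observed statistic compared against the two different NHDs (continuing the sketch above, with the same hypothetical objects):

2 * pt(-abs(t_obs), df = 52)     # parametric NHD: t distribution with 52 df
mean(abs(t_null) >= abs(t_obs))  # permutation NHD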

Why did this happen, and what should be done?

Fredrickson et al. have thus far not responded to Brown et al.’s criticism that their analysis violated statistical independence assumptions. My analysis further strengthens Brown et al.’s case, showing clear, direct evidence for dependency in the genetic data.  Moreover, my reanalysis goes beyond this to suggest that the original evidence for a relationship between eudaimonic well-being and gene expression is almost certainly a statistical artifact.

Fredrickson et al. noted in their reply to Brown et al. that they have replicated their findings with a larger sample.  But without addressing the flaws in their original analysis, the only thing this could possibly show is the authors’ willingness to repeat their mistakes on an even larger data set.

It should have been clear to anyone with a modicum of statistical training that a t-test is an entirely inappropriate strategy for analyzing multivariate dependent data. This alone should have prevented the paper from being published.   How did the editor and reviewers fail to recognize this?  Clues can be found by reading the authors’ description of their RR53 procedure and results in the main text (page 13685):

General linear model analyses quantified the association between expression of each of the 53 CTRA contrast genes and levels of hedonic and eudaimonic well-being [each well-being dimension treated as a continuous measure and adjusted for correlation with the other dimension of well-being and for age, sex, race/ethnicity, body mass index (BMI), smoking, alcohol consumption, recent minor illness symptoms, and leukocyte subset prevalence; SI Methods]. Contrast coefficient-weighted association statistics were averaged to summarize the magnitude of association over the entire CTRA gene set. CTRA gene expression varied significantly as a function of eudaimonic and hedonic well-being (Fig. 2A). As expected based on the inverse association of eudaimonic well-being with depressive symptoms, eudaimonic well-being was associated with down-regulated CTRA gene expression (contrast, P = 0.0045).  In contrast, CTRA gene expression was significantly up-regulated in association with increasing levels of hedonic well-being (P = 0.0047).

What we see here is a lack of transparency in the reporting of methods and results.  The critical information that a one-sample t-test was performed is never mentioned in the main text; instead, it is buried in the supplementary materials.  Departing from convention, the p-values are reported without the observed test statistics or the degrees of freedom for the test.  A reviewer who saw “t(52)=X.XX, p=.0045” instead of just “(contrast, p=.0045)” would have known that a t-test was performed and that 52 degrees of freedom had been assumed, and would therefore have been far more likely to realize that the analysis was inappropriate.

The controversy is not over yet, and it seems doubtful that Fredrickson and colleagues will be able to mount a credible defense against the criticism of statistical dependence.  Time will tell whether the scientific record gets corrected.  In the meantime, it is instructive to compare the reaction of Fredrickson et al. to that of a different group of authors whose PNAS paper also came under criticism for statistical flaws.  As discussed on Retraction Watch, once these latter authors were made aware of problems with their analyses, they immediately retracted their article, stating:

Our reanalyses suggest that if adjustments for this confound are applied, the results for our top finding no longer reach experiment-wide significance. Therefore, we feel that the presented findings are not currently sufficiently robust to provide definitive support for the conclusions of our paper, and that an extensive reanalysis of the data is required. The authors have therefore unanimously decided to retract this paper at this time.

Appendix: Code for the analysis

Trial by p-values: Preliminary thoughts on the Jens Förster report

I’ve just had a quick look at the report (available at Retraction Watch) that led to the investigation of Jens Förster for possible data manipulation. It makes the case that the data in three of Förster’s papers are statistically highly improbable, largely because the means for the levels of various three-level factors all tend to fall in a straight line. There are also claims that the data are far too consistent across independent studies, that the effect sizes are implausibly large, and that the samples are demographically implausible.

It is dry, depressing reading.

From the comments I’ve seen on Retraction Watch and Twitter, some people are already convinced. For my part, I’m reserving judgment until the psychological/statistical community has time to complete its “post-publication peer review” of the report. To stimulate discussion, here are some thoughts I had after a first read:

1. The report analyzes just 3 papers. Förster is a highly productive researcher with 50+ papers. How were these 3 chosen? Were there other papers by Förster that did not show any questionable patterns?

2. The F-test for linearity assumes continuous DVs, but some of the DVs are discrete (from rating scales). The simulations at the end of the report suggest that the test might be robust against violations of this assumption, but are the simulations themselves valid and based on reasonable assumptions?

3. Control studies were selected based on a search for single-factor, 3-level designs. Do the control studies involve the same type of data? Did the selection process for the control studies mimic the selection process that led to the identification of the 3 questionable papers?

4. Could p-hacking give rise to a linear pattern? (this idea is from @bahnik)

If this is going to be a trial by p-values, I hope that we can make sure that Jens Förster gets a fair hearing!