An undergraduate’s experience with replications

A lot of psychologists are in a bit of a tiff right now. I think everyone agrees that replications are important, but there doesn’t seem to be a consensus on how they should be carried out (for many perspectives, see: here, here, here, here, here, here, here). Since Sanjay asked for more perspectives from people who aren’t tenured, I figured I’d write up my experience with replication. Take note, I graduated but I am not in graduate school yet, so I am one vulnerable puppy. Luckily my experience was very civil.

During my junior/senior year fellowship, I ran 2 identical direct replications of a psychophysics experiment and both were disappointing. I wasn’t the first person in the lab to try to replicate it either: the addition of my “failures” made it 5 collective unsuccessful replications. At what point do you throw in the towel and say, “We’re never gonna get it”? I went on to manipulate the stimuli and task and ended up finding some cool results, but the taste of sour data was still in my mouth. The worst part was that I had to slap my “failures to replicate” on a poster and travel cross-country to present them at a conference. I was nervous before presenting, because how are you supposed to explain failures to replicate in psychophysics? It’s not like social psych, where one can point to the specter of “unknown moderators” (no offense, that’s my field now).

So, how did the conference go? Very well I should think. I was not surprised by some reactions I got from viewers when I said those dreaded words, “failed to replicate,” on the order of: “Oh wow, that sucks for them,” “Welp, that’s never good,” “Oh no! He’s in my department…..that’s embarrassing,” “Did you really try 5 times? I would have stopped after 1.” The most stress-inducing part of the whole thing was when the person I was failing to replicate came up and introduced himself. I was expecting hurt feelings, or animosity. What I got was a reasonable reply from a senior in my field. He said, “Well, that’s really too bad. You never got it in 5 tries? Hmmm…. I guess we might have overestimated how robust that effect is. It could be that it is just a weak effect. We’ve moved on since then to show the effect with other stimuli but we haven’t done this exact setup again, maybe we should. Thanks for sharing with me, if you write up the manuscript I’d love it if you sent it to me when it’s done.”

What a reasonable guy. I was expecting bared teeth and a death stare, but what I got was a senior in the field who was open to revising his beliefs.

One thing to note: his comment, “if you write up the manuscript I’d love it if you sent it to me when it’s done (emphasis added)” really highlights the view that replications are likely to be dropped if they “fail.” Hopefully this special issue can change the culture and change that if to when. Thanks to Daniel Lakens (@lakens) and Brian Nosek (@BrianNosek) for trailblazing.

Musings on correlations- doubling my sample size doesn’t help much

I’ve recently run an experiment where I train kids on a computer task and see how they improve after a practice session. We want to see if the kids improve more as they get older, and so we calculate the correlation between the kids’ ages (in months) and their improvement scores.¹ If we tested 40 kids and found a correlation of .30, how much should we trust our result? I did some simulations to find out. This was inspired by a recent paper by Stanley and Spence (2014).

A common way to represent the uncertainty present in an estimate is to calculate the confidence interval (usually 95%) associated with that estimate. Shorter intervals mean less uncertainty in the estimate. The calculations for 95% confidence intervals ensure that, in the very long run, 95% of your intervals will capture the true population value. The key here is that only through repeated sampling can you be confident that most of your intervals will be in range of that true population value. For example, if I find my correlation estimate is r=.30 95%CI [-.01, .56] then presumably the true correlation could be anywhere in that range, or even outside of it if I’m unlucky (you really never know). That can say a lot to some stats people, but I like to see what it actually looks like through simulation.
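For anyone who wants to check that interval by hand, here is a minimal sketch. It assumes the interval comes from the usual Fisher z-transformation approximation (I don't spell out the method above, so treat that as an assumption), and the function name fisher_ci is just something made up for illustration.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation r from n pairs (Fisher z method)."""
    z = math.atanh(r)              # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error on the z scale
    # build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

r, n = 0.30, 40
lo, hi = fisher_ci(r, n)
print(f"r = {r:.2f}, n = {n}: 95% CI [{lo:.2f}, {hi:.2f}]")
# prints: r = 0.30, n = 40: 95% CI [-0.01, 0.56]
```

The interval is not symmetric around .30 because it is built on the transformed scale and then converted back.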

Below are results for 500 samples of 40 participants each when the population correlation is .10, .30, or .50 (signified by the vertical line- that weird p on the axis is called rho) and beside each is an example of what that correlation might look like. You can click the picture to see it larger. Each sample is pulling from the same population, meaning that the variation is only due to sampling error. Each green dot is a sample whose correlation estimate and 95% interval capture the true correlation, red dots are samples that fail to capture the true correlation. As you can see, with 40 participants in each sample there is a lot of variation. Imagine your experiment as picking one of these dots at random.

[Figure: figs 40n]

Some quick observations: 1) most samples fall fairly close to the true value, 2) the range on all of these samples is huge. So with 40 subjects in our sample, a correlation of .30 is green in each of these different populations. How can we know which one we are actually sampling from if we only have our single sample? One commonly proposed solution is to take larger samples. If each sample consisted of 80 participants instead of 40, we would expect the sample heaps to be narrower. But how does it change our interpretation of our example r=.30? With n=80, the 95% CI around .30 ranges from .09 to .49.

[Figure: figs 80n]

Now with n=80 the interpretation of our result only changes slightly. When the true correlation is .10 our r=.30 is still just ever so slightly green; remember, our 95% CI ranged as low as .09. However, when the true correlation is .50 our r=.30 is now red; our 95% CI ranged only as high as .49. But remember, when you only have your single sample you don’t know what color it is. Always remember that it could be red! In the very long run 5% of all samples will be red.
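If you want to play the green dot/red dot game yourself, here is a rough sketch of that kind of simulation. It is not the code behind the figures; it assumes a bivariate normal population and a Fisher z interval, and every name in it (pearson_r, fisher_ci, coverage) is made up for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher z-transformation."""
    z, se = math.atanh(r), 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def coverage(rho, n, n_samples=500, seed=1):
    """Share of samples whose interval captures rho (the 'green' dots)."""
    rng = random.Random(seed)
    green = 0
    for _ in range(n_samples):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y shares a fraction rho of x, so the population correlation is rho
        ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
        lo, hi = fisher_ci(pearson_r(xs, ys), n)
        green += lo <= rho <= hi
    return green / n_samples

for rho in (0.10, 0.30, 0.50):
    print(f"rho = {rho:.2f}: proportion green = {coverage(rho, 40):.3f}")
```

With only 500 samples per population the proportion of green dots will wobble around .95 rather than land on it exactly; it is the very long run that gets you 95%.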

So what is the takeaway from all of this? When we doubled our sample size from n=40 to n=80, our 95% CI shrank from [-.01, .56] to [.09, .49]- at least we can tentatively rule out a sign error² when we double the sample. That really isn’t much. And when you look at the figures, the range of estimates gets smaller for each respective population- but not much changes in terms of our interpretation of that single r=.30. That really sucks. It’s hard enough for me to collect data and train 40 preschoolers, let alone 80. But even if I did double my efforts I wouldn’t get much out of it! That really really sucks.
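To see why doubling buys so little, here is a small sketch that tabulates the interval width around r=.30 for a few sample sizes. It again assumes the Fisher z approximation, where the standard error shrinks with the square root of n, and the sample sizes beyond 80 are only there for illustration.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher z-transformation."""
    z, se = math.atanh(r), 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

widths = {}
for n in (40, 80, 160, 320):
    lo, hi = fisher_ci(0.30, n)
    widths[n] = hi - lo
    print(f"n = {n:3d}: [{lo:.2f}, {hi:.2f}]  width = {widths[n]:.2f}")

# Doubling n shrinks the width by roughly a factor of 1/sqrt(2), not 1/2.
print(f"width ratio (n=80 vs n=40): {widths[80] / widths[40]:.2f}")
```

Under this approximation you need about four times as many kids to cut the interval width in half.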

There is no happy ending to this blog post. This seems bleak! Hopefully other people can replicate my study’s findings and we can pool our knowledge to end up with more informative estimates of these effects.

Thanks for reading, please comment!

 

¹For those who need a refresher on correlations, a correlation can be negative (an increase in age corresponds with a decrease in improvement score) or positive (increase in age -> increase improvement score). The range goes from -1, meaning a perfect negative correlation, to +1, meaning a perfect positive correlation. Those never actually happen in real experiments.

²A sign error is claiming with confidence that the true correlation is positive when in fact it is negative, or vice versa.

Practice Makes Perfect (p<.05)

What’s wrong with [null-hypothesis significance testing]? Well… it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! (Cohen 1994, pg 997)

That quote was written by Jacob Cohen in 1994. What does it mean? Let’s start from the top.

A null-hypothesis significance test (NHST) is a statistical test in which one wishes to test a research hypothesis. For example, say I hypothesize that practicing improves performance (makes you faster) when building a specific Lego set. So I go out and collect some data to see how much people improve on average from a pretest to a post-test- one group with no practice (control group) and another group with practice (experimental group). I end up finding that people improve by five minutes when they practice and they don’t improve when they don’t practice. That seems to support my hypothesis that practice leads to improvement!

[Image: legos]

Typically, however, in my field (psychology) one does not simply test their research hypothesis directly. First one sets up a null-hypothesis (i.e., H0, typically the opposite of their real hypothesis: e.g., no effect, no difference between means, etc.) and collects data trying to show that the null-hypothesis isn’t true. To test my hypothesis using NHST, I would first have to imagine that I’m in a fictitious world where practicing on this measure doesn’t actually improve performance (H0 = no difference in improvement between groups). Then I calculate the likelihood of finding results at least as extreme as the ones I found. If the chance of finding results at least as extreme as mine is less than 5%, we reject the null-hypothesis and say it is unlikely to be true.

In other words, I calculate the probability of finding a difference of improvement between groups of at least 5 minutes on my Lego building task- remember, in a world where practicing doesn’t make you better and the groups’ improvements aren’t different- and I find that my probability (p-value) is 1%. Wow! That’s pretty good. Definitely less than 5% so I can reject the null-hypothesis of no improvement when people practice.
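Here is a minimal sketch of what that calculation can look like. The improvement scores are invented for illustration (they are not real data), and I'm swapping in a one-sided permutation test because it makes the "fictitious world" explicit: if practice does nothing, the group labels are arbitrary, so we shuffle them and see how often a difference at least as big as ours turns up. The function names are made up too.

```python
import random

# Hypothetical improvement scores in minutes (pretest minus post-test time).
# These numbers are invented so that the group difference is 5 minutes.
control = [0.5, -1.0, 1.5, 0.0, -0.5, 1.0, -1.5, 0.0, 0.5, -0.5]
practice = [4.0, 6.5, 5.0, 3.5, 7.0, 4.5, 5.5, 6.0, 3.0, 5.0]

def mean(values):
    return sum(values) / len(values)

def permutation_p(a, b, n_perm=10000, seed=0):
    """One-sided p-value: P(difference >= observed) when labels are arbitrary."""
    rng = random.Random(seed)
    observed = mean(b) - mean(a)
    pooled = a + b
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel people at random, as if H0 were true
        diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
        if diff >= observed:
            at_least_as_extreme += 1
    # add 1 to numerator and denominator so the estimate is never exactly zero
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)

observed, p = permutation_p(control, practice)
print(f"observed difference = {observed:.1f} minutes, p = {p:.4f}")
```

With these made-up scores the p-value comes out far below 5%, but notice what it is a probability of: data like mine in the world where H0 is true, and nothing else.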

But what do I really learn from a significance test? A p-value only tells me the chance that I should find data like mine in a hypothetical world, a world that I don’t think is true, and I don’t want to be true. Then when I find data that seem unlikely in a world where H0 is true, I conclude that it likely isn’t true. The logic of the argument is thus:

If H0 is true, then this result (statistical significance) would probably not occur.

This result has occurred.

Then H0 is probably not true [….] (Cohen, 1994 pg 998)

So: if it’s unlikely to find data like mine in a world where H0 is true, then it is unlikely that the null-hypothesis is true. What we want to say is how likely our null-hypothesis is by looking at our data. That’s inverse reasoning though. We don’t have any information about the likelihood of H0, we just did an experiment where we pretended that it was true! How can our results from a world in which H0 is true provide evidence that it isn’t true? It’s already assumed to be true in our calculations! We only make the decision to reject H0 because one day we arbitrarily decided that our cut-off was 5%, and anything smaller than that means we don’t believe H0 is true.

Maybe this will make it more clear why that reasoning is bad:

If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)

This person is a member of Congress.

Therefore, he is probably not an American. (ibid)

That’s the same logical structure that the null-hypothesis test takes. Obviously incoherent when we put it like that right?

This problem arises because we want to say “it is unlikely that the null-hypothesis is true,” but what we really say with a p-value is, “it is unlikely to find this extreme of data when the null-hypothesis is true.” Those are very different statements. One gives a likelihood of a hypothesis given a data set, P( Hypothesis | Data) and the other gives a likelihood of data given a hypothesis, P( Data | Hypothesis). No matter how much we wish for it to be true, the two probabilities are not the same. They’re never going to be the same. P-values will never tell us what we want them to tell us. We should stop pretending they do and we should acknowledge the limited inferential ability of our NHST.
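To make the gap between the two probabilities concrete, here is a small sketch using Bayes' rule. Everything numeric in it is an assumption picked for illustration: the prior probability that H0 is true, the chance of a significant result when H0 is false (power), and the function name. It also uses the 5% cut-off in place of an exact p-value, so it is about "a significant result" rather than one particular data set.

```python
def prob_h0_given_significant(prior_h0, alpha=0.05, power=0.80):
    """P(H0 | significant result) from Bayes' rule, under assumed inputs."""
    # P(significant) = P(sig | H0) * P(H0) + P(sig | not H0) * P(not H0)
    p_significant = alpha * prior_h0 + power * (1 - prior_h0)
    return alpha * prior_h0 / p_significant

# Same 5% cut-off every time, very different answers about H0:
for prior_h0, power in [(0.5, 0.80), (0.9, 0.80), (0.9, 0.50)]:
    post = prob_h0_given_significant(prior_h0, power=power)
    print(f"P(H0) = {prior_h0}, power = {power}: P(H0 | significant) = {post:.3f}")
```

The point of the sketch is only that P( Hypothesis | Data) needs ingredients that a p-value never sees, so the same significant result can leave H0 anywhere from quite unlikely to close to a coin flip.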

 

Thanks for reading, comment if you’d like.

 

Using journal rank as an assessment tool- we probably shouldn’t do it

This is my summary of Brembs, Button, and Munafo (2013), “Deep impact: unintended consequences of journal rank.” Main points I took from the paper: 1) Some journals get their “impact factor” through shady means. 2) How does journal rank relate to reliability of results and rate of retractions? 3) Do higher ranking journals publish “better” findings? 4) What should we do if we think journal rank is a bunk measure?

1) How do journals get their impact factor (IF) rank? It’s a count of the number of citations that publications in that journal get per the number of articles in the journal- and a higher impact factor is seen as more prestigious. Apparently some journals are negotiating their IF and inflating it artificially. There is quite a bit of evidence that some journals inflate their ranking by changing what kinds of articles count for their IF, such as excluding opinion pieces and news editorials. Naturally, if you reduce how many articles count towards the IF but keep the number of citations constant, you get a higher ratio of citations to articles. It gets worse though: a group of researchers purchased the data from journals in an attempt to manually calculate their impact factor, and their numbers were sometimes off by up to 19% from what the journal claims! So even if you know all the info about citations and articles in a journal, you still can’t figure out their IF. Seems kinda fishy.
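The arithmetic behind that inflation is simple enough to sketch. The numbers below are entirely made up (no real journal), and the function is a simplified citations-per-article ratio rather than the official calculation.

```python
def impact_factor(citations, countable_articles):
    """Simplified impact factor: citations divided by articles that count."""
    return citations / countable_articles

citations = 1000          # hypothetical citations to the journal
all_items = 200           # hypothetical items published, of every kind
opinion_and_news = 50     # hypothetical items negotiated out of the count

before = impact_factor(citations, all_items)
after = impact_factor(citations, all_items - opinion_and_news)
print(f"IF counting everything: {before:.2f}")
print(f"IF after excluding opinion/news pieces: {after:.2f}")
```

Nothing about the journal's science changed between those two lines, only the denominator.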

2) Brembs and colleagues looked at the relation a journal’s rank had to both retraction rates and decline effects. Rates of retraction in the scientific literature have gone up drastically recently, and now the majority of all retractions are due to scientific misconduct, purposeful or otherwise. They found a strong correlation between a journal’s impact factor and retraction rate (figure 1d):

[Image: figure 1d from Brembs et al.]

As we can see, as a journal’s impact factor rises so too does its rate of retractions. Why this happens is likely a mixture of two things: the push to publish in high-ranking journals increases the unreliability of findings, and papers in those journals are more visible. If more people see your paper, there is a better chance someone is going to catch you out. A popular case right now is the retraction of a publication in Nature of a novel acid bath procedure that can create certain types of stem cells. It went through 9 months of peer-review, and yet it only took a handful of weeks for it to be retracted once everyone else got their turn at it. It turns out that one of the authors was reproducing figures and results from other work they had done in the past that didn’t get much press.

The decline effect is an observation that some initially strong reported effects (say a drug’s ability to treat cancer) can gradually decline as more studies are done, such that the initial finding is seen as a gross overestimate- and the real effect is estimated to be quite small or even zero. Here I’ve reproduced figure 1b from Brembs et al., showing a plot of the decline of the reported association between carrying a certain gene and your likelihood to succumb to alcoholism. The size of the bubbles indicates the relative journal impact factor and the higher on the y-axis the bubble is, the stronger the reported association. Clearly, as more data come in (from the lower impact journals) there is less and less evidence that the association is as strong as initially reported in the high impact journals.

[Image: figure 1b from Brembs et al.]

So what should we take from this? Clearly there are higher rates of retractions in high impact journals. Additionally, some initial estimates reported in high impact journals lend themselves to a steep decline in their evidential value as smaller impact journals report consistently smaller effects as time goes on. Unfortunately, once the media gets hold of the big initial findings from prominent journals it’s unlikely the smaller estimates from less known journals get anywhere near the same press.

3) There is a perception that higher ranking journals publish more important science. There is a bit of evidence showing that a publication’s perceived importance is tied to its publishing journal’s impact factor, and experts rank papers from high impact journals as more important.* However, further investigation shows that journal ranking only accounts for a small amount of a paper’s number of citations (R² = .1 to .3). In other words, publishing in a high impact journal confers a small benefit on the number of citations a paper garners, likely due more to the effects high impact journals have on reading habits than due to the higher quality of the publications.

4) Brembs et al recommend that we stop using journal rank as an assessment tool, and instead “[bring] scholarly communication back to the research institutions … in which both software, raw data and their text descriptions are archived and made accessible (pg 8).” They want us to move away from closed publication that costs up to $2.8 billion annually to a more open evaluation system.

Overall I think they make a strong case that the commonly held assumptions about journal rank are misguided, and we should be advocating for a more open reporting system. Clearly the pressures of the “publish-or-perish” culture in academia right now are making otherwise good people do shady things (and making it easier for shady people to get away with what they’d do anyways). That’s not to say the people involved aren’t responsible, but there is definitely a culture that encourages subpar methods and behavior. The first step is creating an environment in which people are comfortable publishing “small” effects and where we encourage replication and combination across multiple findings before we make any claims with relative certainty.

*However, in that study they didn’t mask the name of the journal that the papers were published in, so there could be confounding subjective valuations from the experts on the paper’s perceived importance.

“New” Statistics and Research Integrity

In about a week I’ll be leading a journal club discussion on this paper, “The New Statistics: Why and How.” I think it would behoove me to do a quick run through the paper before we get to the seminar table so I don’t get lost.

The main focus points from the paper:

1. We should promote research integrity by addressing three problems.

First, we should make sure all research results are published (if not in a big name journal then at least in an online repository). If only big, exciting findings make it into journals we’ll have a biased set of results which leave us misinformed. Second, we need to avoid bias in data selection. Namely, we need to denote which results we predicted ahead of time and which we found after peeking at the data. The reason this is a problem is that many times the distinction isn’t made between prespecified and post-hoc analyses, allowing researchers to report results that might simply be lucky patterns. Third, we should do more replication and always report it. If all the “failed” replications never get reported, it seems reasonable to assume that some of the published literature has overestimated the size and reliability of their results. If we make a go at the same procedure and find much smaller (or larger!) effects, by reporting the results we paint a more realistic picture of the findings.

2. We should switch our thinking from, “I must find statistically significant results,” to “I should try to be as precise in my estimate as possible.”

The best way, says Cumming, is to move entirely away from the thinking involved in trying to deny a null-hypothesis (typically the opposite of what we really want to claim) that we never actually believed in the first place, and that is most certainly known to be false from the outset. For example, if we want to show that men have higher levels of testosterone than women and find men avg. 80mg vs women avg. 50mg in a blood sample, we wouldn’t test the likelihood of our actual hypothesis. We would first need to set up a hypothesis we want to disprove- that men and women are not different in testosterone levels, then we would calculate the chance of finding data as extreme as or more extreme than the ones we found. In other words, we instead have to ask “What is the chance that we would find a result as extreme as or more extreme than the one we found, if we assume they actually don’t differ at all?” Kinda hard to wrap your head around, right?

That’s what a p-value describes. So if we find there is only a 1% chance of finding data as extreme as ours in a hypothetical world where there is no real difference, then we say “our result is statistically significant, p <.05, so we reject the null-hypothesis that men and women have equal testosterone levels.” Note that this doesn’t actually tell us anything about the likelihood of our hypothesis– namely, that men have higher levels. It only tells us the likelihood of finding our data if we assume that there is no difference between men and women. It also doesn’t say anything about how big the difference is between men and women. This method is confusing because it relies on calculations that take into account things we don’t actually observe or think.

Cumming’s answer is to do away with p and simply report how big our effects are and then convey how precise our measurements are in the form of a confidence interval, usually set at 95%. Back to testosterone: if we found that men had 30 +/- 5 mg higher testosterone than women, then that statement conveys both the size of our effect (30 mg) and the amount of uncertainty we have about the data (it could be off by 5 mg in either direction). Cumming thinks that this method is superior in every way to the significance tests that are so popular because it reports more information in an easier to digest format. It also lends itself to more reliable estimates of how our results would turn out if we conducted the experiment again. The p-value from a replication of an original study with a p of .05 can range from <.001 to about .40. That’s a huge range! You could get lucky and find very strong evidence or you can be unlucky and never have a chance in hell of getting results published. Alternatively, if the original report estimated an effect of 30 +/- 5 then there is an 83% chance that a replication study will find a value between 25 and 35. That does seem a lot more informative than p.
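The 83% figure can be checked with a quick sketch. This is my own reconstruction under simplifying assumptions, not code from the paper: the original study and the replication have the same sample size, the estimates are normally distributed with a known standard error, and the true effect of 30 mg is just a stand-in value. All variable names are made up.

```python
import math
import random
from statistics import NormalDist

Z = 1.96                  # multiplier for a 95% interval
se = 5 / Z                # standard error implied by a margin of +/- 5

# Analytic answer: the gap between two independent estimates has standard
# deviation sqrt(2) * se, and the replication lands inside the original
# interval when that gap is smaller than Z * se.
analytic = 2 * NormalDist().cdf(Z / math.sqrt(2)) - 1

# Simulation check: draw an original estimate and a replication estimate.
rng = random.Random(42)
true_effect = 30.0        # hypothetical true difference in mg
n_pairs = 20000
captured = 0
for _ in range(n_pairs):
    original = rng.gauss(true_effect, se)
    replication = rng.gauss(true_effect, se)
    captured += abs(replication - original) < Z * se
simulated = captured / n_pairs

print(f"analytic: {analytic:.3f}, simulated: {simulated:.3f}")
```

It is less than 95% because the original interval is centered on an estimate that is itself off by some random amount, and the replication adds its own sampling error on top.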

He goes into a lot more about meta-analysis and effect sizes, but I don’t really want to write anymore since this post is pretty long. Maybe I’ll continue it in another! Thanks for reading.