Lack of Power (and not the statistical kind)

One thing that never really comes up when people talk about “Questionable Research Practices” is what to do when you’re a junior in the field and someone senior to you suggests that you partake. [snip] It can be daunting to be the only one who thinks we shouldn’t drop 2 outliers to get our p-value from .08 to .01, or who thinks we shouldn’t go collect 5 more subjects to make it “work.” When it is 1 vs. 4 and you’re at the bottom of the totem pole, it rarely works out the way you want. It is hard not to get defensive, and you desperately want everyone to just come around to your thinking- but it doesn’t happen. What can the little guy say to the behemoths staring him down?

I’ve recently been put in this situation, and I am finding it to be a challenge that I don’t know how to overcome. It is difficult to explain to someone that what they are suggesting you do is [questionable], at least without sounding accusatory. I can explain the problems with letting our post hoc p-value guide interpretation, or the problems for replicability when the analysis plan isn’t predetermined, or the problems with cherry picking outliers, but it’s really an ethical issue at its core. I don’t want to engage in what I know is a [questionable] practice, but I don’t have a choice. I can’t afford to burn bridges when those same bridges are the only things that get me over the water and into a job.

I’ve realized that this amazing movement in the field of psychology has left me feeling somewhat helpless. When push comes to shove, the one running the lab wins and I have to yield- even against my better judgment. After five months of data collection, am I supposed to just step away and not put my name on the work? There’s something to that, I suppose. A bit of poetic justice. But justice doesn’t get you into grad school, or get you a PhD, or get you a faculty job, or get you a grant, or get you tenure. The pressure is real for the ones at the bottom. I think more attention needs to be paid to this aspect of the psychology movement. I can’t be the only one who feels like I know what I should (and shouldn’t) be doing but don’t have a choice.

Edit: See another great point of view on this issue here http://jonathanramsay.com/questionable-research-practices-the-grad-student-perspective/

Edit 3: Changed some language

Musings on correlations- doubling my sample size doesn’t help much

I’ve recently run an experiment where I train kids on a computer task and see how they improve after a practice session. We want to see if the kids improve more as they get older, and so we calculate the correlation between the kids’ ages (in months) and their improvement scores.¹ If we tested 40 kids and found a correlation of .30, how much should we trust our result? I did some simulations to find out. This was inspired by a recent paper by Stanley and Spence (2014).

A common way to represent the uncertainty present in an estimate is to calculate the confidence interval (usually 95%) associated with that estimate. Shorter intervals mean less uncertainty in the estimate. The calculations for 95% confidence intervals ensure that, in the very long run, 95% of your intervals will capture the true population value. The key here is that only through repeated sampling can you be confident that most of your intervals will capture that true population value. For example, if I find my correlation estimate is r=.30, 95% CI [-.01, .56], then presumably the true correlation could be anywhere in that range, or even outside of it if I’m unlucky (you really never know). That can say a lot to some stats people, but I like to see what it actually looks like through simulation.
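If you want to check an interval like that yourself, here is a minimal Python sketch using the standard Fisher z approximation (this isn’t the exact code behind my numbers, and the function name fisher_ci is just something I made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, conf=0.95):
    """Approximate confidence interval for a correlation via the Fisher z transform."""
    z = np.arctanh(r)                      # map r onto the z scale
    se = 1.0 / np.sqrt(n - 3)              # standard error of z
    crit = norm.ppf(1 - (1 - conf) / 2)    # ~1.96 for a 95% interval
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

print(fisher_ci(0.30, 40))   # roughly (-.01, .56)
```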

Below are results for 500 samples of 40 participants each when the population correlation is .10, .30, or .50 (marked by the vertical line- that weird-looking p on the axis is the Greek letter rho), and beside each is an example of what that correlation might look like. You can click the picture to see it larger. Each sample is pulled from the same population, meaning that the variation is due only to sampling error. Each green dot is a sample whose correlation estimate and 95% interval capture the true correlation; red dots are samples that fail to capture the true correlation. As you can see, with 40 participants in each sample there is a lot of variation. Imagine your experiment as picking one of these dots at random.
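Before the figures, here is a rough sketch of how a simulation like this can be set up (it is not the exact code behind my figures; the bivariate-normal population and the seed are just assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def coverage(rho, n=40, n_samples=500, conf=0.95):
    """Draw many samples from a correlated population and count how often
    the Fisher-z CI around the sample r captures the true rho (the green dots)."""
    cov = [[1.0, rho], [rho, 1.0]]
    crit = norm.ppf(1 - (1 - conf) / 2)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r = np.corrcoef(x, y)[0, 1]
        z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
        lo, hi = np.tanh(z - crit * se), np.tanh(z + crit * se)
        hits += lo <= rho <= hi
    return hits / n_samples

for rho in (0.10, 0.30, 0.50):
    print(rho, coverage(rho))   # each proportion should land near .95
```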

[Figure: 500 simulated samples of n = 40 each, with their correlation estimates and 95% CIs, for true correlations of .10, .30, and .50]

Some quick observations: 1) most samples fall fairly close to the true value, 2) the range on all of these samples is huge. So with 40 subjects in our sample, a correlation of .30 is green in each of these different populations. How can we know which one we are actually sampling from if we only have our single sample? One commonly proposed solution is to take larger samples. If each sample consisted of 80 participants instead of 40, we would expect the sample heaps to be narrower. But how does it change our interpretation of our example r=.30? With n=80, the 95% CI around .30 ranges from .09 to .49.

[Figure: 500 simulated samples of n = 80 each, with their correlation estimates and 95% CIs, for true correlations of .10, .30, and .50]

Now with n=80 the interpretation of our result only changes slightly. When the true correlation is .10 our r=.30 is still just ever so slightly green; remember, our 95% CI ranged as low as .09. However, when the true correlation is .50 our r=.30 is now red; our 95% CI ranged only as high as .49. But remember, when you only have your single sample you don’t know what color it is. Always remember that it could be red! In the very long run 5% of all samples will be red.

So what is the takeaway from all of this? When we doubled our sample size from n=40 to n=80, our 95% CI shrank from [-.01, .56] to [.09, .49]- at least we can tentatively rule out a sign error² when we double the sample. That really isn’t much. And when you look at the figures, the range of estimates gets smaller for each respective population- but not much changes in terms of our interpretation of that single r=.30. That really sucks. It’s hard enough for me to collect data and train 40 preschoolers, let alone 80. But even if I did double my efforts I wouldn’t get much out of it! That really, really sucks.
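Just to see how slowly the interval tightens, here is a tiny sketch (the same Fisher z approximation as before, purely illustrative) that prints the 95% CI around r=.30 as the sample size keeps doubling:

```python
import numpy as np
from scipy.stats import norm

crit = norm.ppf(0.975)            # ~1.96 for a 95% interval
for n in (40, 80, 160, 320):
    z, se = np.arctanh(0.30), 1.0 / np.sqrt(n - 3)
    lo, hi = np.tanh(z - crit * se), np.tanh(z + crit * se)
    print(f"n = {n:3d}: 95% CI [{lo:+.2f}, {hi:+.2f}], width {hi - lo:.2f}")
```

The width shrinks roughly in proportion to 1/sqrt(n), so cutting the interval in half costs roughly four times the participants.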

There is no happy ending to this blog post. This seems bleak! Hopefully other people can replicate my study’s findings and we can pool our knowledge to end up with more informative estimates of these effects.

Thanks for reading, please comment!

 

¹For those who need a refresher on correlations: a correlation can be negative (an increase in age corresponds with a decrease in improvement score) or positive (an increase in age corresponds with an increase in improvement score). The range goes from -1, meaning a perfect negative correlation, to +1, meaning a perfect positive correlation. Those extremes never actually happen in real experiments.

²A sign error is claiming with confidence that the true correlation is positive when in fact it is negative, or vice versa.

Practice Makes Perfect (p<.05)

What’s wrong with [null-hypothesis significance testing]? Well… it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! (Cohen, 1994, p. 997)

That quote was written by Jacob Cohen in 1994. What does it mean? Let’s start from the top.

A null-hypothesis significance test (NHST) is a statistical test in which one wishes to test a research hypothesis. For example, say I hypothesize that practicing improves performance (makes you faster) when building a specific Lego set. So I go out and collect some data to see how much people improve on average from a pretest to a post-test- one group with no practice (control group) and another group with practice (experimental group). I end up finding that people improve by five minutes when they practice and they don’t improve when they don’t practice. That seems to support my hypothesis that practice leads to improvement!


Typically, however, in my field (psychology) one does not simply test their research hypothesis directly. First one sets up a null-hypothesis (i.e., H0, typically the opposite of the real hypothesis: e.g., no effect, no difference between means, etc.) and collects data trying to show that the null-hypothesis isn’t true. To test my hypothesis using NHST, I would first have to imagine that I’m in a fictitious world where practicing on this measure doesn’t actually improve performance (H0 = no difference in improvement between groups). Then I calculate the probability of finding results at least as extreme as the ones I found. If the chance of finding results at least as extreme as mine is less than 5%, we reject the null-hypothesis and say it is unlikely to be true.

In other words, I calculate the probability of finding a difference in improvement between groups of at least five minutes on my Lego-building task- remember, in a world where practicing doesn’t make you better and the groups’ improvements aren’t different- and I find that my probability (p-value) is 1%. Wow! That’s pretty good. Definitely less than 5%, so I can reject the null-hypothesis of no improvement when people practice.
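Mechanically, that calculation usually comes out of something like a two-sample t-test. Here is a minimal sketch with made-up improvement scores (the numbers are invented just to show the moving parts; they are not real data):

```python
import numpy as np
from scipy.stats import ttest_ind

# Invented improvement scores (minutes faster from pretest to post-test).
practice = np.array([6.1, 4.8, 5.5, 7.0, 3.9, 5.2, 6.4, 4.5])
control = np.array([0.4, -0.8, 1.1, 0.2, -0.5, 0.9, 0.0, -0.3])

# Two-sample t-test of H0: the groups' mean improvements are equal.
t, p = ttest_ind(practice, control)
print(f"t = {t:.2f}, p = {p:.4f}")   # p = chance of a gap at least this big
                                     # *if* H0 (no effect of practice) were true
```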

But what do I really learn from a significance test? A p-value only tells me the chance of finding data like mine in a hypothetical world, a world that I don’t think is true, and don’t want to be true. Then when I find data that seem unlikely in a world where H0 is true, I conclude that H0 likely isn’t true. The logic of the argument is thus:

If H0 is true, then this result (statistical significance) would probably not occur.

This result has occurred.

Then H0 is probably not true […]. (Cohen, 1994, p. 998)

So: if it’s unlikely to find data like mine in a world where H0 is true, then it is unlikely that the null-hypothesis is true. What we want to say is how likely our null-hypothesis is, given our data. That’s inverse reasoning, though. We don’t have any information about the likelihood of H0; we just did an experiment where we pretended that it was true! How can our results from a world in which H0 is true provide evidence that it isn’t true? It’s already assumed to be true in our calculations! We only make the decision to reject H0 because one day we arbitrarily decided that our cut-off was 5%, and anything smaller than that means we don’t believe H0 is true.

Maybe this will make it clearer why that reasoning is bad:

If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)

This person is a member of Congress.

Therefore, he is probably not an American. (ibid)

That’s the same logical structure that the null-hypothesis test takes. Obviously incoherent when we put it like that, right?

This problem arises because we want to say “it is unlikely that the null-hypothesis is true,” but what we really say with a p-value is, “it is unlikely to find data this extreme when the null-hypothesis is true.” Those are very different statements. One gives a likelihood of a hypothesis given a data set, P(Hypothesis | Data), and the other gives a likelihood of data given a hypothesis, P(Data | Hypothesis). No matter how much we wish for it to be true, the two probabilities are not the same. They’re never going to be the same. P-values will never tell us what we want them to tell us. We should stop pretending they do, and we should acknowledge the limited inferential ability of NHST.
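To see just how different those two conditional probabilities can be, here is the Congress example with some rough, ballpark numbers (the population and Congress counts are my own approximations, not from Cohen):

```python
# Approximate figures, just to make the Congress example concrete.
americans = 330_000_000    # rough U.S. population
congress = 535             # members of Congress, all of whom are American

# P(member of Congress | American) is tiny, so the first premise is true...
p_congress_given_american = congress / americans
print(f"P(Congress | American) ~ {p_congress_given_american:.8f}")

# ...but P(American | member of Congress) is essentially 1, so the conclusion
# "this member of Congress is probably not American" is absurd. Flipping the
# conditional changes everything: P(Data | H0) is not P(H0 | Data).
print("P(American | Congress) = 1.0")
```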

 

Thanks for reading, comment if you’d like.

 

“New” Statistics and Research Integrity

In about a week I’ll be leading a journal club discussion on this paper, “The New Statistics: Why and How“. I think it would behoove me to do a quick run-through of the paper before we get to the seminar table so I don’t get lost.

The main focus points from the paper:

1. We should promote research integrity by addressing three problems.

First, we should make sure all research results are published (if not in a big-name journal, then at least in an online repository). If only big, exciting findings make it into journals, we’ll have a biased set of results which leaves us misinformed. Second, we need to avoid bias in data selection. Namely, we need to denote which results we predicted ahead of time and which we found after peeking at the data. The reason this is a problem is that many times the distinction isn’t made between prespecified and post-hoc analyses, allowing researchers to report results that might simply be lucky patterns. Third, we should do more replication and always report it. If all the “failed” replications never get reported, it seems reasonable to assume that some of the published literature has overestimated the size and reliability of its effects. If we make a go at the same procedure and find much smaller (or larger!) effects, reporting the results paints a more realistic picture of the findings.

2. We should switch our thinking from, “I must find statistically significant results,” to “I should try to be as precise in my estimate as possible.”

The best way, says Cumming, is to move entirely away from the thinking involved in trying to deny a null-hypothesis (typically the opposite of what we really want to claim) that we never actually believed in the first place, and which is almost certainly false from the outset. For example, if we want to show that men have higher levels of testosterone than women and find men average 80 mg vs. women’s 50 mg in a blood sample, we wouldn’t test the likelihood of our actual hypothesis. We would first need to set up a hypothesis we want to disprove- that men and women do not differ in testosterone levels- and then calculate the chance of finding data at least as extreme as the ones we found. In other words, we instead have to ask, “What is the chance that we would find a result as extreme as or more extreme than the one we found, if we assume they actually don’t differ at all?” Kinda hard to wrap your head around, right?

That’s what a p-value describes. So if we find there is only a 1% chance of finding data as extreme as ours in a hypothetical world where there is no real difference, then we say “our result is statistically significant, p < .05, so we reject the null-hypothesis that men and women have equal testosterone levels.” Note that this doesn’t actually tell us anything about the likelihood of our hypothesis- namely, that men have higher levels. It only tells us the likelihood of finding our data if we assume that there is no difference between men and women. It also doesn’t say anything about how big the difference is between men and women. This method is confusing because it relies on calculations that take into account things we don’t actually observe or believe.
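For concreteness, here is what that significance test might look like if I invent some summary statistics around the 80 vs. 50 example (the standard deviations and group sizes below are made up; the test itself is a plain two-sample t-test):

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics: the 80 and 50 means echo the example above,
# but the standard deviations and group sizes are invented.
t, p = ttest_ind_from_stats(mean1=80, std1=12, nobs1=60,
                            mean2=50, std2=12, nobs2=60)
print(f"t = {t:.2f}, p = {p:.2g}")   # p = chance of a gap at least this big
                                     # if men and women truly did not differ
```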

Cumming’s answer is to do away with p and simply report how big our effects are, then convey how precise our measurements are in the form of a confidence interval, usually set at 95%. Back to testosterone: if we found that men had 30 +/- 5 mg higher testosterone than women, then that statement conveys both the size of our effect (30 mg) and the amount of uncertainty we have about the data (it could be off by 5 mg in either direction). Cumming thinks that this method is superior in every way to the significance tests that are so popular, because it reports more information in an easier-to-digest format. It also lends itself to more reliable estimates of how our results would turn out if we conducted the experiment again. The p-value of a replication of an original study that found p = .05 can range from less than .001 to about .40. That’s a huge range! You could get lucky and find very strong evidence, or you could be unlucky and never have a chance in hell of getting the results published. Alternatively, if the original report estimated an effect of 30 +/- 5, then there is an 83% chance that a replication study will find a value between 25 and 35. That does seem a lot more informative than p.
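And here is the estimation-style report for the same invented summary statistics: instead of a verdict about H0, we get the size of the difference plus an interval conveying its precision (again, the numbers are made up to echo the 30 +/- 5 example, not taken from Cumming’s paper):

```python
import numpy as np
from scipy import stats

# Same invented summary statistics as in the p-value sketch above.
m1, s1, n1 = 80, 12, 60   # men
m2, s2, n2 = 50, 12, 60   # women

diff = m1 - m2
se = np.sqrt(s1**2 / n1 + s2**2 / n2)      # standard error of the difference
crit = stats.t.ppf(0.975, n1 + n2 - 2)     # critical t for a 95% interval

print(f"difference = {diff} mg, "
      f"95% CI [{diff - crit * se:.1f}, {diff + crit * se:.1f}]")
# roughly 30 +/- 4.3 mg: an estimate and its precision, rather than a yes/no verdict
```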

He goes into a lot more about meta-analysis and effect sizes, but I don’t really want to write any more since this post is pretty long. Maybe I’ll continue it in another! Thanks for reading.