6/1-6/7

I took a bit of a vacation and break. I needed to recharge my batteries and I was feeling somewhat burned out. I think this problem probably comes from not taking more little breaks here and there. I think I learned a valuable lesson: don’t spend all day every day studying and reading and writing. Take a break!

6/8

Starting up again slowly. I’m going to try to finally get this blog post I’ve been ruminating over for about 1 month out. I’ve thought about what it really needs, and that is a real-world example that can highlight exactly how one might use bayesian updating. Time to start thinking!

6/9

Working on the code for that post. The pretty( ) function is amazing! It works by creating nice axis tick marks that encompass your data range. I tell it how far the lowest data point is and how far the highest is and it automatically creates evenly spaced, nice multiples (i.e., 2,4,6,8, etc) for the ticks. Wow! Game changer for something like this program that is automated to cover any range of data that a user inputs.
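To make the idea concrete, here is a rough Python sketch of "nice" tick selection. It is not R's actual pretty( ) algorithm and will not always return the same ticks; the function name `nice_ticks` and the choice of 1, 2, 5, 10 as the candidate multiples are assumptions made for illustration.

```python
import math

def nice_ticks(lo, hi, n=5):
    """Return evenly spaced 'nice' tick values that cover [lo, hi].

    Hypothetical helper: a simplified analogue of R's pretty(), not a port.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    raw_step = (hi - lo) / n  # step size if we ignored 'niceness'
    power = 10 ** math.floor(math.log10(raw_step))
    # Pick the nice step (1, 2, 5 or 10 times a power of ten) closest to raw_step.
    step = min((m * power for m in (1, 2, 5, 10)),
               key=lambda s: abs(s - raw_step))
    first = math.floor(lo / step)  # last multiple of step at or below lo
    last = math.ceil(hi / step)    # first multiple of step at or above hi
    return [round(i * step, 10) for i in range(first, last + 1)]

ticks = nice_ticks(1.3, 8.7, n=4)
print(ticks)
```

The key property is that the first tick sits at or below the smallest data point and the last tick sits at or above the largest, so any range a user supplies ends up covered.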

6/10

Daniel Lakens made a great point about his motivation for writing blog posts and sharing ideas on blogs and on twitter:

There are people who don’t just listen to me, but start to think for themselves, and then agree to different levels.

I think that’s true for a lot of us who do this. We want to make people think about the issues and topics we bring up. I agree with Daniel that it’s not all about changing people’s minds on some issue (although I am human, so I’d like people to agree with my arguments) but about giving people the option to take what you’ve presented and form their own opinion on it. Perhaps someone will read a blog post of mine and think, yeah I’d like to give that Bayesian thing a try. It’s not about converting the masses, but about giving them the option if they think it matches their intuitions.

Also- I didn’t take my own advice and worked on this blog post and program for about 8 hours today. NEED TO TAKE BREAKS. It’s a marathon.

6/11

Worked for about 2 more hours on blog code. Seems the auto-plots for the binomial program were getting caught up on Inf values. I had the y-axis scale based off of the maximum values on the curve, but it can’t make a y-axis that goes to infinity… Oops. I tried a lot of different ways of getting around this. I tried replacing any infinite values with NA values, and then plotting without the NAs. But that had the side effect of making x and y different lengths, so the plot error’d out. What I ended up doing is just excluding zero and one from the plots altogether. This means when it graphs any beta function, it stops at .001 and .999 instead of zero and one. The upside is that the difference is indistinguishable by eye, and the downside is that it technically isn’t showing the proper curves. For example, the beta(0,0) should have y-values of infinity for both x = 0 and x = 1 and y-values of zero for anything else. When this function gets auto-plotted it now simply shows every value at zero. So in this case it doesn’t quite work perfectly. But really, nobody should be using beta(0,0) for inference so I think it’s okay.
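The workaround can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the actual program: `beta_pdf` and `plot_grid` are hypothetical names, and Beta(0.5, 0.5) is chosen only because its density is infinite at 0 and 1. The sketch requires both shape parameters to be positive, so it does not cover the improper beta(0,0) case.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x, for a > 0, b > 0 and 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def plot_grid(eps=0.001, n=999):
    """Evenly spaced x values from eps to 1 - eps (0 and 1 are left out)."""
    step = (1 - 2 * eps) / (n - 1)
    return [eps + i * step for i in range(n)]

xs = plot_grid()
ys = [beta_pdf(x, 0.5, 0.5) for x in xs]  # illustrative shape parameters

# Because the endpoints are excluded, every y value is finite, x and y stay
# the same length, and the maximum can be used to set the y-axis limit.
y_max = max(ys)
print(len(xs), len(ys), y_max)
```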

6/12

Thinking about how to auto-plot the normal priors/data when the null value that gets specified is outside of the plot margins. Right now I have it set to automatically focus on the limits of the posterior. If the prior is super broad, a lot of it gets cut out. If the likelihood is super broad, same deal. What this means is that there could be a posterior such as ~ N(m = 10, sd = 2) with a null value set to zero, and the plot won’t even show the null since it is ~5 SDs out and the plot really only goes 3 SDs out. To work around this I think I can set up an if( ) clause where it specifically changes the x-axis limits to always include the null. If the null is below the minimum, then set the minimum to be the null. If the null is above the maximum, then set the maximum to be the null. This might not always be pretty but I think it overcomes the problem for now. Will report back when I’ve tried to make it work.
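A minimal Python sketch of that rule, assuming the default window is the posterior mean plus or minus 3 SDs as described above. The function name `x_limits` and its parameters are hypothetical, and `min`/`max` stand in for the if( ) clause.

```python
def x_limits(post_mean, post_sd, null, n_sd=3):
    """X-axis limits centered on the posterior, widened to include the null."""
    lo = post_mean - n_sd * post_sd
    hi = post_mean + n_sd * post_sd
    # If the null falls outside the window, stretch that side out to the null.
    return min(lo, null), max(hi, null)

# Posterior ~ N(10, 2) with the null at zero: the default window is (4, 16),
# so the lower limit gets pulled down to 0.
limits = x_limits(10, 2, 0)
print(limits)
```

When the null already sits inside the default window, the limits come back unchanged.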

6/13

Thinking hard about this blog post. What’s the goal? What’s the structure? How high level will the commentary be? Is this a tutorial, or is this geared as a “look how neat this is”? Hmm.

6/14

Interesting talk on twitter today, ‘imagine that you had to include a “Most Damning Result” section in your paper… How would that change your thinking?’. Well, I know that probably most researchers would be totally benign in this section. This was my prediction: "And our most damning result: It turns out our data is *not quite* normally distributed after all". I mean really. We have a field that has a vivid history of doing whatever it takes to make their data and results look pretty and publishable. Does anyone think that they’re just going to include a section in their paper that kills the result? Or that makes their conclusion uninterpretable? Give me a break. It will be just as meaningful as the “limitations” sections that are full of BS caveats and nothing interesting. Do any limitation sections flat out say, “oh by the way, you can’t conclude a causal direction”? No way, because their conclusions are usually that one CAN conclude causal relationships. So no, I wouldn’t say I’m hopeful for that kind of section.

6/15

I’ve decided I’ll be writing the blog post on the NCAA three-point contest from this year. The Women’s champion beat the Men’s champion. I can use this as a pedagogical example for estimating the shooting efficacy of the Women’s champion. There are four rounds of data, with 25 shots per round. Now, this isn’t a perfect example, because I’m not modeling the dependency in her shots from round to round, how tired she gets as the rounds go on (within and across rounds), or how she gets nervous, or etc etc etc. There are tons of things we could take into account, but toy examples where we ignore all of these other factors are nice for getting the point across.
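A toy version of the round-by-round updating might look like the Python sketch below. It assumes a conjugate beta-binomial model with a uniform Beta(1, 1) starting prior, which is only one possible choice. The per-round makes are placeholders invented for illustration, NOT the actual contest results, and `update_beta` is a hypothetical helper.

```python
def update_beta(a, b, made, attempts):
    """Conjugate update: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return a + made, b + (attempts - made)

SHOTS_PER_ROUND = 25
placeholder_makes = [15, 17, 16, 18]  # made-up numbers, not the real data

a, b = 1, 1  # assumed uniform prior on shooting accuracy
for round_number, made in enumerate(placeholder_makes, start=1):
    a, b = update_beta(a, b, made, SHOTS_PER_ROUND)
    posterior_mean = a / (a + b)
    print(f"After round {round_number}: Beta({a}, {b}), mean = {posterior_mean:.3f}")
```

Each round's posterior becomes the prior for the next round, which is the whole point of the example: yesterday's posterior is today's prior.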

6/16

Jake Westfall wrote an interesting post today (and then had a great discussion on twitter) about power: use the distribution of meta-analytic effects from the literature to inform your power analysis. The post is all about utilizing prior information to inform your study planning. But then it says nothing about how one should use that same prior information when interpreting the results of the study!! So close, seriously, it’s so close to being a Bayesian analysis at this point I’m sure you can smell it. If the only information one has before the study is run is that typical effects in psychology follow some distribution, then surely one could use that information as their reasonable prior. If it satisfies the researcher’s intuitions then it can totally be used as a prior! If someone says, no I have a lot more info about this topic so here is my prior, then fine and dandy. The more the merrier when it comes to opinions and priors. As Richard Morey says in the twitter thread, the fact that people are even debating which priors should be used means the Bayesians have won the fight.

Also- Chris Engelhardt gave me some welcome feedback on the program I’ve been working on. He suggested giving users the option to specify the axes, so that’s what I’ve implemented. They can use the default auto-axes if they want but otherwise they can specify where they want the tick marks to be. Thanks, Chris!

6/17

The P-Curve on the left shows the highest frequency for studies with p-values less than .01. This pattern is inconsistent with a uniform distribution and it is possible to refute the null-hypothesis.

Thus, not all significant results in Psychological Science are type-I errors.

Emphasis added. If this kind of sentence isn’t the hallmark of a needless statistical test, then I don’t know what is. Does anyone *actually believe* that all results in the journal Psychological Science are totally bunk? I mean, sure, they exaggerate their claims and probably overstate the evidence, but *come on.* I’m all for skepticism but that’s on a whole ‘nother level. Oh, here’s a link to the post if you’re interested in reading it. It’s on facebook but I think anyone can read it.

6/18

Stephen Senn posted a short article on twitter, and a particular phrase stood out to me.

I am in perfect agreement with Gelman’s strictures against using the data to construct the prior distribution. There is only one word for this and it is “cheating”.

He is commenting on the idea that one can use the data twice if they are not careful: collect some data and see how it looks (mean, sd, range, etc) and then use part of that information to construct a prior against which one will compare *the same data used to construct it.* This is also called double-dipping in some places. The reason this is a problem is that the machinery of bayesian probability (in most cases) assumes that the prior and likelihood (i.e., data) are exchangeable. The prior is simply old data, and the likelihood is new data. But if the likelihood is used to construct a prior, then new data is being combined with new data, and hence the phrase "using the data twice" or "double-dipping" is appropriate. The prior must be constructed independent of the likelihood, otherwise it’s cheating.

6/19

EJ Wagenmakers shared a new paper coming out of his research group, written with Alexander Ly and Josine Verhagen. The paper explains how Harold Jeffreys approached hypothesis testing and the details therein. Neat paper, glad to see it has been accepted. The paper is incredibly valuable because it dives into Jeffreys’s old book where a lot of foundation work was written, but is hard to access because of its old and nuanced terminology/notation. Very cool work and should be read by anyone interested in foundations and history of bayesian statistics/probability.

6/20

I had a lovely day celebrating my grandparents’ sixtieth anniversary today. My family came in town from all over the country, and we had a wonderful time. No stats today, too busy.

6/21

I saw a definition of “questionable research practices” as anything that increases type-1 error rate. There is more to statistics than that! Some say that bayesian stats can avoid QRPs but that isn’t exactly true. They avoid pretty much all of the traditional ones, but I would say a questionable research practice for a bayesian would be unacknowledged double-dipping (see entry on 6/18). It doesn’t invalidate the math, but the reader is under the impression that the prior and likelihood are independent. If the reader doesn’t know that there was double-dipping then they cannot make adjustments to their interpretation of the data and analysis.

6/22

There’s constant reimagining of this blog post going on in my head. The message can be about subjectivity, concepts, application, so many things. It’s hard to pick and I really really don’t want it to get super long. Including figures helps a lot with reducing length, somewhat counter intuitively. The figures can take up a lot of space, but they give an anchor for the explanations. One can describe the figures and give concreteness to an abstract concept.

6/23

Uri Simonsohn shared a new blog post in which he examines how transitioning to within-subject designs can actually lower power, counter to many calls for increasing power through use of within-subject designs. An interesting post, and a good reminder that one shouldn’t just blindly take advice on statistics. You need to study hard, read a lot, and try to really understand what you’re doing. Otherwise you’ll make a fool out of yourself and find your paper being ripped on a blog.

6/24

Interesting post and subsequent discussion on Deborah Mayo’s blog recently. Tons of comments on this one. The question she poses is straightforward: Is there a way to change one’s prior? As opposed to merely updating it as one usually would to form a posterior. Essentially, she is asking if one can first have a prior, then see some data or result from an experiment, and subsequently go back and change the prior. My answer would be: you can do whatever you want, but changing the prior by taking into account some data should be done through bayes rule, otherwise you lose coherency and you double dip so you lose exchangeability (see above).

6/25

Rolf Zwaan wrote a new blog post about how he thinks psychologists try to achieve coherence in their work. They seek meaning in their results and they twist and turn until they get it. I think it’s funny that psychologists seek coherence in their experimental results but not their statistics. Granted, coherence is a technical term, but they cling to old, outdated, misleading techniques instead of embracing coherence. Just funny.

6/26

Will Gervais wrote a very interesting piece today looking at how PET-PEESE performs under certain conditions. His main interest (from what I can tell) is wanting to know how it performs under heterogeneous parameters where the researcher correctly “guesses” (however that happens) the appropriate sample size to use for the randomly stipulated “true” effect size. Now, there are a few oddities in that piece. Will seems to be endorsing the claim that “the” effect size often doesn’t exist. That is, there typically isn’t 1 underlying effect size for all related experiments, so he samples each individual experiment’s parameter value from a distribution. Then he looks at which experiments are significant, puts them in the PET-PEESE crank, and tries to see how the method performs in terms of accurately concluding there is evidence for an effect or there is no evidence for an effect (there’s that roundabout language again) and what kind of parameter estimates it comes up with.

But this is so weird to me, because he goes on to see how frequently PET-PEESE misses “the” effect! Or how frequently PET-PEESE shows “negative bias” (underestimate “true” effect size). You are working in a simulation where there isn’t a true effect, in any sense of the word. There is an average parameter value (presumably), but that’s not nearly the same thing as a “true” effect. I mean, you pretty much say it in the post that “true” effect sizes vary per experiment, there is no one true effect, no one ring to rule them all, and then go on to show how PET-PEESE isn’t very good at measuring “true” effect sizes. What you should say is that PET-PEESE isn’t very good at finding the average of a distribution of randomly chosen parameters. But is that average parameter the same thing as a “true” effect? I’d say no. Hmm.

This might not be coherent, it’s late and it’s mostly me rambling on. But the point I want to get across is that I don’t think it makes sense to have figures and analysis built on the idea that you have found something biased against a “true” parameter value that does not exist by definition. So the bits related to Hetero +/- PB don’t make sense to me.

6/27

I realized I follow about 15 or 20 blogs, and most of them are run by men (~80%). Surely that’s disproportionate to what I should be following. Base rate of blogger genders aside (which might or might not matter), I feel a little sexist. I didn’t follow mostly men on purpose, but still, this feels important. So I asked on twitter for people to share their suggestions for blogs to follow that can even out this ratio. Man I got a lot of responses! I’ll have to storify the recommendations one of these days.

6/28

Erika Salomon posted today about p-hacking. God I hate that term. Anyway.. Interesting post. She uses one of Ryne Sherman’s R functions, or a modified version?, to simulate p-hacking. What she finds is neat, but not surprising. If you’re not surveying a null effect, it turns out that multiple looks at the data is super efficient. That’s not particularly new (see her edits), but it’s important to remind people. That increase in power and efficiency does not come without cost. If you are surveying a null effect, then you have a madly inflated alpha rate: 19%. Holy Shmokes. So while the procedure works pretty great if you’re mining real ore, if you’re in an empty quarry you’re gonna find a lot of pyrite. Worth it? As always, that depends. I don’t imagine many would go for it, but who knows, stranger things have happened. Oh yeah, I commented on it so read that if you’re interested. I made a stupid mistake in one calculation, so just one of my daily reminders to myself that I’m human.
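The inflation from repeated looks is easy to see in a small simulation. The Python sketch below is not Erika's code or Ryne Sherman's function, and it will not reproduce the 19% figure: the z-test with known SD, the look schedule (every 10 observations up to 50), the seed, and the function names are all assumptions chosen to keep the example short.

```python
import math
import random

def significant(sample):
    """Two-sided z-test of mean 0 (known sd = 1) at alpha = .05."""
    z = sum(sample) / math.sqrt(len(sample))
    return abs(z) > 1.96

def false_positive_rate(looks, n_sims=2000, seed=1):
    """Share of null simulations declared significant at any of the looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = []
        for n in looks:
            while len(sample) < n:          # collect data up to the next look
                sample.append(rng.gauss(0, 1))  # the null is true: mean is 0
            if significant(sample):         # stop as soon as p < .05
                hits += 1
                break
    return hits / n_sims

single = false_positive_rate([50])                   # one test at n = 50
peeking = false_positive_rate([10, 20, 30, 40, 50])  # test after every 10
print(f"one look: {single:.3f}, five looks: {peeking:.3f}")
```

With a single look the rate should land near the nominal 5%, while stopping at the first significant look pushes it well above that, which is the empty-quarry cost described above.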

6/29

Coming up on *six months* of this diary. Can you believe it? I can, I worked hard on this! But really, I feel pretty good about it. Perhaps I’ll do a recap post on the blog for my most interesting entries.

Also- BUSY day on twitter. ~90 tweets or so. Many interesting threads. I’ll link them here if you want to read them. My favorite: Helping Joe come up with an acronym for his potential new method. Super fun. I guess technically that started yesterday, but I’m putting it in today’s anyway. The best one: Hilgard’s Innovative PET PEESE: Systematic Treatment (for) Effect-size Reconstruction (HIPPSTER).

Another thread: Don’t give statistics advice if you don’t know what you’re talking about. This goes for any kind of advice, really.

Another thread: “Participants who sat at a wobbly workstation… saw their romantic relationships to be less likely to last.” GIVE ME A BREAK Psych Science. You are putting out some absurd papers. We politely request that you stop, or alas we will stir up a ruckus.

Another thread: Joe posted some thoughts on PET-PEESE simulations today. Dude’s smart, you should read it. Plus it has one of the best analogies of all time.

Another: Google please.

Another: Interesting conversation about bayesian simulations on the back of Erika’s blog announcement.

Another: The five stages of p-value understanding. From the always excellent JP de Ruiter.

Finally: A new version of the no free lunch paper (that I may or may not have referenced before somewhere in this diary). Quite a lot of discussion here, click on the “view other replies” if you have that option to see the full conversation threads.

What a day.

6/30

Micah Allen wrote a blog today (twitter thread) where he pretty much trashed a recent article from PNAS. The paper’s narrative is based on one key statistical test to show a difference between two groups. One group goes on a nature walk and the other through an urban environment. Method problems aside, Micah points out something particularly laughable. The test in question returned a p=.07 result. This being one of only two (I think) meaningful p values reported in the whole paper! If p values could quantify evidence, this would be weak. They can’t though, so I’ll just say that the results are not convincing to me. Implausible on their face, and backed up with measly results. Blah.

Also- Daniel Lakens messaged me today checking in about stuff. Thankful for the message! Kind words from an awesome dude.