## So how long does it take for BABIP to become reliable?

January 25, 2009

Seems a simple question. We know that BABIP (batting average on balls in play) for pitchers has a low correlation from year to year. As a result, a Sabermetric standard has been that one year in a pitcher’s life tells you little about his actual ability to prevent hits on balls in play, which is true. In statistical terms: one year is not a sufficient sample to get a good estimate of the parameter, primarily because a pitcher only faces a few hundred balls in play each year.

Suppose though that a pitcher’s season lasted billions of plate appearances. Eventually, we’d know exactly how good a pitcher was. If we let him face another billion hitters, he’d come up with the same number again. That sort of sampling frame produces reliable statistics, but it’s a fantasy. We have to deal in reality.

But after looking at year-to-year stats, with the low correlation between BABIP at year 1 and BABIP at year 2 (which has held any which way you try to break it), it’s been assumed that pitchers have no control at all over their BABIP, ever. That’s a big jump, one that I think people make without fully stopping to realize that they’ve made it. (I’ve probably made it myself.) There’s a difference between a parameter being entirely random and it being unobservable given our limited data and the amount of noise present.

The assumption goes that everyone is a .300 pitcher once the ball is in play and doesn’t leave the stadium. After all, if there’s no stability, it must all be random noise. Right? It’s just that no one has ever been really comfortable with that thought. Pitchers don’t differ in their BABIP ability *at all*? Pedro Martinez in his heyday was the equivalent of Mike Bacsik in his heyday? It just doesn’t make sense. Then there is the curious case of Troy Percival (my personal favorite piece of anecdotal evidence). His BABIPs have been consistently below the magic .300 line throughout his entire career, and it’s been a long one. Could it happen by chance? Sure, but perhaps something else is afoot.

Maybe the problem is that we need to widen the sampling frame. Maybe one year doesn’t tell us much about a pitcher’s true talent on BABIP, but what if several years do?

I took 30 years’ worth of Retrosheet data (1979-2008) and dumped it into a giant file. I selected all balls in play (not a strikeout, not a walk, not a home run, not HBP, not one of those weird catcher interference thingies). As I have been wont to do lately, I started running some split-half reliability analyses. I split each pitcher’s batters faced into even and odd numbered appearances: the first PA goes into the odd group, the second into the even group, and so on. That balances out the two halves of a player’s performance, so that each half draws some from year one, some from year two, etc.

For each pitcher who had at least 500 balls in play to give, I started by taking a sample of 500 and splitting it into two 250-BIP halves. I ran a correlation between those two halves across all 1461 pitchers in the sample who fit the criteria. The correlation was .174. So, at 250 BIP, BABIP has a split-half reliability of .174. It’s numbers like that which led to the creation of DIPS theory to begin with.

But let’s expand. Let’s take two samples of 500 BIP. That bumps things up to .253. Hmmmm, getting a bit more reliable. The question becomes: when does it hit that “good enough” point? I’ve argued previously for the use of .70 as the cutoff for reliability. It’s an arbitrary point (I guess in an ideal world, we’d want a reliability of 1.0), but .707 has an R-squared of .50, which means anything north of that accounts for more than 50% of the variance. Can we get to .70?
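The R-squared arithmetic behind that cutoff is just squaring the correlation. A quick sketch, using only the correlations quoted so far (the helper name `variance_explained` is made up for illustration):

```python
def variance_explained(r):
    """R-squared for a correlation r: the share of variance accounted for."""
    return r ** 2


# Correlations quoted in the post so far, plus the .707 benchmark.
for label, r in [("250 BIP", 0.174), ("500 BIP", 0.253), ("benchmark", 0.707)]:
    print(f"{label}: r = {r:.3f}, R-squared = {variance_explained(r):.3f}")
```

At 250 and 500 BIP the squared values are tiny, which is another way of seeing why a single season says so little.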

Turns out that the answer is… yes.

At a sample of 3750 balls in play (a 7500 BIP sample, chopped in half), the split-half reliability was .696. There were 48 pitchers in the last 30 years who had that many BIP to look at: not outstanding, but enough to not discount. At 4000, it reached .742 (in 34 pitchers). So, it only takes about 3800 BIP before we get a reliable read on a pitcher’s BABIP abilities. That’s a lot, but it’s not an obscene amount. In 2008, the average pitcher saw roughly 3 balls in play per inning pitched. At that rate, a starter who throws 180 innings would see about 540 BIP in a year (rough estimates here). So, it would take about seven years, at that same 180 IP per year rate, to get to the required number of BIP. Not easy, but not out of the realm of possibility.
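The back-of-the-envelope math, spelled out as a small sketch. The 3 BIP per inning, 180 innings, and 3800 BIP figures are the rough estimates from the paragraph above; the function name is hypothetical.

```python
def seasons_to_reach(bip_needed, innings_per_season, bip_per_inning=3.0):
    """Rough number of seasons needed to accumulate bip_needed balls in play."""
    bip_per_season = innings_per_season * bip_per_inning
    return bip_needed / bip_per_season


# A 180-inning starter sees about 540 BIP a year at 3 BIP per inning.
starter_years = seasons_to_reach(3800, innings_per_season=180)
print(f"180 IP per year: about {starter_years:.1f} seasons")
```

Plug in a different innings total to see how the wait changes for other workloads; fewer innings per year stretches it out proportionally.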

Now, about those guys who had two matching 4000 BIP samples, there was still some variability in the sample. Andy Pettitte had BABIPs in his twin samples of .318 and .312. Charlie Hough had the other extreme at .248 and .266. So, it looks like there is such a thing as the “ability” to exert some control over what happens to a ball in play. It just takes a while (but not forever) to reveal itself.

This isn’t a very functionally useful finding for evaluating players or predicting what they will do. A pitcher is not the same man at the end of seven years that he was at the beginning (either as a pitcher or a human being). The ability to prevent hits on BIPs may deteriorate over the years, and at that point, we’re using data that are 6 and 7 years old to predict what will happen tomorrow. In a single season, which is really the sampling frame that most fans are concerned about, there will still be a lot of noise around the signal, but the signal is definitely there. Now if we can just get a better radio to pick it up.