In 2001, Voros McCracken published perhaps the most important and influential piece of baseball research ever conducted. After separating pitchers’ lines into defense independent statistics (strikeouts, walks, home runs, hit by pitch) and defense dependent statistics (balls in play), he found that while pitchers have a great deal of control over their defense independent statistics (though to varying degrees), they seem to have little or no control over their defense dependent statistics. Specifically, McCracken found that pitchers’ defense independent lines remained stable from year to year, while their defense dependent line in one year told us almost nothing about what they would do in the next.
Since then, various baseball researchers have refined his methods and both supported and challenged his DIPS (Defense Independent Pitching Statistics) theory. At this point, we know quite a bit about pitchers’ control over whether or not a ball in play becomes a hit. We know that pitchers have quite a bit of control over whether a ball in play becomes a fly ball or a ground ball. We also know that fly balls become outs more often, but that they also become extra base hits more often, and those two basically cancel out. We know that pitchers have some control over Batting Average on Balls in Play (BABIP), but that one season of information on BABIP doesn’t really tell us much. And we know a whole lot more, but that’s not what this post is about.
What this is about is the math behind DIPS. What interests me (and I hope it interests you as well!) is why DIPS works the way it works. I want to know why a pitcher’s BABIP in one year does not seem to have any impact on his BABIP in the next year, especially given that we know that over long periods of time, pitchers do have a discernable impact on BABIP. Let’s look at this from a few different angles:
Year-to-year correlation is the method McCracken used first to arrive at, and then to support, his DIPS theory. Correlation tells us how well two variables track each other. If they have no relationship, the correlation will be 0. If the two variables track each other perfectly, the correlation will be 1. If they track each other perfectly, but in opposite directions (i.e., a +1 change in one means a –1 change in the other), the correlation will be –1. Correlations are bounded between –1 and 1, and as a rule of thumb, a correlation of .7 or better is considered great, .5 or better is good, and anything lower is questionable.
If we take all pitchers with at least 500 BIP in both 2004 and 2005 (58 in all), we find that the correlation between BABIP in 2004 and BABIP in 2005 is only .110. Is that significant? In short, no. The P-value is .409, meaning that if there were truly no relationship between the two variables, we would still see a correlation at least this strong about 41% of the time (generally, statisticians use P = .05 as the threshold of significance). Even if we accept the correlation as real, what it tells us is that one year’s worth of BABIP information tells us almost nothing.
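To see roughly where a P-value like that comes from, here is a minimal sketch. It is not necessarily the calculation behind the .409 figure: it uses Fisher's z-transform, a standard normal approximation, and the function name `correlation_p_value` is made up for illustration. With r = .110 and 58 pitchers it lands at about .41.

```python
import math
from statistics import NormalDist


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson correlation.

    Uses Fisher's z-transform (a normal approximation), which is an
    assumption here, not necessarily the test used in the article.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# r = .110 across the 58 pitchers with 500+ BIP in both seasons
p = correlation_p_value(0.110, 58)
print(f"approximate p-value: {p:.2f}")
```

A sample of 58 is simply too small for a correlation of .110 to stand out from chance.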
That’s because of a concept known as regression to the mean. For example, in his first 12 games of the season, Chris Shelton hit 8 home runs. But no one expects him to hit two home runs every three games the rest of the year. Why? Well, intuitively we know that no one can hit home runs at that kind of pace, and moreover, since most people probably expected him to hit 20-25 home runs all year, we certainly would not expect him to keep hitting this many. Mathematically, this is known as regression to the mean. If we know that Shelton is expected to hit 20-25 home runs every 150 games or so, we know that those 8 home runs in 12 games are pretty damn fluky. In reality, we still expect him to hit a home run every 6-7.5 games. So, in 2006, we’d still expect Shelton to hit somewhere between 30 and 35 home runs, even accounting for the fact that he’s probably a bit better of a home run hitter than we thought, given his hot start.
Mathematically, regression to the mean is determined by correlation. The formula is simple: the fraction of the way we regress toward the mean is (1 – r), where “r” stands for correlation. Take, for example, our sample of pitchers. The average BABIP among all pitchers in 2004 and 2005 was .283. Carlos Zambrano had a BABIP of .266 in 2004. Thus, his predicted BABIP in 2005 would be (1 – .110)*.283 + .110*.266 = .281. Though Zambrano allowed 10 fewer hits on BIP than the average pitcher in 2004, we would still expect his BABIP in 2005 to be only two-thousandths of a point lower than average.
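The Zambrano arithmetic can be written as a tiny function. This is only a sketch of the formula in the paragraph above; the name `regress_to_mean` is made up for illustration.

```python
def regress_to_mean(observed: float, league_mean: float, r: float) -> float:
    """Weight the league mean by (1 - r) and the observed value by r."""
    return (1 - r) * league_mean + r * observed


# Carlos Zambrano: .266 BABIP in 2004, league mean .283, r = .110
projected = regress_to_mean(observed=0.266, league_mean=0.283, r=0.110)
print(f"projected 2005 BABIP: {projected:.3f}")  # 0.281
```

With r this small, the projection barely moves off the league mean no matter what the pitcher did the year before.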
Let’s look at it another way. The standard deviation of BABIP in 2004 was .016 points. Standard Deviation (SD) is a measure of spread: about 68% of all players will be within one SD of the mean, 95% will be within two SD of the mean, and virtually all will be within three. What that means is that in 2004, we’d expect almost every pitcher in our sample to be within .048 points of average, or .283 +/- .048, which is from .235 to .331. In fact, BABIP in our 2004 sample ranged from .240 to .321, so that’s good. However, because we’re regressing 89% of the way to the mean, our predicted BABIP in 2005 would only have a Standard Deviation of less than .002 points! We would expect almost everyone to be within about five-thousandths of a point of average. The spread from best to worst would be less than seven hits!
In reality, though, the spread from best to worst is more like 60 hits. Obviously, some of that is due to luck, but even over large samples, the difference between best and worst is much more than seven hits. In fact, based on research by Erik Allen and Arvin Hsu, the “true” spread from best to worst is about 35 hits. What this tells us is that a sample of just one season is not nearly enough to tell us much about a pitcher’s ability to prevent hits on balls in play. That’s why DIPS works: With one year of information, BABIP is practically (or maybe totally) worthless.
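Here is a short sketch of how the spread shrinks. Because the projection multiplies each pitcher's distance from the mean by r, the SD of the projections is simply r times the observed SD. The 625 BIP used to turn the spread into hits is an assumption on my part (a typical workload for pitchers with 500+ BIP), and the names are made up for illustration.

```python
LEAGUE_MEAN = 0.283
SD_2004 = 0.016
R = 0.110
ASSUMED_BIP = 625  # assumption: a typical season's balls in play


def three_sd_band(mean: float, sd: float) -> tuple[float, float]:
    """Range that should hold virtually every pitcher (mean +/- 3 SD)."""
    return mean - 3 * sd, mean + 3 * sd


observed_band = three_sd_band(LEAGUE_MEAN, SD_2004)      # about .235 to .331
projected_sd = R * SD_2004                                # about .0018
projected_band = three_sd_band(LEAGUE_MEAN, projected_sd)
spread_hits = (projected_band[1] - projected_band[0]) * ASSUMED_BIP

print(f"observed band:  {observed_band[0]:.3f} to {observed_band[1]:.3f}")
print(f"projected SD:   {projected_sd:.4f}")
print(f"best-to-worst:  {spread_hits:.1f} hits")
```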
One great thing about baseball is that a lot of events on the field are binomial. A binomial is any event where there are only two possible outcomes, a success and a failure. On balls in play, there are only two possible outcomes: It’s either a hit or an out (okay, there are errors as well, but for our purposes, those count as outs). What’s great about a binomial is that we have a formula for determining random variance in a binomial, that is, how great a spread (otherwise known as a Standard Deviation; see, earlier concepts prove important) we would expect just based on luck of the draw. For example, if we flip a coin 100 times, we would expect 50 heads and 50 tails, but roughly 32% of the time we would land outside the range of 45 to 55 heads, roughly 5% of the time we would land outside 40 to 60, and almost never would we see fewer than 35 heads or fewer than 35 tails.
The formula for the standard deviation due to random variance in a binomial is simple:
SQRT(Prob(Success)*Prob(Failure)*Number of Trials)
In our coin flip example, that would be, SQRT(.5*.5*100) = 5. Okay now, let’s look at BABIP. There were 179 pitchers who had 500 BIP in 2004 or 2005. Their average BABIP was .285 (the league average over those two years was .283—remarkably close!), with a standard deviation of .019, and an average of 625 BIP. How much of that would be due to random variance? Well, let’s do the math: SQRT(.285*.715*625) = 11.285 hits. 11.285/625 = .018.
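The same arithmetic as a small function, for anyone who wants to plug in other numbers. This is just the formula above; the function name `binomial_sd` is made up for illustration.

```python
import math


def binomial_sd(p_success: float, trials: int) -> float:
    """Standard deviation of the success count from luck alone."""
    return math.sqrt(p_success * (1 - p_success) * trials)


coin_sd = binomial_sd(0.5, 100)          # 5 heads
babip_sd_hits = binomial_sd(0.285, 625)  # about 11.285 hits
babip_sd_rate = babip_sd_hits / 625      # about .018 points of BABIP

print(f"coin flips: {coin_sd:.1f}")
print(f"BABIP: {babip_sd_hits:.3f} hits, or {babip_sd_rate:.3f} points")
```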
That is, random variance accounts for pretty much the whole spread in one year’s worth of BABIP! Of course there will be virtually no year-to-year correlation when there’s so much noise. This is why DIPS works given one-year samples: The noise in BABIP is so powerful that it overpowers any true ability.
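One hedged way to put a number on "so much noise" is to ask what share of the observed variance is left over once the random part is removed. Under textbook assumptions that the article does not state (luck is independent from year to year and true ability is stable), that share is roughly the year-to-year correlation we should expect. The function name `signal_share` is made up for illustration.

```python
def signal_share(observed_sd: float, random_sd: float) -> float:
    """Fraction of the observed variance not explained by random variance."""
    return 1 - (random_sd / observed_sd) ** 2


# .019 observed spread vs .018 expected from luck alone
share = signal_share(observed_sd=0.019, random_sd=0.018)
print(f"share of variance that is signal: {share:.2f}")
```

That works out to about a tenth of the variance, in the same neighborhood as the .110 correlation found in the sample of 58 pitchers. The two figures come from different samples and rounded inputs, so treat the match as suggestive rather than exact.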
Let’s look, for example, at a group of pitchers with a large sample size: At least 5,000 career BIP. There are 312 such pitchers who started their career no earlier than 1946 (from Bob Lemon to Bartolo Colon), and they averaged 7,530 career BIP. Their average BABIP is .277. Doing the math, the expected Standard Deviation among these pitchers will be about .005 points of BABIP. The actual? .012.
In fact, if we square those both to find the variances (this is mathematically necessary; we can’t just subtract standard deviation from standard deviation), and then subtract the random variance from the actual, and then take the square root of that to find the “true” standard deviation, we get .011, remarkably close to Allen and Hsu’s conclusion of .009. The rest can probably be written off as the impact of fielding, which we did not control for but Allen and Hsu did.
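The variance subtraction is easy to get wrong by hand, so here is a minimal sketch of it using the figures quoted above. The function names are made up for illustration.

```python
import math


def random_sd_rate(babip: float, bip: int) -> float:
    """Binomial SD expressed in points of BABIP rather than hits."""
    return math.sqrt(babip * (1 - babip) / bip)


def true_sd(observed_sd: float, random_sd: float) -> float:
    """Subtract variances (not SDs), then take the square root."""
    return math.sqrt(observed_sd ** 2 - random_sd ** 2)


expected = random_sd_rate(0.277, 7530)  # about .005
talent = true_sd(0.012, 0.005)          # about .011, using the rounded figures

print(f"expected SD from luck: {expected:.3f}")
print(f"'true' SD:             {talent:.3f}")
```

Subtracting the SDs directly (.012 - .005 = .007) would understate the true spread, which is why the detour through variances matters.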
So again, over large samples, BABIP is quite meaningful. Over small samples, it don’t mean a thing.
Another way of looking at this issue is by looking at distributions. What we’re interested in here is if groups of players that do well in BABIP one year do well the next, and vice-versa. There’s a mathematical way of doing this called a chi-square test. Here’s how it works: I divided up our sample of 58 players with at least 500 BIP in both 2004 and 2005 into five groups based on their BABIP in 2004. Each group had 11 pitchers but the middle one, which had 13. I then looked at the number of hits on balls in play each group allowed in 2005, versus how many would be expected if everyone had the same exact skill at preventing hits on balls in play. That is, I compared them against a world with no difference in BABIP “skill” between major league pitchers, which was basically Voros’ original postulate.
Let’s look at the table:
              Group 1     Group 2     Group 3     Group 4     Group 5
Observed      2090        2005        2489        2003        2154
Expected      2083.6122   2010.2043   2550.0176   1993.4573   2103.7087
(O-E)^2/E     0.0196      0.0135      1.4600      0.0457      1.2023
What this tells us, basically, is that the players that were best in 2004 at preventing hits on balls in play were actually a bit worse than average at doing so in 2005. Meanwhile, those that were about average in 2004 (Group 3) were the best in our sample in 2005.
The chi-square value is found by subtracting the expected value from the observed, squaring the difference, and then dividing that by the expected value, which is what I’ve done in the third row, and then adding all those numbers together. Our chi-square value is 2.74, which with 4 degrees of freedom (which are determined by subtracting 1 from the number of categories) is nowhere near statistically significant. Specifically, our P-value turns out to be .60.
Even if we group the players, which increases our sample size, and therefore decreases the noise-to-signal ratio, we still find no evidence that one year of BABIP information is at all useful. This is now the third way we have found support for Voros’ hypothesis. This is why DIPS theory works: Because, as the chi-square test shows, with one year worth of BABIP, it’s better to simply replace a pitcher’s actual BABIP with the league average.
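For anyone who wants to check the table, here is a minimal sketch of the calculation using the observed and expected hit totals. The function names are made up for illustration, and the P-value helper relies on a closed form that only works for an even number of degrees of freedom; that is a shortcut to avoid needing a statistics library, not the general method.

```python
import math

observed = [2090, 2005, 2489, 2003, 2154]
expected = [2083.6122, 2010.2043, 2550.0176, 1993.4573, 2103.7087]


def chi_square_stat(obs, exp):
    """Sum of (O - E)^2 / E across all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def chi_square_p_value_even_df(stat: float, df: int) -> float:
    """Upper-tail probability; closed form valid only for even, positive df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut needs an even, positive df")
    half = stat / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


stat = chi_square_stat(observed, expected)
p_value = chi_square_p_value_even_df(stat, df=len(observed) - 1)
print(f"chi-square = {stat:.2f}, P = {p_value:.2f}")
```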
This post is not about whether or not DIPS is right or wrong. I’m certainly not saying that pitchers have no control over the result of Balls in Play, because, well, they do. If you want my thoughts on where they do and do not have control, I recommend you buy The Hardball Times Annual 2006, and read my article with JC Bradbury.
What this post is about is why DIPS theory seems to work. It’s about why Voros McCracken made the findings that he did. The reasons for why something is the way it is are as important, and sometimes more important, than simply knowing what is. They give us explanations for observed phenomena, which allows us to refine and draw further conclusions. For example, when Voros said that pitchers have little or no control over whether or not a BIP becomes a hit, he was wrong. What he should have said is that there is so much luck involved in whether or not a BIP becomes a hit that a one-year sample is practically meaningless.
That’s why no test will find any meaning in a one-year sample. In reality, what he was looking at was not Defense Independent Pitching Statistics (DIPS), but rather Luck Independent Pitching Statistics (LIPS). It’s not that pitchers had no impact on whether Balls in Play became hits or not; it’s that luck had such a large effect that any pitcher control would simply be drowned out.