Is baseball talent normally distributed?
January 18, 2009
A lot of the basic assumptions of sabermetrics are based around the notion that baseball talent is normally distributed, or at least approximates a normal distribution.
Briefly, the normal distribution describes the bell curve – what you tend to see with a normal distribution is a peak in the middle, at the average, with two tails that spread out evenly from there, in a very symmetric fashion.
The normal distribution is important to sabermetrics – it’s the basis on which regression to the mean works. Regression to the mean simply states that extreme observations tend to become less extreme in the long run. This is the basis of how sabermetricians estimate a player’s true talent level, and is how most projection systems work.
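To make the idea concrete, here is a minimal sketch of one common way regression to the mean gets written down: blend the observed rate with the league average, weighting the observation more heavily as the sample grows. This linear form is the textbook estimate when both talent and random noise are normally distributed, which is why the normality question matters. The function name, the `k` constant, and every number below are made up for illustration; they are not taken from any particular projection system.

```python
def regress_to_mean(observed, n, league_mean, k):
    """Shrink an observed rate toward the league mean.

    observed    -- the player's observed rate (e.g. wOBA)
    n           -- number of trials behind that rate (e.g. PA)
    league_mean -- the population average to regress toward
    k           -- hypothetical "ballast": how many league-average trials
                   to blend in. The value used below is an assumption.
    """
    weight = n / (n + k)  # share of the estimate that comes from the player
    return weight * observed + (1 - weight) * league_mean


# Made-up numbers: the same hot .450 rate over 50 PA and over 600 PA.
small_sample = regress_to_mean(observed=0.450, n=50, league_mean=0.330, k=200)
large_sample = regress_to_mean(observed=0.450, n=600, league_mean=0.330, k=200)

print(round(small_sample, 3))  # pulled most of the way back to the mean
print(round(large_sample, 3))  # stays much closer to what was observed
```

The extreme observation gets pulled back hard when the sample is small and only a little when the sample is large, which is the "less extreme in the long run" behavior described above.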
But how true is it?
What I’m using right now is a tool called the quantile-quantile (Q-Q) plot. I should warn that I’m not a professional statistician, nor do I really play one on the Internet. I’m a hobbyist, so take the following with a grain of salt; this is the best I can discern from the documentation for the GNU R statistics package and some light reading on the topic.
The Q-Q plot has a straight line on it that represents how the data would look if the distribution were normal, along with the points showing how the data is actually distributed. This is the Q-Q plot for RA, from pitchers with 20+ IP from 1998-2008:
I should probably take this moment to note that those cutoffs are wholly arbitrary.
But what we see, especially on the high end of the tail, is a lot of non-linearity in our graph – in other words, pitching does not seem to wholly fit the normal distribution. If we look at the center part of the band, though, we can see that for many – maybe even most – pitchers, the normal distribution is a “good-enough” fit.
The biggest issues seem to be with pitchers who have an RA of 6+, which to be honest we probably aren’t too concerned with. There is a more subtle – and yet more important – shift away from the line at pitchers with an RA below 2.
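For anyone who wants to see what goes into a plot like this, here is a rough sketch of how the points and the reference line of a normal Q-Q plot can be computed, using only the Python standard library. It is not the code behind the plots in this post (those came from R), and it runs on synthetic, deliberately right-skewed data rather than real pitching numbers. The helper names `qq_points` and `reference_line` are my own, and the line here is drawn from the sample mean and standard deviation, which is a simplification; R's `qqline` draws its line through the first and third quartiles instead.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1,
    # so the inverse CDF is always defined.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def reference_line(sample):
    """Slope and intercept of the line that normal data with this sample's
    mean and standard deviation would follow (a simplifying assumption)."""
    return stdev(sample), mean(sample)


# Synthetic, right-skewed data standing in for something like RA.
random.seed(0)
skewed = [random.lognormvariate(0, 0.5) for _ in range(1000)]

points = qq_points(skewed)
slope, intercept = reference_line(skewed)

# How far the most extreme observation sits from the reference line.
top_theoretical, top_observed = points[-1]
top_gap = top_observed - (slope * top_theoretical + intercept)
print(f"gap at the top of the plot: {top_gap:.3f}")
```

If the data were normal, the points would hug the line and gaps like `top_gap` would stay small across the whole range; a tail that bends away from the line is the kind of non-linearity described above.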
Now the same kind of Q-Q plot, but for the wOBA of hitters with 20+ PA from 1998-2008, pitchers excluded:
Again, in the middle the normal distribution seems to fit well, but at the extremes the assumption of normality looks suspect. The issues above a .400 wOBA seem to be with performances that are very unsustainable to begin with, so I don’t know if they’re important. The sub-.300 wOBA issues, again – I don’t know if they come into play often enough to be a concern.
I’m sorta musing aloud in this post, and would love to get your thoughts on the issue. (This goes double for all of you who have a better stats background than I do!)