# Is baseball talent normally distributed?

January 18, 2009

A lot of the basic assumptions of sabermetrics are based around the notion that baseball talent is normally distributed, or at least approximates a normal distribution.

Briefly, the normal distribution describes the bell curve – what you tend to see with a normal distribution is a peak in the middle, at the average, with two tails that spread out evenly from there, in a very symmetric fashion.

The normal distribution is important to sabermetrics – it’s the basis on which regression to the mean works. Regression to the mean simply states that extreme observations tend to become less extreme in the long run. This is the basis of how sabermetricians estimate a player’s true talent level, and is how most projection systems work.
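To make that concrete, here’s a quick simulation sketch in R (all of the numbers are made up for illustration, not taken from any real data): each player gets a true talent drawn from a normal distribution, plus fresh random noise in each of two seasons.

```r
# Illustrative simulation of regression to the mean (numbers hypothetical).
set.seed(42)
n      <- 10000
talent <- rnorm(n, mean = 0, sd = 1)  # each player's true talent
year1  <- talent + rnorm(n, sd = 1)   # observed performance, season one
year2  <- talent + rnorm(n, sd = 1)   # observed performance, season two

# Players who looked extreme in season one (the top 5% of observations)...
extreme <- year1 > quantile(year1, 0.95)

# ...come back toward the mean in season two, because part of their
# season-one number was luck that doesn't repeat.
mean(year1[extreme])  # well above average
mean(year2[extreme])  # still above average, but noticeably less so
```

In this setup half the variance of an observed season is noise, which is why the season-two average for the extreme group sits roughly halfway back toward the mean.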

But how true is it?

What I’m using right now is a tool called the quantile-quantile (Q-Q) plot. I should warn that I’m not a professional statistician, nor do I really play one on the Internet. I’m a hobbyist, so take the following with a grain of salt; this is the best I can discern from the documentation for the GNU R statistics package and some light reading on the topic.

But the Q-Q plot has a straight line on it that’s supposed to represent how the data would look if the distribution were normal, as well as a graph of how the data is actually distributed. This is the Q-Q plot for RA, for pitchers with 20+ IP from 1998-2008:

I should probably take this moment to note that those cutoffs are wholly arbitrary.
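(For anyone who wants to roll their own, the plot itself is only a couple of lines of R. The data below is a made-up stand-in for the real RA sample:)

```r
# Hypothetical stand-in data; substitute your own vector of RA values.
set.seed(1)
ra <- rnorm(2000, mean = 4.5, sd = 1.0)

# qqnorm() pairs the sample quantiles with theoretical normal quantiles;
# qqnorm(ra); qqline(ra) would draw the plot plus the straight reference
# line described above. Calling it with plot.it = FALSE just returns the
# coordinates instead of drawing, which is handy for checking fit.
qq <- qqnorm(ra, plot.it = FALSE)

# For genuinely normal data the points hug a straight line:
cor(qq$x, qq$y)  # very close to 1
```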

But what we see, especially at the high end of the tail, is a large amount of non-linearity in our graph – in other words, pitching does not seem to wholly fit the normal distribution. If we look at the center part of the band, though, we can see that for many – maybe even most – pitchers, the normal distribution is a “good-enough” fit.

The biggest issues seem to be with pitchers who have an RA of 6+, which to be honest we probably aren’t too concerned with. There is a more subtle – and yet more important – shift away from the line at pitchers with an RA below 2.

Now, the same, but for wOBA of hitters with 20+ PA from 1998-2008, pitchers excluded:

Again, in the middle the normal distribution seems to fit well, but at the extremes the assumptions of normality seem to be suspect. The issues above a .400 wOBA seem to be with performances that are very unsustainable to begin with, so I don’t know if they’re important. The sub-.300 wOBA issues, again – I don’t know if they come into play often enough to be a concern.

I’m sorta musing aloud in this post, and would love to get your thoughts on the issue. (This goes double for all of you who have a better stats background than I do!)

The normal assumption is the one that has us in the financial disaster we are in now (the tail effect on mortgage-backed securities is very bad). But of course it’s very clear to any observer that talent is not normally distributed… there is only a handful of extraordinary talents who can sustain performance year to year, and a plethora of average to below-average joes. After all, you aren’t talking about a sample of the general population; it’s a sample with a built-in bias, as it’s based on the top .01% of baseball talent.

From a baseball perspective, rather than looking at normal distributions along any statistical factor, one should be spending time on how well predictive models fit or don’t fit. For example, my thinking would be that if one were to take pitching stats for any individual, the fit of most models is pretty shoddy due to injury and the small margin of error between being a great and an awful pitcher.

What happens when you raise your cutoff (say, 500 PA for batters)? Players who only have 25 PA in their careers are probably more likely to have an extreme wOBA (either really high or really low) and would add to the wonkiness of the tails in the Q-Q plot.

The uptick on the bottom of that pitcher plot is nothing. Q-Q plots are awful at the extremes, and that’s not really all that far off from what I would expect. You want to see whether the middle holds, and it clearly does. The big jump after an RA of 6.00 probably reflects the fact that such a pitcher is the sixth starter/swingman/mop-up guy/last guy out of the pen. Teams have a tendency to shuttle several guys through this role over the course of the season, letting 3 or 4 different guys be awful in the role rather than just one. (Maybe you catch lightning in a bottle?) So there are a lot of those guys floating around in the sample. I wonder what would happen to that graph if you limited it to the guys who were on an Opening Day roster.

On the hitter piece, you’re right that the deviation at the bottom isn’t an issue. Teams don’t bring guys up if they think they’re not going to hit at least a minimally decent wOBA. The uptick at the end is a little more interesting. But I think you’ve got the correct interpretation. When a guy is playing way over his head, you just let him go. When a guy is having horrible luck, he gets benched/traded/released/sent to the minors.

Bottom line: when you look at the data with an understanding of how baseball roster decisions are made, talent/performance does appear to be pretty normally distributed.

However, when we’re looking at regression to the mean for the guys who had career years (Brady Anderson’s 50 HR), there might need to be a slightly stronger regression built into the projection system.

Mr. Cutter (if that IS your real name), I think it’s a bit misleading to say ‘Q-Q plots are awful at the extremes’. I think Q-Q plots are spot on, top to bottom. The problem is that it’s rare for a real-life dataset to strictly follow a normal distribution.

One thing you could do, Colin, is run a test for normality (e.g. the Kolmogorov-Smirnov or Shapiro-Wilk tests) on the dataset. However, based on your graphs I think it’s safe to say the data wouldn’t pass the test. I’m not sure if the dotted lines indicate a confidence interval (most likely) or a prediction interval, but either way I’d have to say the data is significantly non-normal. And, to get back to my previous post, I think it might be useful to use a higher cutoff for your dataset. You’re probably not interested in predicting the wOBA of a player whose career will consist of 25 PA. If you raised the cutoff to 500 PA (or even better: 1,500 PA), you could assume that (most of) the players in the dataset actually have MLB-level talent. Whether or not the talent of this subset is normally distributed might be interesting to know. Just a thought, though.
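For what it’s worth, both suggestions only take a few lines of R to sketch. The data frame and column names below are hypothetical stand-ins – the real dataset would come from wherever the actual numbers live:

```r
# Hypothetical hitter seasons; PA and wOBA are made-up column names.
set.seed(7)
hitters <- data.frame(PA   = sample(20:700, 4000, replace = TRUE),
                      wOBA = rnorm(4000, mean = 0.330, sd = 0.035))

# Raise the cutoff from 20 PA to 500 PA before testing:
regulars <- subset(hitters, PA >= 500)

# Shapiro-Wilk test of normality (accepts 3 to 5,000 observations):
sw <- shapiro.test(regulars$wOBA)
sw$p.value  # a tiny p-value would mean significantly non-normal
```

On real data, running the test on `regulars` versus the full `hitters` frame would show directly whether the low-PA players are what’s fattening the tails.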

P.S. This site is pretty freakin awesome. Keep up the good work!

P.P.S. My previous posts were under the assumption (yes, the posts make assumptions) that each datapoint represents a player’s career wOBA, not individual seasons. But either way, I think you know what I’m getting at.

All great suggestions, some easier than others to implement.

That’s single-season wOBA, though – sorry for the confusion.

As for Kolmogorov-Smirnov or Shapiro-Wilk, I’ll see what I can do. The version of the S-W test I have only takes 5,000 entries and doesn’t seem to like my dataset for reasons I haven’t discerned. The K-S test, on the other hand, has a lot of parameters I don’t understand. Having neither a strong stats background nor a lot of experience, I often find the documentation for GNU R… insufficient. I’ve tried PSPP and a few others as well, and found them wanting. Oh well, you get what you pay for, don’t you?

I think that baseball talent, like many mental and physical skills, is (more or less) normally distributed across the entire world population. But we only observe the very best baseball players in the world – the ones good enough to make it to the majors – so what we see is the very end of the right tail. Hence there is an abundance of replacement-level players, fewer 1-win players, even fewer 2-win players… and far fewer elite Hall of Fame-caliber players.

R is a tough thing to learn without direction from a class or someone who really knows how to use it. The program can do just about anything you want; you just have to know what to type (one wrong character and it’s a mess). I can say that the help files are only helpful if you’re familiar with the code, but not so much otherwise. If your dataset has NA values or other problems, they need to be labeled as such – R treats NA as a missing value, and many functions can drop them with na.rm=TRUE – otherwise certain tests won’t work (you may need to tell R to exclude rows with missing values, etc.).
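A tiny example of the NA behavior, since it bites everyone at least once:

```r
# A toy vector with one missing value:
x <- c(2.8, 3.4, NA, 4.1, 5.0)

mean(x)                # NA – a missing value poisons most computations
mean(x, na.rm = TRUE)  # 3.825 – the NA is dropped first

# Many tests won't accept NAs at all; na.omit() strips them from a
# vector (or drops incomplete rows from a data frame):
clean <- na.omit(x)
length(clean)  # 4
```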

I’m not sure why there is a 5000 limit. It could be a problem with the fact that the display can’t handle an entire dataset like Excel or SPSS. Is it a ‘maxprint’ error or is it specifically with the test you’re trying to run?
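For what it’s worth, the 5,000 cap is baked into R’s shapiro.test() itself – it accepts between 3 and 5,000 observations. One crude workaround, with the obvious caveat that it throws away data, is to test a random subsample:

```r
# Stand-in for a dataset larger than shapiro.test()'s 5,000-value limit:
set.seed(10)
big <- rnorm(20000)

# shapiro.test() errors out above 5,000 values, so subsample first:
res <- shapiro.test(sample(big, 5000))
res$statistic  # W, near 1 for normal-looking data
res$p.value
```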

I also disagree with the dislike of QQ Plots. They can be very helpful, especially with large datasets. While they can be subjective, you can definitely see some skew or long tails in your plots. A simple histogram is helpful as well.

I remembered that I had gretl installed, and I’ve been meaning to give it a whirl anyway. It has a normality test that isn’t absurdly hard to use.

For RA:

Shapiro-Wilk W = 0.972357, with p-value 1.47346e-030

For wOBA:

Shapiro-Wilk W = 0.986816, with p-value 3.35725e-046

If I’m reading those right, neither of those is a normal distribution – the p-values are vanishingly small. I suppose the next question is what sorts of distributions they are…
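One cheap way to start poking at that question is the sample skewness, since the Q-Q plots hint at a long right tail. Base R doesn’t ship a skewness function (the e1071 package has one), so here’s a hand-rolled moment-based version as a sketch:

```r
# Moment-based sample skewness; positive values mean a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2)^1.5)
}

set.seed(3)
s_norm <- skewness(rnorm(10000))  # symmetric data: skewness near 0
s_exp  <- skewness(rexp(10000))   # right-skewed data: clearly positive
```

Running that on the actual RA vector would show whether the heavy right tail in the Q-Q plot turns up as positive skew.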

The K-S test in R 2.8.0 is:

ks.test(x, 'pnorm')

where x is the name of your dataset and 'pnorm' tells R to test x against a normal distribution (it’s in quotes because you’re passing the name of the normal CDF function). Note that with no further arguments this compares x to a standard normal – mean 0, sd 1 – so for raw RA or wOBA values you’d want to supply the fitted parameters, e.g. ks.test(x, 'pnorm', mean(x), sd(x)). You could also put another dataset (say, y) in place of 'pnorm' to see if x and y are from the same distribution (though you wouldn’t necessarily know what that distribution was).

To nitpick, talent in the major leagues is definitely not normally distributed. Not by a long shot.

It’s the extreme right tail of the global population’s baseball talent. Yes, the global distribution of baseball talent is bell shaped, and major leaguers are selected from the right tail of that distribution.

However, when pitchers from the right tail face batters from the right tail, the outcomes of those match-ups will be normally distributed. That’s what you’re measuring, not talent but outcomes of interactions from (very) non-normally distributed samples of talent.

It’s a technical distinction, but one worth noting.