Stats 202: Intraclass correlation OR Yet another DIPS paper
May 16, 2007 10 Comments
What part of a player’s performance is actual talent and what part is just luck? Any player’s performance has a little from column A and a little from column B. Sudden gusts of wind can turn fly ball outs into home runs and home runs back into fly ball outs, but in the end, we assume that it all evens out. Players’ stats generally (but not always) reflect their abilities, and those abilities remain fairly constant from year to year. Even the casual fan can, before the season starts, make up a list of who will be the top ten home run hitters in baseball in the coming season, and will probably be right on 7 or 8 of them. Even more, that same casual fan can probably give a pretty good estimation (perhaps we might say “in the ballpark”?) of how many home runs Player X will hit in the forthcoming year, assuming that he doesn’t get hurt.
Not all stats are stable like that. Pitching stats, in particular, tend to be a little more volatile from year to year. Those who dabble in Sabermetrics might be familiar with the wild variations in bullpen ERA that happen from year to year. Relievers are notorious for being a very fickle breed, with good stats one year and bad the next. In fact, it was this instability that led Vörös McCracken to take a look at the stability of certain pitching measures over time. His method was to look at different metrics in 1998 and 1999 (the years are irrelevant, that’s just the data set he used), and to see how strong a correlation there was between a pitcher’s performance in 1998 and in 1999. Those who were above the league average in ’98 should also be above average in ’99, and those below should be below, assuming that the statistic is measuring some actual skill (one that wouldn’t be affected by the simple passage of a year). If that doesn’t happen, we start to think that there is no relationship over time, and that the fluctuations have much more to do with luck.
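McCracken's year-to-year approach boils down to a plain Pearson correlation between paired seasons. Here's a minimal sketch, using made-up strikeout rates for ten hypothetical pitchers (the numbers are illustrative, not real data):

```python
import numpy as np

# Hypothetical strikeout rates (K per batter faced) for the same ten
# pitchers in two consecutive seasons -- invented for illustration.
k_rate_1998 = np.array([0.22, 0.15, 0.18, 0.25, 0.12, 0.20, 0.17, 0.28, 0.14, 0.19])
k_rate_1999 = np.array([0.21, 0.16, 0.17, 0.24, 0.13, 0.22, 0.15, 0.26, 0.16, 0.18])

# Year-to-year (bivariate Pearson) correlation: the covariance of the
# two seasons divided by the product of their standard deviations.
r = np.corrcoef(k_rate_1998, k_rate_1999)[0, 1]
print(round(r, 3))
```

A value near 1 says that pitchers who were high in one season were high in the next; a value near 0 says the two seasons are essentially unrelated.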
McCracken found that some measures had high correlations from year to year, and some did not. In particular, statistics like a pitcher’s strikeout rate, walk rate, and home runs-given-up rate were fairly stable. These are events in which the defense behind the pitcher isn’t involved in the outcome of the play. The problem was that when McCracken looked at events in which the ball is in play, using the statistic batting average on balls in play (BABIP), which measures how often a ball in play goes for a hit, he found relatively little correlation. Once the ball was in play, it was up to luck (and the seven guys behind him) to determine what would happen to the ball, and to his pitching line.
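For reference, BABIP can be computed from standard counting stats. A quick sketch (the pitcher line below is made up):

```python
# BABIP as commonly computed from counting stats: hits that weren't
# home runs, divided by balls put in play (at-bats minus strikeouts
# minus home runs, plus sacrifice flies).
def babip(h, hr, ab, so, sf):
    return (h - hr) / (ab - so - hr + sf)

# Hypothetical line: 180 hits allowed, 20 HR, 600 AB against,
# 150 strikeouts, 5 sacrifice flies.
print(round(babip(h=180, hr=20, ab=600, so=150, sf=5), 3))
```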
My goal in this tutorial is not to argue the merits of DIPS. That’s been done elsewhere by others (and perhaps in the future by me). Google Voros McCracken DIPS and you’ll find plenty of discussion on the topic. My goal, instead, is to use a better method than year-to-year correlation to test whether this holds. It’s not that year-to-year correlation is a horrible thing, just that intraclass correlation is better. Here’s the idea: Year-to-year correlations take into account two years’ worth of data. That’s nice, but we have so much more data available! Good old bivariate correlation is limited by the fact that it can only consider two variables at a time. It finds the covariance between the two variables and standardizes it by the product of their standard deviations. But, what if correlation could look at multiple years at once?
(A small disclaimer, for my fellow stat-heads, before the major nerdiness that now follows. There are several statistical measures that go by the name “intraclass correlation.” In what follows, I am specifically referring to the AR1 rho statistic found in variance decomposition/hierarchical/linear mixed models. If you didn’t understand what I just said, don’t worry.)
Suppose that I had three years of data: 2004, 2005, and 2006. Clearly, some of the same pitchers were active in all three years, although there were some who retired in 2005 (after retiring in 2004, then proceeded to retire again in 2006). The point is that baseball didn’t completely reset itself with all new players in any of those years… except 1994 to Spring Training 1995. Let’s take a pitcher who pitched in all three years. We can look to see how closely correlated performances were from 2004 to 2005. We can do the same from 2005 to 2006. We can even do it with those who pitched in both 2004 and 2006. If a player retired after 2005, we don’t need to worry about that. He will be represented in the 2004-2005 covariance term, but not in the 2005-2006 or 2004-2006 term, and the model is able to correct for that (the model is very robust against missing data).
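The missing-data point can be illustrated with a toy long-format table: a pitcher who retired simply lacks a row for the later season, and each pairwise covariance term uses whichever pitchers appear in both of the two seasons being compared. A sketch with invented numbers:

```python
import pandas as pd

# Hypothetical long-format data: one row per pitcher-season. Pitcher C
# retired after 2005, so he simply has no 2006 row.
rows = [
    ("A", 2004, 0.20), ("A", 2005, 0.22), ("A", 2006, 0.21),
    ("B", 2004, 0.15), ("B", 2005, 0.14), ("B", 2006, 0.16),
    ("C", 2004, 0.18), ("C", 2005, 0.19),                      # no 2006
    ("D", 2004, 0.25), ("D", 2005, 0.24), ("D", 2006, 0.26),
]
df = pd.DataFrame(rows, columns=["pitcher", "year", "k_rate"])
wide = df.pivot(index="pitcher", columns="year", values="k_rate")

# pandas computes each covariance entry pairwise, excluding NaNs, so
# every term is based on the pitchers who appear in both seasons.
print(wide.cov(min_periods=1))
```

The (2004, 2005) entry is based on all four pitchers, while the (2005, 2006) and (2004, 2006) entries quietly drop pitcher C; no special handling is required.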
Soon enough, we have what’s called a covariance matrix, a very specific type of one called an auto-regressive (lag 1) covariance matrix, which is abbreviated AR(1). I’m not a skilled enough mathematician to fully understand how this works (although I had someone explain it to me once, and it made sense then…), but in the same way that the covariance between two variables is standardized by their multiplied standard deviations for bivariate correlation, this matrix can be standardized and simplified down into something comparable. As I understand it, the model can also correct for the fact that pitchers who pitched in 2006 are two years older than they were in 2004, and might have seen some erosion (or growth) in their skill set, and that we might expect less (or more) of them.
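The AR(1) structure itself is compact: every season gets the same variance, and the correlation between two seasons decays geometrically with the number of years between them. A sketch of the implied covariance matrix (the sigma2 and rho values here are arbitrary):

```python
import numpy as np

def ar1_cov(sigma2, rho, n_years):
    """AR(1) covariance matrix: Cov(year_i, year_j) = sigma2 * rho**|i - j|.
    Adjacent seasons share correlation rho, seasons two apart rho**2, etc."""
    idx = np.arange(n_years)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

# With rho = 0.6, adjacent seasons correlate at 0.6 and seasons two
# years apart at 0.36 -- the longer the gap, the weaker the link.
print(ar1_cov(sigma2=1.0, rho=0.6, n_years=3))
```

That single rho parameter is the “AR1 rho” statistic mentioned in the disclaimer above.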
The end product of these numerical gymnastics is the intraclass correlation. Intraclass correlation (ICC) varies between 0 and 1, just like a real correlation (I don’t recall it ever being negative, but I’ve been wrong before), and gives an idea of how much of a performance is consistent within an individual over repeated observations (and how much is due to other factors, including luck). An ICC of zero means that the measure, whatever it happens to be, is not consistent over time at all, to the point where it could be considered entirely the product of other factors. An ICC of one means that performance is due entirely to factors endogenous to the individual. Needless to say, neither extreme ever happens.
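To make the 0-to-1 scale concrete, here is the simplest ANOVA-style ICC estimator (between-pitcher variance as a share of the total), run on invented data where each pitcher is very consistent across three seasons. This is a cruder cousin of the AR(1) mixed-model estimate this post uses, shown only to illustrate the interpretation:

```python
import numpy as np

# Each row is one hypothetical pitcher's rate stat over three seasons.
data = np.array([
    [0.20, 0.22, 0.21],
    [0.15, 0.14, 0.16],
    [0.25, 0.24, 0.26],
    [0.10, 0.12, 0.11],
])

# One-way ANOVA ICC(1): these pitchers differ a lot from each other
# (between) but barely vary across their own seasons (within), so the
# ICC comes out high -- the stat looks like a stable individual skill.
k = data.shape[1]                          # seasons per pitcher
msb = k * data.mean(axis=1).var(ddof=1)    # between-pitcher mean square
msw = data.var(axis=1, ddof=1).mean()      # within-pitcher mean square
icc = (msb - msw) / (msb + (k - 1) * msw)
print(round(icc, 3))
```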
The advantage of ICC over year-to-year correlations is simple: we could, in theory, expand our data set to include 20 years (although for methodological reasons, I wouldn’t recommend that). More data is always better. A few outliers in a dataset (pitchers who had “career” years?) can really mess around with a correlation value and make it look less significant than it truly is. With ICC, the danger of this happening has been lessened through multiple observations. The number that it spits out can be read in much the same way as a correlation. In fact, squaring an ICC gives much the same information as a regular r-squared value, in this case, what percentage of the variance is accounted for by factors specific to the individual.
In short, ICC is a good measure of whether a particular statistic represents a skill that’s consistent from year to year or something that’s more given to variation due to luck. For those interested in prediction models, ICC holds the promise of giving more accurate terms to use when looking at regression to the mean.
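One way that promise cashes out: treating the ICC as a reliability weight when regressing an observed rate toward the league mean. The function and numbers below are my own illustration, not anything from McCracken’s work:

```python
# Regression-to-the-mean sketch: use the ICC as a reliability weight,
# shrinking an observed rate toward the league average.
def regressed_estimate(observed, league_mean, icc):
    return icc * observed + (1.0 - icc) * league_mean

# With a low-ICC stat like BABIP (~0.18 in this post), an observed
# .250 regresses most of the way back toward a .300 league average;
# a high-ICC stat like strikeout rate would stay near its observed value.
print(round(regressed_estimate(0.250, 0.300, 0.18), 3))
```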
OK, OK, so we know what it does, can we see an example? Let’s look at the four stats that McCracken looked at in his original work: walk rate, strikeout rate, home run rate, and BABIP. I took all MLB pitchers from 1999-2005 in the Lahman database (I didn’t have my updated 2006 version with me on my data stick), and selected for those with at least 50 IP in each year. I didn’t adjust for park factors (although I probably should… it’s nearing the end of the quarter here in professor-land). I calculated ICCs for all four stats, and for comparison, calculated year-to-year correlations from 2004-2005. As always, when working with rate variables, it’s best to take the log of the odds ratio whenever running analyses with them.
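The log-odds transform mentioned above, in code (the BABIP values below are made up):

```python
import numpy as np

# Rates are bounded on (0, 1), so analyses behave better on the
# unbounded log-odds (logit) scale: log(p / (1 - p)).
def logit(rate):
    return np.log(rate / (1.0 - rate))

babip = np.array([0.280, 0.310, 0.295])
print(np.round(logit(babip), 3))
```

A rate of exactly .500 maps to 0; rates below .500 (like most BABIPs) come out negative.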
Results were largely consistent with McCracken’s original model. Walk rate had an ICC of .609 and strikeout rate checked in at .740. Year-to-year correlations were .609 and .754, respectively. For home run rates, the ICC was .364 (yty = .308), suggesting that there was less stability among HR rates than walk and strikeout rates. ICC was slightly higher for home runs than the year-to-year correlation, although again, these are unadjusted for park. BABIP’s ICC? (I just did an entire sentence out of acronyms… I think that’s crossing a line.) .181, which compares to a year-to-year correlation of .098. BABIP, which is the cornerstone of DIPS theory, looks a little more in control of the pitcher in the ICC model (an R-squared value of 3.2% compared to something a bit south of 1% in the year-to-year model). Still, it looks like Mr. McCracken’s assumptions are right on.
There’s nothing revolutionary here in terms of findings. What I would hope is that, in evaluating whether a metric is measuring a stable skill or mostly luck, Sabermetricians would use the AR(1) method. It’s actually the first step in another technique that I like to use called hierarchical linear modeling. Perhaps more on that some day.