# Stats 202: Intraclass correlation OR Yet another DIPS paper

May 16, 2007

What part of a player’s performance is actual talent and what part is just luck? Any player’s performance has a little from column A and a little from column B. Sudden gusts of wind can turn fly ball outs into home runs and home runs back into fly ball outs, but in the end, we assume that it all evens out. Players’ stats generally (but not always) reflect their abilities, and those abilities remain fairly constant from year to year. Even the casual fan can, before the season starts, make up a list of who will be the top ten home run hitters in baseball in the coming season, and will probably be right on 7 or 8 of them. Even more, that same casual fan can probably give a pretty good estimation (perhaps we might say “in the ballpark”?) of how many home runs Player X will hit in the forthcoming year, assuming that he doesn’t get hurt.

Not all stats are stable like that. Pitching stats, in particular, tend to be a little more volatile from year to year. Those who dabble in Sabermetrics might be familiar with the wild variations in bullpen ERA that happen from year to year. Relievers are notorious for being a very fickle breed, with good stats one year and bad the next. In fact, it was this instability that led Voros McCracken to take a look at the stability of certain pitching measures over time. His method was to look at different metrics in 1998 and 1999 (the years are irrelevant, that’s just the data set he used), and to see how strong a correlation there was between a pitcher’s performance in 1998 and in 1999. Those who were above the league average in ’98 should also be above average in ’99, and those below should be below, assuming that the statistic is measuring some actual skill (one that wouldn’t be affected by the simple passage of a year). If that doesn’t happen, we start to think that there is no relationship over time, and that the fluctuations have much more to do with luck.

McCracken found that some measures had high correlations from year to year, and some did not. In particular, statistics like a pitcher’s strikeout rate, walk rate, and home-runs-given-up rate were fairly stable. These are events in which the defense behind the pitcher isn’t involved in the outcome of the play. The problem was that when McCracken looked at events in which the ball is in play, using batting average on balls in play (BABIP), which measures the percentage of balls in play that go for hits, he found relatively little correlation. Once the ball was in play, it was up to luck (and the seven fielders behind him) to determine what happened to the ball, and to his pitching line.

My goal in this tutorial is not to argue the merits of DIPS. That’s been done elsewhere by others (and perhaps in the future by me). Google Voros McCracken DIPS and you’ll find plenty of discussion on the topic. My goal, instead, is to use a better method than year-to-year correlation to test whether these findings hold. It’s not that year-to-year correlation is a horrible thing, just that intraclass correlation is better. Here’s the idea: year-to-year correlations take into account two years’ worth of data. That’s nice, but we have so much more data available! Good old bivariate correlation is limited by the fact that it can only consider two seasons at a time. It finds the covariance shared between the two variables and standardizes it by the product of their standard deviations. But what if correlation could look at multiple years at once?
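Since the whole argument rests on this bivariate correlation, here is a minimal sketch of the computation. The walk rates below are made-up illustration values, not real Lahman data:

```python
# Year-to-year correlation, computed by hand for a toy sample.
import math

def pearson_r(xs, ys):
    """Covariance of the two series divided by the product of their
    standard deviations -- 'good old bivariate correlation'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

bb_1998 = [0.08, 0.10, 0.06, 0.12, 0.09]  # hypothetical walk rates
bb_1999 = [0.09, 0.11, 0.07, 0.10, 0.08]
print(round(pearson_r(bb_1998, bb_1999), 3))
```

The limitation is visible in the signature: the function compares exactly two seasons, and any third season of data has nowhere to go.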

(A small disclaimer for my fellow stat-heads before the major nerdiness that follows. There are several statistical measures that go by the name “intraclass correlation.” In what follows, I am specifically referring to the AR(1) rho statistic found in variance decomposition/hierarchical/linear mixed models. If you didn’t understand what I just said, don’t worry.)

Suppose that I had three years of data: 2004, 2005, and 2006. Clearly, some of the same pitchers were active in all three years, although there were some who retired in 2005 (after retiring in 2004, then proceeded to retire again in 2006). The point is that baseball didn’t completely reset itself with all new players in any of those years… except 1994 to Spring Training 1995. Let’s take a pitcher who pitched in all three years. We can look to see how closely correlated performances were from 2004 to 2005. We can do the same from 2005 to 2006. We can even do it for those who pitched in both 2004 and 2006. If a player retired after 2005, we don’t need to worry about that. He will be represented in the 2004-2005 covariance term, but not in the 2005-2006 or 2004-2006 terms, and the model is able to correct for that (the model is very robust against missing data).

Soon enough, we have what’s called a covariance matrix, a very specific type called a first-order autoregressive covariance matrix, abbreviated AR(1). I’m not a skilled enough mathematician to fully understand how this works (although I had someone explain it to me once, and it made sense then…), but in the same way that the covariance between two variables is standardized by the product of their standard deviations for bivariate correlation, this matrix can be standardized and simplified down into something comparable. As I understand it, the model can also correct for the fact that pitchers who pitched in 2006 are two years older than they were in 2004, and might have seen some erosion (or growth) in their skill set, and that we might expect less (or more) of them.
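To make the structure concrete, here is a small sketch of what an AR(1) covariance matrix looks like for three seasons. The variance (sigma2) and correlation (rho) values are arbitrary illustrations, not fitted estimates:

```python
# Under AR(1), Cov(year_i, year_j) = sigma2 * rho ** |i - j|, so seasons
# two years apart are assumed less correlated than adjacent seasons.

def ar1_cov(sigma2, rho, n_years):
    return [[sigma2 * rho ** abs(i - j) for j in range(n_years)]
            for i in range(n_years)]

cov = ar1_cov(sigma2=1.0, rho=0.6, n_years=3)
for row in cov:
    print([round(v, 3) for v in row])
```

Dividing each entry by the variance "standardizes" the matrix into correlations: 1 on the diagonal and powers of rho off it. That single rho parameter is the intraclass correlation the article reports.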

The end product of these numerical gymnastics is the intraclass correlation. Intraclass correlation (ICC) varies between 0 and 1, just like a real correlation (I don’t recall it ever being negative, but I’ve been wrong before), and gives an idea of how much of a performance is consistent within an individual over repeated observations (and how much is due to other factors, including luck). An ICC of zero means that the measure, whatever it happens to be, is not consistent over time at all, to the point where it could be considered entirely the product of other factors. An ICC of one means that performance is due entirely to factors endogenous to the individual. Needless to say, neither extreme ever happens.
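One way to see what the ICC is measuring is a toy simulation: give each pitcher a fixed “true talent” plus independent season-to-season luck, and the ICC is the share of the total variance that comes from talent. All the parameters here (pitcher counts, standard deviations) are arbitrary illustration values:

```python
# Simulate pitchers whose observed seasons = true talent + luck, then
# recover the ICC from the data with simple moment estimates.
import random

random.seed(7)
TALENT_SD, NOISE_SD = 2.0, 1.0   # hypothetical spread of skill vs. luck
true_icc = TALENT_SD ** 2 / (TALENT_SD ** 2 + NOISE_SD ** 2)  # = 0.8

n, k = 5000, 4  # pitchers, seasons each
talents = [random.gauss(0, TALENT_SD) for _ in range(n)]
obs = [[t + random.gauss(0, NOISE_SD) for _ in range(k)] for t in talents]

# Within-pitcher variance estimates the luck; the variance of all
# observations mixes talent and luck together.
within = sum(sum((x - sum(s) / k) ** 2 for x in s) / (k - 1)
             for s in obs) / n
all_obs = [x for s in obs for x in s]
mu = sum(all_obs) / len(all_obs)
total = sum((x - mu) ** 2 for x in all_obs) / (len(all_obs) - 1)

icc_hat = (total - within) / total  # talent share of total variance
print(round(true_icc, 2), round(icc_hat, 2))
```

An ICC near 1 means the within-pitcher scatter is small relative to the spread of talent; near 0, the seasons of a single pitcher look no more alike than seasons of different pitchers.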

The advantage of ICC over year-to-year correlations is simple: we could, in theory, expand our data set to include 20 years (although for methodological reasons, I wouldn’t recommend that). More data is always better. A few outliers in a dataset (pitchers who had “career” years?) can really mess around with a correlation value and make it look less significant than it truly is. With ICC, the danger of this happening has been lessened through multiple observations. The number that it spits out can be read in much the same way as a correlation. In fact, squaring an ICC gives much the same information as a regular r-squared value, in this case, what percentage of the variance is accounted for by factors specific to the individual.

In short, ICC is a good measure of whether a particular statistic represents a skill that’s consistent from year to year or something that’s more given to variation due to luck. For those interested in prediction models, ICC holds the promise of giving more accurate terms to use when looking at regression to the mean.

OK, OK, so we know what it does, can we see an example? Let’s look at the four stats that McCracken looked at in his original work: walk rate, strikeout rate, home run rate, and BABIP. I took all MLB pitchers from 1999-2005 in the Lahman database (I didn’t have my updated 2006 version with me on my data stick), and selected for those with at least 50 IP in each year. I didn’t adjust for park factors (although I probably should… it’s nearing the end of the quarter here in professor-land). I calculated ICCs for all four stats, and for comparison, calculated year-to-year correlations from 2004-2005. As always, when working with rate variables, it’s best to take the log of the odds ratio whenever running analyses with them.
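For concreteness, the rate transformation mentioned above looks like this; the .300 rate is just an example figure:

```python
# Log-odds transform for a rate stat, and its inverse.
import math

def log_odds(rate):
    """Log of the odds ratio p / (1 - p); stretches rates near 0 or 1
    so they behave like an unbounded, roughly normal variable."""
    return math.log(rate / (1.0 - rate))

def inverse_log_odds(z):
    """Map a log-odds value back to a rate between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

z = log_odds(0.300)
print(round(z, 4))                    # about -0.8473
print(round(inverse_log_odds(z), 3))  # recovers the original rate
```

The point of the transform is that raw rates are squeezed between 0 and 1, which distorts variances; the log-odds scale removes the bounds before the correlations are run.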

Results were largely consistent with McCracken’s original model. Walk rate had an ICC of .609 and strikeout rate checked in at .740. Year-to-year correlations were .609 and .754, respectively. For home run rates, the ICC was .364 (yty = .308), suggesting that there was less stability among HR rates than walk and strikeout rates. ICC was slightly higher for home runs than the year-to-year correlation, although again, these are unadjusted for park. BABIP’s ICC? (I just did an entire sentence out of acronyms… I think that’s crossing a line.) .181, which compares to a year-to-year correlation of .098. BABIP, the cornerstone of DIPS theory, looks *a little* more in control of the pitcher in the ICC model (an R-squared value of 3.2% compared to something a bit south of 1% in the year-to-year model). Still, it looks like Mr. McCracken’s assumptions are right on.

There’s nothing revolutionary here in terms of findings. What I would hope is that in evaluating whether a metric is measuring a stable skill or luck, Sabermetricians would use the AR(1) method. It’s actually the first step in another technique that I like to use called hierarchical linear modeling. Perhaps more on that some day.

A correlation of .181 isn’t that bad. There are those who will see that (especially if this is posted at BTF) and come to the conclusion that Voros was dead wrong. Kind of irrelevant, but for a lot of people Voros being right or wrong is more important than what this actually means.

For the other stats, y-t-y is pretty close to the ICC. Any idea why ICC is so much higher for batting average?

Year-to-year correlations on BABIP:

1999-2000 .095

2000-2001 .281

2001-2002 .130

2002-2003 .118

2003-2004 .204

2004-2005 .098

To compare, the analogous figures for strikeout rate are .732, .714, .744, .777, .737, .754. Less variability there.

It looks like the correlations on BABIP are a little unstable, which the ICC corrects for through multiple observations. .181 is still only a little more than 3% in R-squared, so while Voros might have said it was a little bit less, 3% is still very little control over BABIP.

Why the hate for Voros? He came up with a theory, it fits the data rather well. Show me data that contradict him and we’ll chat.

Good stuff PC. Does anyone know whether/how you can do this in R or SPSS?

Is AR(1) the same as a normal y-t-y correlation if you combine all the year pairs in the ICC, e.g., 2004 & 2005, 2005 & 2006, 2004 & 2006?

John, on SPSS, you have to have one observation (generally, one player-season) per line. The Lahman database is already set up this way. I think you have to have a fairly recent version of SPSS. I have 13 on my machine (I did some beta testing for them) but I don’t think I can do it on 11.5, which I have on one of my work computers. I use SPSS for all of my analyses. I actually have no idea how Excel works!

Go to Analyze > Mixed Models > Linear. At the very bottom of the first screen, choose an AR(1) covariance matrix. Player ID goes in the top part (subjects), year goes in the bottom (the repeated measure). At the next screen, put whatever variable you want in as the dependent variable. AR(1) rho will be generated automatically. You can also run additional parameters (fixed and random effects).

As best as I understand AR(1), it’s much more than a simple average of the y-t-y correlations. It takes into account trend lines at the individual level (hence it is auto-regressive), although I’m not sure how mathematically. If someone else with more experience in matrix algebra is out there, please bail me out of this!

I see, it’s a little higher than the simple average of the y-t-y correlations, but that makes sense.

“Why the hate for Voros?”

I have no hate for him, in fact I’ve defended his work many times on BTF, but if you haven’t read the DIPS threads in the archives there, you are in for a treat. It really touches a nerve with some of the posters there.

“Show me data that contradict him and we’ll chat.”

Well, DIPS assumes zero y-t-y correlation for BABIP. It’s less than for other pitching components, but certainly not zero. DIPS regresses K, W, and HR 0% and BABIP 100%. It’s an oversimplification, but it does give you reasonable results. For correlation, home runs are right in the middle between what is considered meaningful and what isn’t. You could probably regress HR 100%, just look at K and W, and you’d have a simple model that’s as effective as DIPS.
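A quick sketch of what that regression-to-the-mean scheme means in code; the rates, league averages, and reliability weights below are made-up illustrations, not fitted values:

```python
# "Regressing a stat X% to the mean" shrinks an observed rate toward
# the league average, weighted by how much of it we believe is skill.

def regress(observed, league_avg, reliability):
    """reliability = 0 regresses 100% (pure luck, per DIPS for BABIP);
    reliability = 1 regresses 0% (pure skill, per DIPS for K/W/HR)."""
    return league_avg + reliability * (observed - league_avg)

# DIPS-style projection: trust the K rate fully, the BABIP not at all.
print(round(regress(0.250, 0.170, 1.0), 3))  # K rate kept as observed
print(round(regress(0.350, 0.300, 0.0), 3))  # BABIP pulled to league avg
```

A less all-or-nothing model would plug in intermediate reliabilities, which is exactly what the ICC estimates promise to supply.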

I agree with Sean. An R of 0.18 isn’t *that* bad in the baseball context. One standard deviation of BABIP in year 1 corresponds to 0.18 standard deviations in year 2. I know the R^2 is 3% but I think the more relevant metric here is R, not R^2.

Compared to other factors the effect is small but it certainly exists.

Sean, I wasn’t implying that you were a Voros-hater… just wondering out loud why anyone would argue against the conclusions of well-done research by yelling. Ah yes, yelling is easier than statistics. I’ll have to check the BTF archives.

In order to increase the r, you can increase the variance of your population (i.e., introduce bad pitchers). All of a sudden, your r goes from .18 to .25, without anything else changing.

K has a high correlation because of the huge spread in K rates to begin with.

There’s no such possibility in MLB for BABIP, since 75% of PA end with a BIP. Jeff Weaver’s .500 BABIP this year, even if true/real, couldn’t persist long enough for us to detect, since he’d never be allowed to keep pitching. A guy can K at half the league rate if he can also walk guys at half the league rate.

All the correlation shows is if you can see the signal in the noise, and does not tell you how real the signal is.

“All the correlation shows is if you can see the signal in the noise, and does not tell you how real the signal is.”

I agree with Tango, but would say it slightly differently. Correlation tells us the ratio of signal to noise, but doesn’t tell us how significant the signal is to baseball outcomes. For one thing, for BABIP the noise is greater than the other stats: SD for 750 BIP is about .017, while SD for K/PA on 1,000 PA is .011. More importantly, a proportionate change in BABIP has much more impact on runs allowed: a .270 pitcher will be much more successful than a .300 pitcher, but a similar 10% difference in K-rate is no big deal (the BABIP difference equals .7 runs/game, vs. .2 R/G for the K difference).
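Those noise figures are easy to check: the binomial standard deviation of an observed rate over n trials is sqrt(p * (1 - p) / n). A league BABIP around .300 and a K/PA around .15 are assumed here for illustration:

```python
# Binomial noise in an observed rate, matching the SDs quoted above.
import math

def binomial_sd(p, n):
    """Standard deviation of an observed rate: n trials, true rate p."""
    return math.sqrt(p * (1.0 - p) / n)

sd_babip = binomial_sd(0.300, 750)    # noise on 750 balls in play
sd_krate = binomial_sd(0.150, 1000)   # noise on 1,000 plate appearances
print(round(sd_babip, 3), round(sd_krate, 3))
```

BABIP carries more raw noise per season than strikeout rate, which is part of why its correlations are so unstable even if a real skill is hiding underneath.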

Do this thought experiment: suppose that the ICC for HBP/9 was 1.0. All signal, no noise. Would that make it more of “real skill” than K-rate or BB-rate? We still wouldn’t care, because it has a trivial impact on RA. Skills matter to the extent they help you win games. All that matters is the amount of variation in true talent, in terms of the impact on RA. If you look at it that way, you’ll find that the true talent variations in BABIP, while appearing small, are nearly as consequential as differences in the other 3 skills. For example, Clay Davenport looked at AAA pitchers who made the majors vs. those who didn’t, and the BABIP difference between the two groups was roughly comparable to the other 3 metrics in terms of RA.

We should stop talking about correlation as telling us how “real” a skill is. The amount of noise is completely irrelevant, except in the sense that it makes it harder for us to figure out who has the skill. What matters is the size of the signal, translated into runs.

In R, I think you would use the nlme package to fit the mixed effects model, specifying an AR(1) autocorrelation structure.