Monkeying with Marcel
August 5, 2008
For those not familiar, the Marcel system is sabermetrician Tom Tango’s system for projecting a player’s statistics for the coming year. The idea is simple enough, and he’s been rather emphatic about that being the entire point. The steps in short:
- Take three years’ worth of prior data (if available)
- Regress to the mean (if you don’t know what that is, more on that in a minute)
- Weight the data, with the more recent data being weighted more heavily
- Apply an age adjustment (something I’ll skip for now)
- Let it rip
Tom’s stated (a few times) that this should be the basis for any good projection system, and he’s right, particularly around the issue of regressing prior years’ stats to the mean. A small demonstration, if you will.
Regression to the mean is simple to understand intuitively. Suppose that you have an amazing day today. You win the lottery and when you go to the drive-thru at Arby’s and order the chicken fingers, they mistakenly give you five instead of four pieces along with a medium instead of a small Jamocha shake. Now that’s the recipe for a good day. What’s tomorrow going to be like? It will be worse, if only for the fact that there’s nowhere else to go but down. If you had a really bad day, the opposite thing happens. Now, let’s say that I based my entire estimate of how happy you are on an average day on that one day when you won the lottery. Or on that one week? Really, if I want to get an idea of how good you feel on an average day, I need a bigger sample. But what if I only have 162 days?
Regression to the mean is a way to temper extreme observations, especially those drawn from limited time frames and/or from unreliable measures. On the first issue (limited time frames), the way to make a measure more reliable is to have a wider time sample. If Albert Pujols had a billion at-bats, I’d be a lot more comfortable in saying what his “true” talent is for hitting home runs than if I only watched thirty at-bats. At 30 AB, I’d have some idea, but not the kind of precision that I would want to bet on. Some measures need more observations than others to become reliable, but alas, sometimes we only get a few at-bats to watch. On the second issue, some measures are just unreliable by their nature, because they have much more to do with luck than any sort of skill. Think BABIP.
I’m a man who likes to look at the reliability of statistics. Nine months ago, I introduced the concept of split-half reliability and how it can be used to tell how reliable a stat is and when it becomes “reliable enough”. I did it by taking a sample of, say, 600 plate appearances and splitting them in half (even-numbered ones vs. odd-numbered ones). Then, I calculated whatever stat was interesting at the moment (K rate? 2B rate?) in the even-numbered plate appearances and in the odd-numbered plate appearances. I did this for everyone who had 600 PAs to work with, and compared one set of 300 PA against the other set of 300 PA. If a statistic is reliable at 300 PA, then we should see that we get roughly the same rate from the even-numbered PAs as we do from the odd-numbered PAs. The way to check for that is through the correlation between the two groups. The correlation that results is the split-half reliability of the stat at 300 PA. Why 300? Why not 299? Why not 301? Sure, the numbers aren’t going to change much from 299 to 300, but they will change. In fact, what’s to stop me from generating split-half reliabilities for a stat from 1 PA to 750? It’s just an engineering problem. I generated the appropriate numbers for BB rate and K rate. The one problem with generating these numbers is that it takes 24-36 hours of continuous computer processing (at least on my laptop) to generate one of those tables for a statistic. It’s doable, it just takes a while.
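For the curious, here is a rough sketch of the split-half idea in Python. It is not the code I actually ran; the players are simulated, and the spread of "true" walk talent (4% to 16%), the function names, and the seed are all made up for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def split_half_reliability(players):
    """players: one list per player of PA outcomes in order (1 = walk, 0 = no walk).

    Splits each player's PAs into odd-numbered and even-numbered halves,
    computes the rate in each half, and correlates the two sets of rates.
    """
    odd_rates, even_rates = [], []
    for outcomes in players:
        odd_half = outcomes[0::2]   # 1st, 3rd, 5th, ... PA
        even_half = outcomes[1::2]  # 2nd, 4th, 6th, ... PA
        odd_rates.append(sum(odd_half) / len(odd_half))
        even_rates.append(sum(even_half) / len(even_half))
    return pearson(odd_rates, even_rates)


def simulate_players(n_players, n_pa, seed=0):
    """Hypothetical data: each player gets a made-up true walk rate."""
    rng = random.Random(seed)
    players = []
    for _ in range(n_players):
        true_rate = rng.uniform(0.04, 0.16)  # assumed talent spread, not real data
        players.append([1 if rng.random() < true_rate else 0 for _ in range(n_pa)])
    return players


# Reliability with 50 PA per half vs. 300 PA per half.
r_small = split_half_reliability(simulate_players(300, 100))
r_large = split_half_reliability(simulate_players(300, 600))
print(round(r_small, 3), round(r_large, 3))
```

With simulated players like these, the correlation at 300 PA per half should come out well above the one at 50 PA per half, which is the whole point: more observations, more reliability. Repeating this for every sample size from 1 to 750 is what eats up the processing time.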
Once we have the reliability for a measure given X observations, we also know how much to regress the measure to the mean. This is important because for some players we have a sample of 700 PAs to work with, and for others we have 100. The split-half reliability coefficient is “r”, and the formula for regressing something toward the mean is:
r * player performance + (1-r) * league average.
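A quick worked example of that formula, as a sketch. The walk rates and the two reliability values below are hypothetical numbers picked to show the effect, not values from my tables.

```python
def regress_to_mean(player_rate, league_rate, r):
    """Apply r * player performance + (1 - r) * league average.

    r is the split-half reliability for the player's sample size (0 to 1).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be between 0 and 1")
    return r * player_rate + (1.0 - r) * league_rate


# Hypothetical: a player walked in 15% of his PAs, league average is 9%.
big_sample = regress_to_mean(0.15, 0.09, r=0.6)    # lots of PAs, reliable
small_sample = regress_to_mean(0.15, 0.09, r=0.2)  # few PAs, unreliable

print(round(big_sample, 3))    # 0.126
print(round(small_sample, 3))  # 0.102
```

The same observed 15% gets pulled most of the way back to the league average when the sample is small, and only part of the way when the sample is large.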
I started by looking at batters and I found everyone’s actual BB and K rate for 1999-2007, and then their regressed BB and K rate for those years using the split-half coefficients that I had just generated. Then, I lined everyone up from 2002-2007, with their three prior years’ worth of data (so, in 2002, back to 1999). I set up a regression equation to predict the “current” year’s BB rate, using the previous actual (non-regressed) BB rates from the three prior years. I limited the data to those players who got more than 250 PA in the “current” year. The actual rates did a pretty good job, and gave a formula of .416 * BB 1ya + .248 * BB 2ya + .148 * BB 3ya + .016, where “1ya” means one year ago. The regression had an R-squared of .590. Again, not bad. (There is the problem that those coefficients add up to only .812, not 1.0 or anywhere near it. Trust me, that’s a problem.)
The regressed rates did a better job of predicting, with a best fit line of .545 * regBB 1ya + .264 * regBB 2ya + .215 * regBB 3ya – .002. R-squared: .614. Further, the standard error of the estimate was smaller in the regressed model (.0196 vs. .0201). The regressed predictors did a significantly better job.
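To see what the two walk-rate equations do in practice, here is a small sketch that plugs numbers into them. The coefficients are the ones quoted above; the player (a flat 10% walk rate in all three prior years) is hypothetical, and `project` is just a helper name for this example.

```python
# (coefficient on 1ya, coefficient on 2ya, coefficient on 3ya, intercept)
RAW_BB = (0.416, 0.248, 0.148, 0.016)    # fit on actual BB rates
REG_BB = (0.545, 0.264, 0.215, -0.002)   # fit on regressed BB rates


def project(coefs, rate_1ya, rate_2ya, rate_3ya):
    """Evaluate a fitted line on the three prior years' rates."""
    b1, b2, b3, intercept = coefs
    return b1 * rate_1ya + b2 * rate_2ya + b3 * rate_3ya + intercept


# Hypothetical player: 10% walk rate in each of the three prior years.
raw_projection = project(RAW_BB, 0.10, 0.10, 0.10)
reg_projection = project(REG_BB, 0.10, 0.10, 0.10)

print(round(raw_projection, 4))  # 0.0972
print(round(reg_projection, 4))  # 0.1004

# How close do the year coefficients come to summing to 1?
print(round(sum(RAW_BB[:3]), 3))  # 0.812
print(round(sum(REG_BB[:3]), 3))  # 1.024
```

The year coefficients in the regressed equation sum to roughly 1, so they read much more naturally as weights on the three prior years than the raw ones do.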
If there’s one piece of the Marcel system that these data do call into question, it’s the weights which are placed on the previous years’ data. The Marcel system uses a 5/4/3 weighting system, with the most recent year being weighted with a five (so if I’m predicting 2009, the most recent year would be 2008), the second previous year gets a 4, and the third previous year gets a 3. In this case, with walks, it looks like about 53% of the weight in this equation is on the most recent year, with 26% on the second, and 21% on the third. Given a 12-point system, that suggests (with some rounding) a weighting of about 6.5/3/2.5 is most appropriate for predicting walk rate.
But let’s see if that holds up with strikeout rates. Same setup. Again, the regressed predictors did a better job than the actual non-regressed predictors. (R-squared: .735 vs. .694) The equation was .678 * regK 1ya + .186 * regK 2ya + .166 * regK 3ya – .008. Projecting that out to the twelve-point weighting system, that’s roughly 8/2/2.
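The arithmetic behind those twelve-point weights is simple enough to sketch. The coefficients are the ones from the regressed BB and K equations above; `to_point_scale` is just a helper name made up for this example.

```python
def to_point_scale(coefs, total_points=12):
    """Rescale year coefficients so they sum to total_points (Marcel's 5+4+3 = 12)."""
    total = sum(coefs)
    return [total_points * c / total for c in coefs]


bb_weights = to_point_scale([0.545, 0.264, 0.215])  # regressed BB equation
k_weights = to_point_scale([0.678, 0.186, 0.166])   # regressed K equation

print([round(w, 2) for w in bb_weights])  # [6.39, 3.09, 2.52]
print([round(w, 2) for w in k_weights])   # [7.9, 2.17, 1.93]
```

Rounded off, that is the 6.5/3/2.5 for walks and 8/2/2 for strikeouts, against Marcel's 5/4/3 for everything.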
Now, this is a very raw system. We do have other pieces of data, and there may be other numbers that can be used to fine-tune the predictions, including age (which the full Marcel system incorporates). But there are two lessons to be learned here. One: regressing predictors to the mean is absolutely essential, both logically and in terms of the performance of the system. Two: while past performance is a good predictor of the future, different skills should be weighted and regressed in different ways. I’m not privy to how some of the other systems’ algorithms work, but hopefully they’ve already incorporated the need to have a specific weighting system for each skill.