# A second look at Pythagorean win estimators

March 28, 2007

Over the past few days on the SABR statistical analysis list-serv, there’s been a bit of chatter about the Pythagorean win estimator. My guess is that most of the folks reading this post are familiar with the formula, but for the benefit of those who may not be: the formula was created by Bill James in an attempt to model how many games a team “should” have won, based on how many runs it scored and how many it allowed. The original formula read: Winning % = RS^2 / (RS^2 + RA^2). Its eerie resemblance to the Pythagorean theorem in geometry (the one you hated in high school) gave the formula its name. Several modifications have been suggested in the intervening years, including changing the exponent to 1.82 (some say 1.81), and two “dynamic exponent” formulas (one by Clay Davenport, the other by David Smyth), each of which calculates the proper exponent and substitutes it in on a case-by-case basis.
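For concreteness, here is a small Python sketch of the estimators described above. The dynamic-exponent function shown is the Smyth ("Pythagenpat") form; the 0.287 constant and the sample run totals are my own illustrative choices, not figures from the analysis below.

```python
def pyth_wpct(rs, ra, exp=2.0):
    """Bill James's estimator: RS^x / (RS^x + RA^x)."""
    return rs ** exp / (rs ** exp + ra ** exp)

def smyth_exp(rs, ra, games, k=0.287):
    """Smyth-style dynamic exponent derived from the run environment."""
    return ((rs + ra) / games) ** k

rs, ra, g = 800, 700, 162  # made-up season totals
print(round(pyth_wpct(rs, ra), 3))                        # original exponent of 2
print(round(pyth_wpct(rs, ra, 1.82), 3))                  # fixed 1.82 exponent
print(round(pyth_wpct(rs, ra, smyth_exp(rs, ra, g)), 3))  # dynamic exponent
```

The three projections differ only slightly for a typical team; as discussed below, the differences matter most at the extremes.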

Before coming on board here at MVN, I had meditated briefly on these formulae and their merits relative to each other, with the Smyth formula coming out the winner, if only by a tiny margin. In evaluating any estimator, there are two important questions to answer: how closely does it predict the observed values (in this case, the teams’ actual winning percentages), and are the mistakes (in statistics-speak, residuals) in some way biased? In my original post, I found that the residuals were essentially centered around zero (very good!) and that the standard deviation of the residuals for all four formulae was somewhere in the neighborhood of 4.3 wins. Additionally, the residuals all showed a minimal amount of skew.

There are a few more residual diagnostics to run to check for any additional biases in the estimators. For example, if the estimators over-estimate the winning percentages of good teams, but under-estimate the winning percentages of bad teams (or vice versa, for that matter), then there is a built-in bias to the estimator. Along with being accurate, no matter the team quality, an estimator should work no matter how many games were played in the season, and how many runs the team scored and/or gave up.

I used the Lahman database for this one, and selected out all teams who played at least 100 games. This gave me a database of 2370 team-seasons to work with. I calculated the projected winning percentages for the Pythagorean, Exp 1.82, Davenport, and Smyth formulae, and then subtracted each of them from the actual winning percentage to get the residual for each.
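The residual calculation itself is simple; here is a sketch of it in Python. The team rows below are made up for illustration and stand in for the actual Lahman query, which produced 2,370 qualifying team-seasons.

```python
# Hypothetical (team, games, wins, RS, RA) rows standing in for the Lahman query.
teams = [
    ("A", 162, 95, 820, 700),
    ("B", 162, 70, 650, 760),
    ("C", 98, 50, 450, 470),   # dropped by the 100-game filter
]

def residual(wins, games, rs, ra, exp=1.82):
    """Actual winning percentage minus the Exp 1.82 projection."""
    actual = wins / games
    projected = rs ** exp / (rs ** exp + ra ** exp)
    return actual - projected

qualified = [(t, residual(w, g, rs, ra)) for t, g, w, rs, ra in teams if g >= 100]
for t, r in qualified:
    print(t, round(r, 4))
```

A positive residual means the team won more than the formula projected; a negative residual, fewer.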

I calculated (well, OK, my computer calculated them) correlation coefficients between the residuals of each formula and the following variables: games played, runs scored per game, runs allowed per game, wins, and actual winning percentage. None of the formulae’s residuals were correlated with games played. There were small correlations between the original Pythagorean formula’s residuals and runs scored per game (.106) and runs allowed per game (-.071). No other such correlations were observed. Those correlations were statistically significant, although rather small in magnitude.
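The correlation check can be reproduced with a plain Pearson coefficient. The win totals and residuals below are toy values I made up to show the mechanics, not the actual results from the 2,370 seasons.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, computed directly from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

wins = [95, 88, 81, 74, 67]                     # toy team win totals
resid = [-0.010, -0.004, 0.001, 0.006, 0.011]   # toy residuals (actual - projected)
print(round(pearson(wins, resid), 3))
```

A negative coefficient here, as in the findings below, means better teams tend to have residuals below zero.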

The biggest finding in my analyses was the fact that the residuals from the Exp 1.82, Davenport, and Smyth formulae were all correlated with wins and winning percentage. The Exp 1.82 formula, likely the most-used and reported formula, showed correlation coefficients of -.346 and -.380, respectively. The Davenport (-.253 and -.269) and Smyth (-.256 and -.273) correlation coefficients were lower, although still notable. The original Pythagorean formula residuals had much lower correlations of -.095 and -.101. These findings suggest that the Exp 1.82, Davenport, and Smyth formulae all share a bias: better teams are more likely to have their estimated winning percentages come in below their actual winning percentages, while poor teams are more likely to have their estimates come in above.

If the previous sentence made your head spin, here it is in English with numbers made up on the spot for pure illustrative purposes: Let’s say that a team won 94 games in the year in question. The Exp 1.82, Davenport, and Smyth formulas are more likely to be wrong in the direction of saying that the team should have won fewer games (91). A poor team that won 61 games is more likely to have their projection be much higher (perhaps 65).

So what? Since these formulas became popular, the differences between the projections and the actual results have been taken as indicators of such things as manager ability. (A less-than-proper use of the formula, in my opinion, but it is the common application.) If a team wins more than its projection, the manager must be doing a good job, because he’s maximizing runs at the proper times to win games. If a team wins fewer than projected, the manager might be fired. If the formulas are biased, though, some of the credit and blame being passed along may be a statistical artifact. The bias built into the formula makes the manager of a last-place team look like he is underperforming, even as he answers to the GM for having just lost 101 games. On the other hand, the manager of a successful team is more likely to look like he is over-performing, and may get a nice contract extension and raise out of it. Managers of bad teams look even worse, and managers of good teams look even better.

It looks like the Pythagorean estimators need a little bit of tinkering. They don’t need to be thrown out. In fact, to the contrary, they perform exceptionally well overall. The bias I identified is going to be most noticeable at the extremes, which is a common problem in estimators of this type. Analysts just need to be a little more careful in interpreting the results in those cases.

Remember: Even the Scarecrow didn’t get the Pythagorean theorem exactly right on the first try.

I don’t want to sound mean here, but your conclusion is completely wrong (happens to the best of us). What this actually means is that teams with bad records tend to be unlucky and teams with good records tend to be lucky (that’s why we incorporate regression to the mean when trying to predict the future). One standard deviation over 162 games is more than six wins, meaning that just by chance, an average team will win 87 or more games 16% of the time, and 75 or fewer games 16% of the time. But in both those scenarios, its Pythagorean record is likely to be closer to 81-81 than its actual record.

Have you looked at Pythagenpat? The exponent depends on the run environment; it’s (runs + opp. runs)^0.28.
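In code, with runs expressed as per-game rates (the 4.9 and 4.3 rates are made up; 0.28 is the constant as stated above, though 0.287 also appears in common use):

```python
def pythagenpat_exp(rpg_for, rpg_against, k=0.28):
    """Pythagenpat exponent from the combined per-game run environment."""
    return (rpg_for + rpg_against) ** k

x = pythagenpat_exp(4.9, 4.3)
print(round(x, 3))  # lands in the familiar neighborhood of the fixed exponents
```

Higher-scoring environments produce larger exponents, which is the whole point of the dynamic form.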

Pythagoras sure didn’t work for the Indians last year. Another team it didn’t work for was the 2004 Springfield Isotopes, from my fictional player APBA league. Check out their story, with blunt headline, here: http://home.comcast.net/~briankaat/herm081504.pdf

David, I’m a bit of a skeptic on whether the Pythagorean estimators actually measure “luck”, the same way that I’m not convinced that they measure manager ability. (My original draft of the post discussed both concepts.) If you believe in Pythagorean residuals as a measure of luck, then your conclusion is about right, but you’ll have to forgive my skepticism on this one. It’s a psychological (my day job) fallacy to believe that winners necessarily had luck on their side. They may have, but it’s not required.

I agree with you on the need for regression to the mean, but there’s nothing inherent about a team winning 95 games that says that, deep down, they weren’t a “true” 100-win team that got a little unlucky.

I have an article on game by game pythagenpat averages over at detectovision.com…unfortunately…you have to be a premium member to read it, but I’ll probably comment on that here as well at some point.

The Cleveland Indians weren’t NEARLY as good as their (seasonal) Pythagenpat would suggest. They had way WAY too many blowout games throwing off their season RS total. In other words…the Indians were bullies. They beat up on really bad pitching and couldn’t go toe to toe with good pitching.

I’m inclined to side with David’s view. Luckily, that view is testable. Selecting teams by actual wins does set up a regression-to-the-mean problem, the way Pythagorean wins are usually conceptualized. We assume team ability comes first, and that wins are a measure of ability with less than perfect reliability (Wins = Ability + Luck, Luck ~ N(m, s)). Let’s take an example, say 100 wins. Because we’re assuming luck is unbiased measurement error, one would think that this team was as likely to have 99-win ability as 101, 98 as 102, etc. However, we already know that the distribution of abilities is centered at 81, so we should expect more (predicted) 99-win teams than 101-win teams, more 98 than 102, etc., even though P(win=100|A=99) = P(win=100|A=101). Because of this, any method that selects on actual wins and analyzes the distribution of predicted wins will show the above bias, as there are simply more below-100 teams in the distribution than above-100 teams. To put it another way: because we have imperfect prediction, the distribution of predicted wins should have a smaller variance than the distribution of actual wins (Variance(Wins) = Variance(Abilities) + Variance(Luck), as the covariance of Abilities and Luck is 0).
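This selection argument is easy to simulate. In the sketch below, the ability spread of .030 in winning percentage is an assumption made purely for illustration; the selection thresholds mirror the 95- and 67-win examples used earlier in the thread.

```python
import random

random.seed(1)
GAMES = 162

# Simulate seasons: draw a true-talent W% per team, then add binomial "luck".
teams = []
for _ in range(5000):
    ability = random.gauss(0.500, 0.030)          # assumed talent spread
    wins = sum(random.random() < ability for _ in range(GAMES))
    teams.append((wins, ability * GAMES))         # (actual wins, expected wins)

# Selecting on high ACTUAL wins picks out lucky teams (expected < actual),
# and selecting on low actual wins picks out unlucky ones.
good = [exp_w - w for w, exp_w in teams if w >= 95]
bad = [exp_w - w for w, exp_w in teams if w <= 67]
print(round(sum(good) / len(good), 2))  # negative
print(round(sum(bad) / len(bad), 2))    # positive
```

Any unbiased estimator of ability will show this pattern when its residuals are sorted by actual wins, which is exactly the point being made.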

To test this, you should be looking at bias in the distributions of wins at a given level of Pythagorean Win%, not Pythagorean around wins. If the Pythagorean model is correct, you won’t see bias around the predicted win values.

Hope this helps,

Ryne

A misunderstood formula: Variance(Observed) = Variance(Talent) + Variance(Error), not Variance(Luck). Error might be expected random variation, it might be “luck”, or it might be measurement error, including measurement bias. (It’s most likely all three.) I don’t doubt that luck is part of that error term, and that the Pythagorean residuals do tell us at least something about luck. But we’ve found that the residuals are not unbiased, to the tune of a correlation between .25 and .38, depending on which formula you use.

Ryne, I ran your analyses, using the Exp182 formula predictions. I selected for predicted W% of between .495 and .505 (to represent .500 teams). I did similar selections for .400, .450, .550, and .600.

I then checked the mean residual for the Exp182 W%. They were:

.400 range = .00481 (about .77 games in 162)

.450 range = .00373

.500 range = .00308

.550 range = .00040 (almost zero here)

.600 range = -.00175

Notice the progression: as predicted winning percentage goes up, the mean residual goes down. The overall correlation between Exp182 predicted W% and the residuals of Exp182 is -.090. Not a particularly big correlation, but one that still suggests a bit of bias. However, I’m more concerned with the original -.38 correlation between actual W% and the residuals. You could call that good teams “making their own good luck”, but I still call it measurement error.

You’re right to use the term measurement error; I’d just tried to use “luck” to keep with the argument. Thanks for running the model.

Clearly, you’re right, as the assumption of homoscedasticity has been violated (for non-stats types: an assumption of regression is that the errors have constant variance, centered on zero, at every value of the independent variable). The reduction in the correlation between wins and error when moving from actual to predicted wins fits with my conceptualization of the model, as I’d predict zero correlation with predicted wins and a slight positive correlation with actual wins.

To check this, I used SAS PROC NLIN to model the Pythagorean exponent and output residuals on Lahman data. Because it’s all I had handy (from a previous analysis), it only includes data from 1901 on for the 16 teams that have existed since then. Given the functional form of the model (RS^a)/(RS^a + RA^a), I found an exponent of 1.8627, SE = .0144. I then plotted the residuals and found no linear trend with the predictor, as that’s an assumption of the model. To check for a nonlinear violation of homoscedasticity, I split the data into five groups as you did: .400, .450, .500, .550, .600. I found residual means of [-.00073, .00042, .00025, .00167, .00004], none of them significantly different from zero (or even close; the R^2 was .997). In this particular dataset, I’m not seeing the effect you’ve gotten. I’d like to talk more, if you’d like to send me an e-mail.
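For readers without SAS, here is a pure-Python stand-in for that nonlinear fit: a grid search for the exponent minimizing squared error. The data here are synthetic and noise-free, generated at a known exponent of 1.86, so the search should recover it exactly; real team data would not be this clean.

```python
def wpct(rs, ra, x):
    """Pythagorean winning percentage with exponent x."""
    return rs ** x / (rs ** x + ra ** x)

# Synthetic run pairs with winning percentages generated at exponent 1.86.
data = [(rs, ra, wpct(rs, ra, 1.86))
        for rs in range(600, 901, 50) for ra in range(600, 901, 50)]

def sse(x):
    """Sum of squared errors of exponent x against the synthetic seasons."""
    return sum((wpct(rs, ra, x) - w) ** 2 for rs, ra, w in data)

# Grid search over exponents 1.500 to 2.200 in steps of .001.
best_x = min((i / 1000 for i in range(1500, 2201)), key=sse)
print(best_x)
```

PROC NLIN uses a proper iterative least-squares routine rather than a grid, but the objective being minimized is the same.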

Ryne

One small comment: I’m seeing too much reliance on correlations and not enough focus on error MAGNITUDES in these discussions. The latest round of tests here works well to tell you whether you have a directional bias, but when I fit data, I don’t use correlations, I use root mean square error. It’s entirely possible to get two fits, both of which have an R^2 of .95, one of which has an RMSE twice as bad as the other.

Matt, in this particular case, the magnitude of the error isn’t in question. I actually covered that in my original post. My point is that because the errors themselves are used as measurements (of luck/manager skill), they need to be evaluated as well. As they are errors, the easiest check is to see whether they are biased. You’re right, though, to be looking at the magnitude of the error as well.

Ryne, I’d love to e-mail chat, but I can’t find yours. My profile has my e-mail.

Matt,

The RMSE for the above PROC NLIN is .0285, or about 4.6 wins on the 162-game schedule. I just reran it using only data from 1960 on, and got an RMSE of .025, or about 4 wins. However, a lower RMSE does not necessarily mean a better fit. The fit didn’t improve when I removed the early data; the variance just went down because I removed some teams from the 1880s with perfect winning percentages. RMSE is related to the original variance in the system, such that if two populations of identical size have RMSEs of .05 and .10 with identical R^2 values, then their original variances differed by a factor of two. Using a scale-free measure like R^2 gives an estimate of fit that isn’t dependent on understanding the scale.
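For anyone following the unit conversion (this is just my arithmetic on the figures quoted above):

```python
# RMSE quoted in winning-percentage units, converted to wins over a 162-game schedule.
schedule = 162
for rmse in (0.0285, 0.025):
    print(round(rmse * schedule, 2))
```

An RMSE of .0285 in winning percentage is about 4.6 wins, matching the figure above.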

My analyses today have come out such that if the exponent is correctly specified, the bias goes away. I’ve found exponents in the 1.86-1.87 range with no evidence of bias. However, if the model is misspecified (with the wrong estimator, for instance), bias should exist, which is what the author found with the 1.82 and 2.0 exponents.

Sorry about the “the author” stuff. I don’t think I can call an otherwise rational adult “Pizza Cutter” in public.