# Game Averaged PythagenPat

March 30, 2007

Perhaps the biggest problem with seasonal pythagorean win estimators is that they are heavily influenced by blowout games. When a team faces the Royals and beats the tar out of an injured Runelvys Hernandez to the tune of 20 runs, that says far less (in reality) about its intrinsic ability to win games against average competition than if it had faced, say, the Blue Jays and beaten them 6-4. It cannot truthfully be said that there AREN’T teams that carry the “bully” personality, beating up on bad teams and getting consistently outplayed by good ones.

I was thinking about this problem of blowout games vs. close games about a year ago, and then I realized that a good way to neutralize runs scored in “Garbage Time” would be to use PythagenPat itself to enforce the assertion that every game has a maximum value of ONE win. Whether you win a game 25-1 or 6-1, you can only get that one W, and chances are, that 25-1 game came largely against inferior competition.

So here’s what I did:

- I calculated PythagenPat winning percentages for each individual game in baseball history using the gamelog files from Retrosheet.org. In order to do this right you have to find a Patriot Exponent for each game (X = (RS + RA) ^ 0.285 for a single game), and then calculate each team’s W% in that game (W% = RS^X / (RS^X + RA^X)).
- I grouped the data by winning and losing team and gathered statistics on run scoring and allowing, and winning and losing Game PythagenPat W%s.
- I merged all of that information into one nice Excel table which shows RS and RA for each team in just their wins, just their losses, and in all games combined, as well as showing total Single-Game PythagenPat wins garnered in their wins, in their losses, and in all games.
- I calculated seasonal PythagenPat W% for just the wins, just the losses, and for the entire season for each team.
- I calculated Game-Averaged PythagenPat W% (the Single-Game wins divided by the number of games in which those wins were obtained) for the wins, the losses, and the total season.
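The per-game calculation in the steps above can be sketched in a few lines of Python (the function names are mine, not from the original spreadsheet work; the formulas are the ones given in the first bullet):

```python
def game_pythagenpat(rs, ra):
    """Single-game PythagenPat W% for the team scoring rs runs while
    allowing ra, using the per-game Patriot exponent X = (RS + RA) ** 0.285."""
    x = (rs + ra) ** 0.285
    return rs ** x / (rs ** x + ra ** x)

def pythagenmatt(games):
    """Game-Averaged PythagenPat W%: the mean of the single-game W%s
    over a list of (runs_scored, runs_allowed) game results."""
    return sum(game_pythagenpat(rs, ra) for rs, ra in games) / len(games)

# No single game can be worth more than one win: a 25-1 blowout still
# caps out just below 1.0, while a close 6-4 win earns much less.
blowout = game_pythagenpat(25, 1)
close_win = game_pythagenpat(6, 4)
```

Note how the cap works: a 6-4 win earns only about a .69 single-game W%, while a 25-1 blowout earns nearly (but never more than) the full 1.0, so averaging these per-game figures is what bounds each game at one win.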

Immediately, I observed that teams with a reputation for not being as good as their RS and RA suggested were showing up with significantly weaker Game-Averaged PythagenPat records (I’m going to dub this new statistic PythagenMatt so I don’t have to keep typing Game-Averaged all the time) than Seasonal PythagenPat records. I also noticed, however, that in general, all teams tended to pull toward the center (a .500 W%) when doing it this way. This makes a certain amount of sense, as a majority of run scoring happens on the winning side…historically, the winning team outscores the losing team at roughly double the rate, and PythagenMatt takes a bite out of every one of those wins, while also taking a bite out of the negative results of losses (thus moving teams toward the middle).

I considered abandoning the idea of PythagenMatt, but I decided first to see how PythagenPat and PythagenMatt related to actual W%. I graphed PythagenPat (Seasonal) W% (Y axis) vs. Actual W% (X axis) first, and took a Linear Best Fit Line through the data. Note here, I eliminated all teams that did not have at least 100 games played from the sample before doing this, because small samples of games cause wonky outliers that occasionally give an incorrect sense of the utility of both PythagenPat and PythagenMatt. This is what the PythagenPat distribution looks like:

(scatterplot: Seasonal PythagenPat W% vs. Actual W%)

Note the R^2 value (0.9127). This lines up well with most studies done on the reliability of PythagenPat and all other W% estimators. It represents an R value of 0.9553.

This is what you get when you look at PythagenMatt:

(scatterplot: PythagenMatt W% vs. Actual W%)

The tilt is wrong, obviously, thanks to the center-pulling bias, but notice how much more compact the scatterplot looks along the line of best fit. Also notice the dramatically improved R^2 of 0.9585 (an R value of .979).

We can solve the center-pulling bias with a simple linear translation using the line of best fit obtained above (y = 0.6938x + 0.1531). Remember that y is PythagenMatt, so if we want to use PythagenMatt to project Actual W%, we need to invert the equation with some simple algebra to get W% = (PyM – 0.1531) / 0.6938. Linear translations have no effect at all on correlation, so we’ve removed all sources of bias and retained our stronger correlation.
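As a sketch, the de-biasing step is just the inversion described above (the constants come from the regression in the article; the function name is mine):

```python
# Best-fit line from the scatterplot: PyM = 0.6938 * W% + 0.1531.
SLOPE, INTERCEPT = 0.6938, 0.1531

def debias(pym):
    """Invert the regression line to project actual W% from a raw
    (center-pulled) PythagenMatt figure."""
    return (pym - INTERCEPT) / SLOPE

# A raw PythagenMatt of exactly .500 maps back to a .500 actual W%,
# since the regression line passes through (0.5, 0.5).
projected = debias(0.500)
```

Because this is a linear translation, it rescales the estimates without changing their ordering or their correlation with actual W%.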

What results is a pythagorean W% estimator that (a) has no bias by definition (b) removes the problem of blowout games from the equation which (c) causes a much stronger correlation with reality.

I’ll leave it up to the commenters to decide whether I’m off my rocker or whether I stumbled into a useful improvement on Pythagoras, but I found it interesting.

Interesting, but I’m not sure how useful it is. It can’t answer this question:

If the Indians have the talent to score 850 and allow 750 runs this year, how many games can we expect them to win?

It can answer the question of “how many games should this team have won given their run distribution?”

But by the time we can answer this question, we already know how many games they won, and we know from standard (and easy to calculate) pythag that they must have won a bunch of blowouts and lost close games.

It’s very important to the calculation of player ratings to correctly assess the strength of their team…good projections of future performance of players depend on an accurate assessment of their ability against average teams.

“That you would dismiss the importance of (more) correctly rating the skill of a team in the past leads me to assume that you believe actual wins are a more perfect measurement, and that is something I find it difficult to agree with.”

Perfect measurement how? As the perfect record of how many wins they actually had, yes; as a measurement of the true talent of the team, no.

Let’s continue to pick on the Indians:

They won 78 games. PythagenPat says they had 89-win talent. PythagenMatt gives them 83 (assuming I did it right). I’m not sure what the best estimate of their talent was. PythagenMatt may be it. I’m not sure though. Isn’t winning by blowout a characteristic of a good team?

How does that help us going forward though, when the makeup of the roster changes? I really don’t care how many wins, by any estimate, they had last year, I care about the talent level of the players. I don’t think we can assume that they’ll score a majority of their runs in blowout games again.

One thing I can think of that might improve the W-L estimates (besides getting better player projections) is accounting for leverage. I should give a leverage adjustment to the closer or perhaps top two relievers.

BTW Sean…I consider it one of the great mistakes of sabermetrics to treat what has happened in the past as though it too weren’t prone to being misread due to luck/random variation, context, or specific problems and skills of a given team.

You say “It can answer the question of how many wins should this team have gotten?” but by that time we know how many wins they actually produced. Why should I CARE how many wins they actually produced when trying to decipher how good that team was against average competition? Why trust a small sample of 162 trials when we have a method for making what is essentially a much larger data set more accurate?

The reason we try to make win estimators closely relate to actual performance is that we’re trying to show that we’re actually modelling skill correctly while also eliminating the biases of small sample size, the element of randomness associated with the timing of run scoring, and the biases of outlier events (like blowout games).

That you would dismiss the importance of (more) correctly rating the skill of a team in the past leads me to assume that you believe actual wins are a more perfect measurement, and that is something I find it difficult to agree with.

Let’s assume for the moment that PythagenMatt is correct and that those 2006 Indians were an 83-win team, not an 89-win team. That’s important to know for future research because it gives you a more accurate base to start from when rating the 2006 Indians players and their contribution to winning games. I approach player rating from the top-down angle because players cooperate in the production of wins, which you will never EVER see correctly using player-level statistics without attaching them to the team. If I have an 89-win base, I’m going to overestimate the true talent of the Indians players, and by definition they’ll get 6 wins that should go to the rest of the AL. That may not sound like a huge deal, but if there are 30 teams whose true talent can be better appreciated by the PythagenMatt model than the PythagenPat model, then all of the players’ win estimates will be more accurate, and then…when you build up the 2007 Indians by projecting their run scoring rates based on an entire career that is more accurately measured, you’re going to get a more accurate projected win total for the 2007 Indians.

Now I’m sure you’re saying “but how do we know PythagenMatt is more accurate?” You raise an interesting question when you ask if winning by blowout is the mark of a good team. I would say that although good teams are more likely to win by blowout than bad teams, a blowout is more likely to occur when there is a talent mismatch in a single game…and therefore the game is less indicative of the strength of any one team than of the difference between their strengths. Which means the Indians shouldn’t get full credit for thwomping KC by 20 runs…that credit should be broken down some because part of the reason for the thwomping is that they aren’t facing an average team.

Using leverage index had also occurred to me BTW Sean…I just don’t have LI in my PBP database yet, so I can’t accurately leverage every play.

I’ve got a couple of questions about this, first some methods ones then some theoretical questions. First, the “center pulling” problem is a scale problem, right? It’s not the case that the PMat method predicts actual winning percentages with systematic and removable bias, which could indicate a methodological flaw? It sounds like the first is true, which is good.

I’d also like to understand the rationale and assumptions of this new model. One of the relatively implicit assumptions of the traditional Pythagorean model is that the shape of the distribution of run scoring is constant across teams, differing only in scale. Your model seems to toss out that assumption, or at least say that estimating a team’s run scoring ability by total runs is extremely sensitive to outliers. Is this a fair characterization?

The only real criticism I’d offer is that this level of sophistication in a Pythagorean estimator may not be appropriate because it kind of falls in between two approaches to team valuation. On one end, if you’re interested in modeling the underlying ability of each team because wins aren’t the best measure, you have to concede that runs are an imperfect measure as well, and that you should probably model the events within a game to figure out how many runs should have been scored, and how many wins should have resulted. On the other end, if your goal is to look at how teams performed, simple wins or a simpler estimator may be just as appropriate as this method. You almost have to choose between accurate-and-complex and quick-and-dirty, and I’m not sure this method adds enough to pay for its complexity. Have you considered simply adding a mercy rule, preventing teams from scoring more than X runs or Y more than their opponents? Or fitting an R/G curve to each team and defining it by the peak of the curve, removing some of the influence of the outliers?

First question:

I *believe* that the center-pull is caused by the nature of taking a pythagorean for an individual game. When you do that, you are reducing the influence each team has over each individual game. The average win carries an .800 pythag and the average loss carries a .200 pythag…but the sum of average wins and average losses can never have the same W% spread as the sum of 1s and 0s (a win being 1 and a loss being 0). No matter what the score of the game, pythagorizing it reduces the gap between the winner and the loser, causing the center-pull. That’s a scale problem, not a systematic bias with the input data.
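A toy calculation illustrates the center-pull, using the .800 average win and .200 average loss figures cited above (the 90-72 record is chosen purely for illustration):

```python
# Credit each of a team's 90 wins at the "average win" pythag of .800
# and each of its 72 losses at .200, then compare to the actual W%.
wins, losses = 90, 72
actual = wins / (wins + losses)                       # sum of 1s and 0s
matt = (wins * 0.8 + losses * 0.2) / (wins + losses)  # sum of pythags

print(round(actual, 4))  # 0.5556
print(round(matt, 4))    # 0.5333 ... pulled toward .500
```

Every win contributes less than 1 and every loss contributes more than 0, so any team above .500 is pulled down and any team below .500 is pulled up, exactly the scale effect the linear translation later removes.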

Second question:

The implicit assumption with a seasonal pythagorean win estimator is that all games are created equal and that the distributions of runs scored and allowed are not dependent on match-ups or the balance of intrinsic strengths of the two teams. If you assume that is true, then you can also assume that the standard deviation of run scoring per game is dependent only on the team’s run scoring environment (hence the genesis of the PythagenPat exponent).

This new model rejects that assumption as inherently flawed. I don’t believe that all games are created equal, nor do I believe that all teams respond to their schedules with predictable spreads in run scoring and allowing rates. This model asserts that the results of an individual game should be reduced to a pythagorean projection so that we can see how often we would expect each team to win that game if it were replayed a million times under the exact same conditions.

While it may be true that run scoring has an error, it has a much smaller (relative) error even on a single-team seasonal level than wins, and what’s more, the correlated linear relationship between PythagenMatt and actual W% gives us the power of the ENTIRE history of baseball to judge the merits of individual runs, taking away the vast bulk of their associated error. The standard deviation of error in a linear correlation that makes use of over 2000 teams is astronomically small, and yet the RMSE and R^2 show tremendous accuracy in the prediction compared to the simple PythagenPat.

As to the level of sophistication…I’ve done all of the hard work. The linear regression model shouldn’t change much over time now. The only thing that makes this somewhat difficult is that you need to calculate pythagenpat results for each game…but all of the major statistics warehouses including ESPN and MLB.com now keep a log of game results, and it is a very simple matter to find single-game pythagenpats. It’s not THAT sophisticated mathematically.

I have not considered a mercy rule because (a) beating a team by 21 runs means more than beating them by 5 runs…not MUCH more…but more nonetheless, and it would not be correct to omit those runs from existence…the math logic fails; and (b) it would also have a big center-pulling bias for the same reason that PythagenMatt does, and would require the same level of data (a gamelog table and a linear regression to correct the center-pull problem).

I have also not considered curve fitting because that would be significantly MORE work (more complexity) than this relatively simple, logically pure, and very precise updated model.

I’m certainly open to the possibility that for most general applications of W% estimators, PythagenMatt is more precise but at too high a price, but I believe something like this should be employed if you’re really interested in the ability of each team. I also believe you’ll find that PythagenMatt is more predictive of future results than PythagenPat is, though I haven’t run the study to confirm that.

OK, the starting point for a redesign of win shares?

That might work.

Yeah…I might have made it clearer in the original article, so this is partially my fault. I view almost everything I do through the lens of improving PCA…this is just another example of my attempts to improve my analysis of the entire history of the game…the improved analysis being intended for use in projecting future performance.

FWIW, here are the top 10 teams in PythagenMatt wins:

| Year | Tm  | W   | Pat   | Matt  |
|------|-----|-----|-------|-------|
| 1906 | CHN | 117 | 115.4 | 121.4 |
| 2001 | SEA | 116 | 109.8 | 114.8 |
| 1998 | NYA | 114 | 109.8 | 114.6 |
| 1904 | NY1 | 109 | 109.2 | 112.7 |
| 1927 | NYA | 111 | 111.6 | 112.4 |
| 1939 | NYA | 106 | 113.1 | 111.4 |
| 1998 | ATL | 106 | 106.5 | 110.5 |
| 1907 | CHN | 111 | 101.9 | 109.8 |
| 1954 | CLE | 111 | 104.6 | 109.2 |
| 1909 | PIT | 112 | 106.1 | 109.1 |

Obviously this doesn’t adjust for the quality of the league…that’s a whole series of additional articles I’ve got planned for you folks. 🙂 But it’s certainly interesting at least to me. 🙂

What does PCA stand for?

Sean,

PCA = Pythagorean Comparative Analysis

I invented the basic method behind PCA a few years ago after reading Win Shares and (a) liking James’ top-down approach but (b) hating a lot of the internal methods he used…especially when it came to team fielding and pitching results. PCA is therefore based on the top-down team-rating approach, but employs more current sabermetric philosophy (linear weights instead of Runs Created, advanced pythag instead of straight wins, DIPS theory instead of six component stats to divide pitching and fielding, etc.).