What run estimator would Batman use? (Part I)
August 29, 2008
[A note from the author: This study ended up becoming more involved than initially suspected, mostly because the author is bad at estimating such things. As such, this is the first part of the piece, which will eventually be published in two or three parts, depending. This part isn’t very technical, and largely concerns itself with the theory behind run estimation. I state this up front so that you don’t get 2,000 words into the document only to be disappointed that not a single run estimator has been evaluated at all.]
This isn’t the first study on run estimator accuracy, and I don’t promise it will be the most thorough. But I’ve been skirting around the issue in my previous work here, and so I figured it was time to finally get around to doing it proper, so that I can just have something to conveniently reference every time it comes up in the future.
Most previous studies of accuracy have concerned themselves with accuracy at the team level, using seasonal totals. This makes sense for a lot of reasons – run scoring is a team process, and team level run scoring data is readily available for entire seasons. Here’s the rub, though – estimating runs at the seasonal team level isn’t that hard. Here’s a look at the distribution of team runs per game, 1954-2007:
Notice how everything bunches up in the center? That’s because there isn’t a vast difference in run scoring totals between teams over the course of an entire season. That’s how you can explain the sterling accuracy of my latest run predictor, using runs per game:
        Correl.   Avg. Error
RS_G    0.693     0.593
RA_G    0.673     0.095
Okay, so it’s not even as good as, say, batting average at predicting team run scoring. But it’s pretty decent, considering I just assumed every team was league average.
But we’re not interested in simply studying things that fall in the limited range of team run scoring totals. Take pitchers, for instance. What does a histogram of runs allowed by qualified starters look like? 1956-2007:
That’s a bit more diverse, isn’t it? And we’re selectively sampling only those pitchers who were starters good enough to pitch for a whole season. Ace relievers and guys who get shelled in their first start and are banished back to the minors are an even more diverse group.
We aren’t going to get very far studying Very Interesting Things with run estimators that only work between 4 and 6 runs per game, a range that covers over 75% of all teams in the sample used to generate the above graphs. The One True Run Estimator is going to have to work in extreme conditions, because sometimes we as sabermetricians need to work in extreme conditions. We need something industrial strength, something built Ford tough – the sort of run estimator that Batman would use, because who doesn’t want to be Batman? (Seriously – no lying here. You want to be Batman, don’t you?)
So, like NASA testing the latest Mars rover, we’re going to take our run estimators out into the most unforgiving of conditions – the harshest deserts, the chill of the Arctic circle, the very depths of the oceans themselves!
…no, actually, we’re just going to try them out on innings. Okay, half-innings. From here on out, I’ll call them innings, because it rolls off the tongue better.
When you get down to brass tacks, run scoring happens on the inning level. Once you record the third out in the inning, you’ve closed the book on everything that happened before. A home run in the third inning won’t drive in a guy who takes a walk in the fourth inning. The inning is the smallest fundamental unit of the team run scoring process.
So what exactly is a run estimator, anyway?
Glad you asked, heading! What a run estimator does is it takes certain inputs (generally things like singles, doubles, triples, home runs, walks, etc.) and estimates how many runs would result from those events.
Why do we need run estimators? Well, if we’re attempting to use projections, then it’s handy to be able to use them to generate run scoring totals, the better to predict things like win-loss records with.
They also help us to isolate a player’s individual contribution to run production. Runs come about in a team context, and a player’s runs scored and RBIs are heavily team dependent. Using run estimators, we can estimate how a player would perform on, say, a league average team. Or any team in history, really, or any team you can conceive of – want to see how a team of nine Barry Bonds would hit? No problem!
The basic principles of run estimation are simple. Every event contributes to or detracts from run scoring in the following ways:
 Provide a baserunner. With the exception of a home run, no event is capable of scoring runs without baserunners ahead to drive in. Every baserunner provides run scoring potential, sort of like the potential energy in a pile of kindling and tinder.
 Advance the baserunner. Once you have your baserunners – your potential energy – you drive them in by advancing the baserunners. This is your set of flint and steel, to provide a spark and convert potential energy into actual energy.
 Avoid making outs. Once you reach three outs in the inning, the run potential drops to zero, no matter how many baserunners you have. Outs remaining are like oxygen. Fire won’t burn without oxygen and runs won’t score without outs remaining in the inning.
All a run estimator does is, well, estimate the value of an event in light of these three aspects.
The three types of run estimators
Yep, there are three types of run estimators: linear, dynamic and models.
Linear estimators assume that (for the context in which the weights were generated) the run value of an event is a constant – a home run is worth 1.4 runs, a single .5 runs, etc. Some of the more popular implementations are Pete Palmer’s Batting Runs and Paul Johnson’s Estimated Runs Produced.
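As a minimal sketch of the linear idea, the snippet below simply sums fixed weights over event counts. Only the home run (1.4) and single (0.5) values come from the text above; the other weights and the sample batting line are placeholder assumptions, not Palmer’s or Johnson’s published coefficients.

```python
# Run value per event. Only "single" and "home_run" come from the text;
# the other weights are illustrative placeholders, not published values.
LINEAR_WEIGHTS = {
    "single": 0.5,
    "double": 0.8,    # assumed
    "triple": 1.1,    # assumed
    "home_run": 1.4,
    "walk": 0.3,      # assumed
    "out": -0.1,      # assumed
}


def linear_runs(events, weights=LINEAR_WEIGHTS):
    """Estimate runs as a weighted sum of event counts."""
    return sum(weights[name] * count for name, count in events.items())


# Hypothetical batting line, invented for illustration.
LINE = {"single": 100, "double": 30, "triple": 3,
        "home_run": 25, "walk": 60, "out": 400}

print(round(linear_runs(LINE), 1))
```

Because the weights are constants, one extra home run adds the same 1.4 runs to any batting line, which is exactly the assumption the dynamic estimators relax.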
Dynamic run estimators attempt to describe the interaction of different events. On a team with a .500 on-base percentage, a home run is worth more runs than on a team with a .200 on-base percentage. The two main dynamic run estimators are Bill James’ Runs Created and David Smyth’s BaseRuns.
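To see that interaction in numbers, here is a rough sketch using the basic version of Runs Created, (H + BB) × TB / (AB + BB). The two team lines are made up for illustration (100 at-bats, singles only, no walks), and the helper `home_run_value` is a hypothetical name, not part of any published method.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (hits + walks) * total_bases / (at_bats + walks)


def home_run_value(hits, walks, total_bases, at_bats):
    """Marginal runs from adding one home run (1 AB, 1 H, 4 TB) to a team line."""
    before = runs_created(hits, walks, total_bases, at_bats)
    after = runs_created(hits + 1, walks, total_bases + 4, at_bats + 1)
    return after - before


# Two made-up teams over 100 at-bats, singles only and no walks,
# so on-base percentage is simply hits / at_bats.
high_obp = home_run_value(hits=50, walks=0, total_bases=50, at_bats=100)  # .500 OBP
low_obp = home_run_value(hits=20, walks=0, total_bases=20, at_bats=100)   # .200 OBP

print(high_obp > low_obp)
```

The same home run picks up more value on the high on-base team because the formula multiplies baserunners by advancement instead of adding them.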
A model goes one step further, and actually simulates the workings of a baseball lineup, generally through matrix math and Markov chains. The only publicly available Markov model I am aware of is Tom Tango’s.
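For a feel of what a Markov approach looks like, here is a toy sketch of a single inning. It is not Tom Tango’s model: the event probabilities are invented, every batter is identical, and the baserunning rule (runners advance exactly as far as the batter, walks only force runners) is a simplifying assumption. The state is (outs, bases), and the code pushes the probability of each state forward one plate appearance at a time, adding up expected runs as it goes.

```python
from collections import defaultdict

# Hypothetical per-plate-appearance event probabilities (they sum to 1).
EVENT_PROBS = {"out": 0.68, "walk": 0.08, "single": 0.16,
               "double": 0.05, "triple": 0.005, "home_run": 0.025}
BASES_GAINED = {"single": 1, "double": 2, "triple": 3, "home_run": 4}
EMPTY = (False, False, False)  # (first, second, third)


def advance(bases, event):
    """Return (new_bases, runs_scored) for a non-out event."""
    first, second, third = bases
    if event == "walk":
        # Only forced runners move on a walk.
        runs = int(first and second and third)
        return (True, first or second, third or (first and second)), runs
    gained = BASES_GAINED[event]
    # Batter starts at base 0; assume every runner moves as far as the batter.
    positions = [0] + [i + 1 for i, occupied in enumerate(bases) if occupied]
    moved = [p + gained for p in positions]
    runs = sum(1 for p in moved if p >= 4)
    new_bases = tuple(base in moved for base in (1, 2, 3))
    return new_bases, runs


def expected_runs_per_inning(probs=EVENT_PROBS, max_pa=200):
    """Push the state distribution forward one plate appearance at a time."""
    dist = {(0, EMPTY): 1.0}  # (outs, bases) -> probability of being there
    total = 0.0
    for _ in range(max_pa):
        nxt = defaultdict(float)
        for (outs, bases), p in dist.items():
            for event, q in probs.items():
                if event == "out":
                    if outs + 1 < 3:  # the third out ends the inning
                        nxt[(outs + 1, bases)] += p * q
                else:
                    new_bases, runs = advance(bases, event)
                    total += p * q * runs
                    nxt[(outs, new_bases)] += p * q
        dist = nxt
    return total


print(round(expected_runs_per_inning(), 3))
```

A real model would use a transition matrix built from actual play-by-play data and would track the batting order; the `max_pa` cutoff here is just a convenient way to stop once virtually all innings have ended.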
As you move from one type of estimator to the next, you generally end up with more accuracy in exchange for more complexity. (Another benefit of linear run estimators is that they work very well for individual hitters; the other methods require additional steps in order to accurately portray a hitter’s contribution to a team context.)
A warning to those who pass through here
As I just hinted at, it is dangerous to apply a dynamic run estimator to an individual player – okay, not in a "It’s going to explode!" sort of sense, but it’s not advisable. The reason is that a hitter is (assuming a full-time starter) only 1/9th of his run environment. While a player’s individual batting stats might look like the histogram of a pitcher’s runs allowed, his environment looks more like the team histograms.
Or put another way – a player’s three aspects of run generation do not interact with each other, but with his team as a whole. He will provide a base runner for the hitters behind him, and drive in the players who bat ahead of him. When he avoids making an out, he more directly secures more plate appearances for his teammates in that inning; only indirectly does he provide more plate appearances for himself.
For this reason, most dynamic run estimators come in a "theoretical team" version, where a player plays in a lineup consisting of eight average players in addition to himself.
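One way to sketch the theoretical team idea, using basic Runs Created as the dynamic estimator: add the player’s line to eight copies of an average line, run the estimator on the whole lineup, and subtract what the eight average players produce alone. The two batting lines and the function names are hypothetical, and this is not the exact published theoretical team formula for either Runs Created or BaseRuns.

```python
def runs_created(line):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (line["h"] + line["bb"]) * line["tb"] / (line["ab"] + line["bb"])


def combine(line, other, copies=1):
    """Add `copies` of `other` to `line`, stat by stat."""
    return {stat: line[stat] + copies * other[stat] for stat in line}


def theoretical_team_runs(player, average):
    """Runs of (player + eight average hitters) minus the eight alone."""
    nobody = dict.fromkeys(player, 0)
    eight_average = combine(nobody, average, copies=8)
    with_player = combine(player, average, copies=8)
    return runs_created(with_player) - runs_created(eight_average)


# Hypothetical season lines, invented for illustration.
AVERAGE = {"h": 150, "bb": 55, "tb": 240, "ab": 560}
SLUGGER = {"h": 180, "bb": 110, "tb": 380, "ab": 520}

print(round(runs_created(SLUGGER), 1))                    # estimator applied to him alone
print(round(theoretical_team_runs(SLUGGER, AVERAGE), 1))  # in an average lineup
```

With these made-up numbers the slugger’s stand-alone figure comes out higher than his theoretical team figure, because the stand-alone calculation has him driving in copies of himself. An exactly average hitter gets the same answer either way.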
Linear weights provide a theoretical team construction as well, although it’s slightly different from the construct provided by a dynamic run estimator. A linear run estimator will generally play a player on a team that is average on the whole. That means that a very good hitter will be viewed in the context of being on a team of eight slightly below-average players, and a very poor hitter will be viewed in the context of being on a team of eight slightly above-average players.
The difference between theoretical team and linear run estimators in setting that baseline is minimal and rarely shows up in practice. Rarely is not the same as never, though.
Burying a lie
You will often hear people debate linear weights versus Runs Created by arguing about whether a certain measure values players relative to average, relative to replacement, or in absolute runs.
This is completely irrelevant to our purposes. Any run estimation framework can be tweaked to produce runs above any baseline you desire. So long as you are aware of the context in which a player generated those runs (in short, how many plate appearances or outs he used) the answers are all the same. It’s entirely a matter of presentation, not of accuracy.
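A small sketch of why the baseline is only presentation: given a player’s absolute runs and the outs he used, any baseline is one subtraction away. The player totals and the runs-per-out rates below are hypothetical numbers chosen for illustration.

```python
def runs_above_baseline(absolute_runs, outs_used, baseline_runs_per_out):
    """Subtract what a baseline hitter would produce in the same number of outs."""
    return absolute_runs - baseline_runs_per_out * outs_used


# Hypothetical player and baseline rates (runs per out).
ABSOLUTE_RUNS, OUTS_USED = 100.0, 400
AVERAGE_RATE, REPLACEMENT_RATE = 0.18, 0.14

above_average = runs_above_baseline(ABSOLUTE_RUNS, OUTS_USED, AVERAGE_RATE)
above_replacement = runs_above_baseline(ABSOLUTE_RUNS, OUTS_USED, REPLACEMENT_RATE)

print(above_average, above_replacement)
```

The gap between the two figures depends only on the outs used and the two baseline rates, not on which run estimator produced the absolute runs.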
In our case, since we are validating against absolute runs, we will tweak our run estimators to produce absolute runs. This doesn’t mean that this baseline is "correct" or desirable. The question you are trying to answer determines the correct baseline you should use, and nothing else.
A statement of principles
There is a real danger in doing a study like this. If you’ve ever read a run estimation study, you’ll generally find that the author’s pet run estimator comes out on top. (This is an exaggeration, but many of the people who publish run estimation studies do so in concert with a run estimator of their own, and I’ve never seen one that said that the author’s own run estimator is merely middle-of-the-pack.)
There are generally two reasons for this. The first is publishing bias – generally when you review your run estimator and find it isn’t best of breed, you go back to the drawing board and try to improve your run estimator.
The other concern is that you can generate a lot of pointless accuracy by writing to your benchmark. If all you’re concerned about is generating the smallest RMSE or the highest R squared against your test data set, guess what? You’ll probably end up with the metric with the smallest RMSE or the highest R squared.
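For reference, the two benchmark figures mentioned here can be computed as follows. This is a plain sketch: R squared is taken as the squared Pearson correlation between estimated and actual runs, and the sample numbers are invented.

```python
import math


def rmse(estimated, actual):
    """Root mean squared error between estimated and actual runs."""
    squared = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(estimated, actual):
    """Squared Pearson correlation between estimates and actual values."""
    n = len(actual)
    mean_e = sum(estimated) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(estimated, actual))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov ** 2 / (var_e * var_a)


# Invented team-season totals, for illustration only.
ESTIMATED = [700.0, 750.0, 810.0, 690.0]
ACTUAL = [712.0, 741.0, 825.0, 680.0]

print(round(rmse(ESTIMATED, ACTUAL), 2), round(r_squared(ESTIMATED, ACTUAL), 3))
```

Both numbers only describe fit to the sample at hand, which is why chasing them says little about whether the estimator’s construction makes sense.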
The problem of course is that you’re eking out a slightly higher bit of accuracy against your sample data by sabotaging the overall construction of your model. You end up with things like the situational hitting adjustment and the reconciliation factor in New Runs Created. Or regression models of linear weights where sacrifice flies are considered more valuable than doubles.
It should not be true that "A sufficiently advanced run estimator is indistinguishable from the teams table in the Baseball Databank." If you really just want to know how many runs the Yankees scored in 1987, then look it up. The idea behind validating run estimators is to test their theoretical validity against a set of data where we know the answer to the question we’re asking, not as some sort of exercise in mathematical self-indulgence but so that we can take those run estimators and apply them to more interesting things.
Designing a run estimator with an R squared fetish makes it less, not more, suitable for those purposes. I’m less interested in validating that sort of construct, and more interested in validating the framework behind the various types of run estimators.
That said – I have created my own set of empirical linear weights to use in this test, using separate run expectancy tables for every league and season in the Retrosheet data. I am relatively certain they’ll fall short of one of the dynamic run estimators, and I will publish anyway. By using the run expectancy method, I sidestep the concerns mentioned above – everything is based not upon accuracy but on the actual run scoring process. It’ll become more clear when I step through everything next week.
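As a preview of the general idea, and only as a sketch under assumptions: the run expectancy method is usually described as valuing each play by the change in run expectancy plus any runs that scored, then averaging over every play of the same type. The run expectancy values and the handful of plays below are invented; the real tables come from Retrosheet and are not reproduced here.

```python
from collections import defaultdict

# Hypothetical run expectancy by (outs, bases); "1--" means a runner on first
# only. A real table has all 24 states, one table per league and season.
RUN_EXPECTANCY = {
    (0, "---"): 0.50, (0, "1--"): 0.88,
    (1, "---"): 0.27, (1, "1--"): 0.52,
    (2, "---"): 0.10, (2, "1--"): 0.22,
}


def play_value(before, after, runs_scored, table=RUN_EXPECTANCY):
    """Run value of one play: RE after - RE before + runs scored on the play."""
    end = 0.0 if after[0] >= 3 else table[after]  # three outs: nothing left
    return end - table[before] + runs_scored


def empirical_weights(plays):
    """Average the play values for each event type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event, before, after, runs in plays:
        totals[event] += play_value(before, after, runs)
        counts[event] += 1
    return {event: totals[event] / counts[event] for event in totals}


# Invented play-by-play records: (event, state before, state after, runs).
PLAYS = [
    ("single", (0, "---"), (0, "1--"), 0),
    ("home_run", (0, "1--"), (0, "---"), 2),
    ("out", (1, "---"), (2, "---"), 0),
    ("home_run", (2, "---"), (2, "---"), 1),
]

print(empirical_weights(PLAYS))
```

Nothing in this calculation looks at how well the resulting weights predict team totals, which is the sense in which the method avoids writing to the benchmark.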
And on that note – can I explain the inner workings of Runs Created and BaseRuns? Can I explain how to figure out true linear weights? Can I describe the inner workings of a Markov model? Tune in next week! Same Bat time! Same Bat channel!