What run estimator would Batman use? (Part II)
September 5, 2008
If you haven’t already, I suggest you read Part I first, though it’s not strictly necessary so long as you have a feel for how run estimators work; Part I covers the background without much technical detail.
Now, let’s go ahead and strap some run estimators down to the table, cut them open and see how they work.
Linear weights
First of all, when I refer to linear weights, I should clarify that I use the term to refer to any linear run estimator, not just Pete Palmer’s Linear Weights System. Onward, then.
Simply looking at a linear weights formula should be pretty straightforward. We’ll look at the reduced version of Extrapolated Runs, Jim Furtado’s version of a linear weights formula*:
(.50 * 1B) + (.72 * 2B) + (1.04 * 3B) + (1.44 * HR) + (.33 * (HP+TBB)) + (.18 * SB) – (.32 * CS) – (.098 * (AB – H))
Essentially, every event is multiplied by its average run value, based on a certain run context. (In the case of XR it’s team seasons from 1995 to 1997, but you could use any context you wanted. You could put together a linear weights formula for, say, Greg Maddux’s career if you wanted to.)
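As a sketch, the reduced XR formula can be written as a small function (the argument names are mine; note that the caught-stealing and out terms carry negative weights):

```python
def xr_reduced(singles, doubles, triples, hr, walks_hbp, sb, cs, outs):
    """Reduced Extrapolated Runs: each event count times its average run value.

    'outs' is AB minus H; CS and outs subtract from the estimate."""
    return (0.50 * singles + 0.72 * doubles + 1.04 * triples
            + 1.44 * hr + 0.33 * walks_hbp + 0.18 * sb
            - 0.32 * cs - 0.098 * outs)
```

For example, `xr_reduced(1, 0, 0, 1, 0, 0, 0, 1)` – a single, a home run and one out – comes out to about 1.84 runs.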
This raises the question of how to determine the run value of an event. Looking simply at Runs Batted In won’t help – a single with the bases empty drives in no runs, but it still provides value. So what do we do? Here’s where a concept called run expectancy comes in handy. Every base/out state has a certain run expectancy, which is essentially how many runs, on average, a team scores from that point of the inning on. I’m using values from this table by Tango, because they’re already in a nice arrangement.
Bases    0 outs    1 out     2 outs
___      0.555     0.297     0.117
1__      0.953     0.573     0.251
_2_      1.189     0.725     0.344
__3      1.482     0.983     0.387
12_      1.573     0.971     0.466
1_3      1.904     1.243     0.538
_23      2.052     1.467     0.634
123      2.417     1.650     0.815
There’s one case not strictly defined on the table; three outs means a run expectancy of zero.
The linear weights value of an event is the average change in run expectancy caused by that event. Let’s say you have runners on first and second with no outs; that’s an RE of 1.573. A player hits a double, scoring the two runners in front of him:
2 + 1.189 = 3.189
The double scored two runs and leaves the game with an RE of 1.189, for a total RE of 3.189. Subtract 1.573, and you get 1.616, the run contribution of that double. Take the average RE change of every double available in your dataset, and there’s your linear weights value of a double.
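A sketch of that arithmetic, with the two run-expectancy values pulled from the table above (the state labels are my own shorthand):

```python
# Run expectancy for (base state, outs), from the table above.
RE = {("12_", 0): 1.573, ("_2_", 0): 1.189}

start = RE[("12_", 0)]   # runners on first and second, nobody out
runs_in = 2              # both runners score on the double
end = RE[("_2_", 0)]     # batter standing on second, still nobody out

run_value = runs_in + end - start   # the run contribution of this double
```

`run_value` works out to 1.616, matching the example; averaging this quantity over every double in the dataset yields the linear weight of a double.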
(There are other ways to estimate linear weight values when you don’t have sufficient data to do the Run Expectancy analysis; an overview of the subject is available.)
Pros:
 Linear weights estimators are very simple to use. The only math involved is subtraction and multiplication.
 You can apply them directly to individual players and get very reasonable results.
 They’re very flexible; you can come up with custom linear weights for almost anything imaginable. Want linear weights by pitch location? No problem!
Cons:
 They don’t work very well at the extremes; linear weights measure the average run contribution in the average run environment for the sample.
 Events in a linear weights model don’t interact with each other. If a team takes more walks and hits more home runs, the value of the home runs doesn’t increase to reflect the increased likelihood of baserunners being on when the homers are hit; the value of the walk doesn’t increase to reflect the increased likelihood of scoring on a home run.
 Unlike a dynamic run estimator, you have to build a custom linear weights model for different environments – or at least you do if you want your results to be close to accurate.
* Why Extrapolated Runs? Because it uses the same terms as the other formulas I’m testing. Because it has some name recognition. And because it absolutely kills in most accuracy studies I’ve seen, even in comparison to very good dynamic run estimators. I would not, however, recommend XR for player evaluation – XR kills at estimating team runs because that’s what it was designed to do. Instead, I’d recommend a good set of empiric linear weights, like Tom Ruane’s.
Runs Created
What to say about Runs Created.
It all started out rather simply – Bill James’ original Runs Created formula is probably the simplest run estimator possible:
OBP * SLG * AB
This is fantastically useful if you ever have to estimate run scoring from nothing but crumpled-up issues of The Sporting News, or from the full-screen graphics on ESPN during SportsCenter. (And if you’ve never had to do that, you’ve never written 4,000-plus words on the accuracy of various run estimators. Not because you need to do one to do the other, but because you need to be that kind of person.)
Unfortunately it didn’t end there. It was only beginning.
I should back up a moment and explain that the above is really shorthand for the actual Runs Created Basic formula. (Not a shortcut, though – it’s algebraically identical.) The full formula is:
(H+W)*TB/(AB+W)
Generally represented as:
A * B / C
Where A is baserunners, B is advancement and C is opportunities. (Opportunities in this case means plate appearances.) Since then, it’s been a succession of ever more complex formulas, all following the basic A*B/C construct. (There are also “theoretical team” versions designed to correct flaws in applying RC to individual batters; since we’re studying RC at the inning level, they’re not really relevant here.)
I won’t give you an in-depth explanation of the differences between Tech-1 RC and Tech-7 RC and so forth; I do want to look a bit closer at the basic framework behind RC, though. You could rewrite RC as this:
Runners * % of runners that score
with B/C being your estimated scoring rate.
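A minimal sketch of Basic Runs Created, assuming only the four inputs named above:

```python
def rc_basic(hits, walks, total_bases, at_bats):
    """Basic Runs Created: A (baserunners) * B (advancement) / C (opportunities)."""
    a = hits + walks          # baserunners
    b = total_bases           # advancement
    c = at_bats + walks       # opportunities (plate appearances)
    return a * b / c
```

Note that nothing bounds B relative to C, which is where the distortions discussed below come from: `rc_basic(1, 0, 4, 1)` – a lone solo home run – returns 4.0 estimated runs.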
Pros:
Basic versions of RC are very easy to use; Basic Runs Created is the perfect run estimator for when you’re on vacation or otherwise need a quick-and-dirty run estimator that’s also easy to remember.
You can find Runs Created at pretty much any baseball website – even ESPN’s, for crying out loud. (They even use Tech-1 RC, while Baseball Prospectus uses Basic Runs Created. It’s a crazy world we live in.) It’s like the McDonald’s of run estimation – fast, hot and available on every street corner.
Cons:
 There’s a reason that James bothered to come up with even more complex versions of Runs Created – the simpler versions of RC leave something to be desired in the accuracy department. They of course sacrifice most of the simplicity of Basic RC.
 Using B/C as your scoring rate estimator leaves you with the problem that you can, in fact, end up with more runs scoring than there ever were baserunners, because you can get B values in excess of C. This is not in fact possible in baseball.
 On that note… well, consider a player with a solo home run in his first PA. His AVG/OBP/SLG would be 1.000/1.000/4.000. Well, according to Runs Created: 1.000 * 4.000 * 1 = 4. So, a solo home run is expected to score four runs. This is not in fact possible in baseball. And this is only an extreme example of how Runs Created handles the home run, not the breaking point. (The more sophisticated versions of Runs Created handle the home run better, but are still flawed in that regard.)
 Application to individual hitters leads to distorted numbers, hence the creation of “theoretical team” versions.
BaseRuns
BaseRuns was created by David Smyth in order to address the aforementioned problems in RC, and it follows a similar construction, with a few key differences. The basic framework for BsR is:
A * B/(B+C) + D
The D factor is equal to home runs. A home run guarantees at least one run scoring, so BsR conforms to that reality. That means a reduced weight for the home run in the A and B values, which just as in RC stand for baserunners and advancement. (C is still opportunity as well, but BsR uses outs instead of plate appearances.)
The other key difference is that BsR includes B in both the numerator and denominator of the scoring-percentage estimator. By doing this, BsR caps the number of runs that can score at the number of baserunners in the inning, plus home runs.
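The framework can be sketched in one line (a generic version; the A through D inputs are whatever your particular BsR variant defines):

```python
def baseruns(a, b, c, d):
    """BaseRuns: baserunners times an estimated score rate, plus home runs.

    Because B / (B + C) is always less than 1 when C > 0, estimated
    runs can never exceed baserunners (A) plus home runs (D)."""
    return a * b / (b + c) + d
```

With A = 0 and D = 1 (a lone solo home run), the estimate is exactly one run, unlike the RC example above.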
Pros
BsR is the best simulation (as of right now) of actual baseball scoring in the form of a single equation. That’s useful not just for accuracy, but also if we want to use our run estimator as a model to solve a larger problem.
 Many BsR equations come with a “tuning factor” in the B factor to allow you to tweak the equation to fit a particular run scoring environment.
Cons
 More complicated than Basic RC (although, in my opinion, no more complicated than any of the further refinements of RC).
 Like RC, it’s problematic to apply to individual hitters, hence the need for “theoretical team” versions.
The problem of negative runs, and other concerns
One issue with most run estimators is that they will, in extreme low-scoring environments, estimate negative runs. In the case of a linear weights formula, there’s nothing you can do about it, and so I left XR untouched in this regard.
In the case of RC and BsR, however, you don’t need to include negative coefficients in the B factor. So… I removed them. I also included all known outs on the bases (caught stealing and double play) in the A factor for both RC and BsR.
I also counted reaching on error as a single and an intentional base on balls as a walk. The two main reasons I’ve heard for excluding them are lack of data and a desire to avoid giving a player credit for events he deserves no credit for. I question the second reason, but neither is really relevant here.
Two play categories that are situation-dependent are sacrifices and double plays. Sacrifices (sac flies and sac hits) vary widely year-to-year in run value, because they can only occur in a limited number of plate appearances, so a handful of plays can greatly skew the value. Some years they have a positive value according to RE charts, other years a negative value. And I view them as a mostly arbitrary scoring category – you don’t change anything fundamental about the game if you no longer award sacrifices. I decided to exclude them from the test, counting them only as outs.
You could make a case for excluding double plays for similar reasons, but I didn’t for two reasons. One, double plays are much less volatile in their value year to year. Two, baseball without double plays is markedly different – if you no longer awarded double plays, game strategy would change.
And for all cases where an equation called for the number of outs, I simply used three. In the “real world” of team seasonal totals, many outs are unaccounted for (for instance, if a runner is credited with a single but thrown out trying to stretch it into a double, it doesn’t show up anywhere in the official statistics as an out).
And in the event of partial innings (either due to the home team scoring in the ninth or later, or a stop in play for other reasons), I am simply considering the difference between outs recorded and three outs to be implied outs; once the inning is over you have a drastic change in run expectancy that you need to account for if you want your model to work at all on partial innings.
The test
The formulas used for the test, starting with linear weights (derived from XR):
(.50 * (1B+E)) + (.72 * 2B) + (1.04 * 3B) + (1.44 * HR) + (.33 * (BB+IBB+HBP)) + (.18 * SB) – (.32 * (CS + DP)) – (.098 * Outs)
Runs Created, based upon the formula published in the Bill James Handbook:
A: (1B + 2B + 3B + HR + BB + IBB + HBP + E – CS – DP)
B: (1.125*(1B+E) + 1.69*2B + 3.02*3B + 3.73*HR + .29*(BB + IBB + HBP) + .492*SB)
And BaseRuns, based on Smyth’s BaseRuns Primer version:
A: (1B + E + 2B + 3B + BB + HBP + IBB – CS – DP)
B: (.88*(1B+E) + 2.42*2B + 3.96*3B + 2.2*HR + .11*(BB + HBP) + .99*SB)
C: 3
D: HR
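As a sketch, the tested BaseRuns version written out in full (the stat-count argument names are mine):

```python
def bsr_tested(s1, s2, s3, hr, bb, ibb, hbp, e, sb, cs, dp):
    """The BaseRuns variant used in the test: A*B/(B+C) + D, with C fixed at 3 outs."""
    a = s1 + e + s2 + s3 + bb + hbp + ibb - cs - dp
    b = (0.88 * (s1 + e) + 2.42 * s2 + 3.96 * s3
         + 2.2 * hr + 0.11 * (bb + hbp) + 0.99 * sb)
    c = 3
    d = hr
    return a * b / (b + c) + d
```

A solo home run and nothing else gives exactly 1.0 runs, as it should: the A factor is empty, and the D factor supplies the run.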
I tested the accuracy of each estimator in three ways:
 Correlation, measured in R. R ranges from –1 to 1, with 0 meaning that there is no relationship between the two datasets, 1 meaning that there is a perfect positive relation between the two datasets, and –1 meaning there is a perfect negative relationship between the two datasets. Higher is better.
Average Absolute Error (also called Mean Absolute Error). It’s a measure of the average distance from the estimated value to the actual value. Distance is figured by taking the absolute value (in other words, making negative values positive) of the difference between estimated runs and actual runs. Lower is better.
 Root Mean Square Error. Similar to AAE, except instead of taking absolute value, you take the square of the difference between estimated runs and actual runs, find the average, and then find the square root. The key difference between AAE and RMSE is that RMSE especially penalizes larger errors in estimation. Lower is better.
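The three measures can be computed with nothing beyond the standard library; a sketch:

```python
import math

def correlation(est, actual):
    """Pearson's R between estimated and actual runs."""
    n = len(est)
    me, ma = sum(est) / n, sum(actual) / n
    cov = sum((e - me) * (a - ma) for e, a in zip(est, actual))
    se = math.sqrt(sum((e - me) ** 2 for e in est))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    return cov / (se * sa)

def aae(est, actual):
    """Average Absolute Error: mean distance from estimate to actual."""
    return sum(abs(e - a) for e, a in zip(est, actual)) / len(est)

def rmse(est, actual):
    """Root Mean Square Error: like AAE, but larger misses hurt more."""
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(est, actual)) / len(est))
```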
Each estimator was tested against the innings from Retrosheet’s play-by-play data.
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711.
All data is from 1956–2007, with the exception of the 1954 NL dataset. I generated line scores for each inning using a MySQL database based upon Tango’s RetroSQL schema.
And now, the results of that trial:
        R        AAE      RMSE
RC      0.899    0.334    0.583
BsR     0.916    0.259    0.425
XR      0.871    0.374    0.495
BsR is the clear winner in all three tests; RC is second in R and AAE, while XR is second in RMSE. Rather than try to explain what’s going on here, I’ll show you.
These are scatterplots of our three estimators against the actual run values. Now, due to the fact that you can only have integer values for actual run scoring, and frequently have fractional estimates, you’ll notice an odd striping pattern. And, since we’re plotting roughly 1.9 million points in a relatively small space, they look like line graphs instead of scatterplots. Trust me, I watched the plotter: they’re a series of points.
Here’s the graph for Runs Created. (Click on it for a larger picture.) Note that at higher levels of run scoring, estimates start to outstrip actual run scoring – actual runs in an inning top out at 16 in the Retrosheet era, while RC tops out at over 20 estimated runs.
RC also has very tall bars. Those tell us that RC has a lot of variance in its estimates. On the sides of the graph (and all the others, I should say) is a boxplot, showing us that most of the values of both estimated and actual runs are right in the zero-to-one range.
XR has a few things going against it. Unlike RC, which tops out too high, XR tops out too low, at about ten runs. It also (as mentioned previously) has a problem with negative run estimates, even at actual run scoring of one run in an inning. All of our models suffer from missing information (like balks, catcher’s interference, or wild pitches) that can contribute to run scoring. [There are other reasons you can get negative estimates and positive runs, I should add.]
If RC was too hot, then XR is too cold.
BaseRuns, if it isn’t just right, is at least closer to functioning at the high end of run scoring than XR or RC. It also seems to have the smallest bars of any run estimator for almost all sets of values. It does seem to have a tendency to trend a little low for its run scoring estimates.
But it’s the only graph where the size of the x axis and the y axis are the same; BaseRuns seems to fit a lot better at the high end of run scoring than the other models, while not performing any worse down at the lower end of run scoring.
This isn’t the final word on run estimator accuracy. We can construct better, more detailed estimators of run scoring based upon these frameworks. (Which means more for next week!)
But this is the end of the line for Runs Created, at least as far as this series is concerned. Linear weights, despite not performing as well in the tests as our other estimators, has a lot of applications that a dynamic estimator is not well-suited for. Runs Created, on the other hand, brings nothing to the table that BaseRuns does not, except inaccuracy.
Next time: Excitement! Danger! Catcher’s interference! Same Bat time, same Bat channel!
This might be the nerdiest thing I’ve ever typed, but now I’m looking forward to Part 3 next week. I’m shivering with antici—— say it! — pation.
Great article. One quick question about how you get the 1.616 linear weights value for a double from the run expectancy chart. You mention that you average the value of a double from all 3×8=24 entries in the table. My question is, would it be correct to want to do a weighted average as we are in some states much more often than others? ie – Should a bases loaded state receive the same weighted average as a double with nobody on base?
vr, Xei
Xei – That’s how you calculate the run value of that particular double. When calculating the linear weight of a double, you do that for every double in the dataset, and then average that.
Colin, the methods you used for estimating runs at the inning level are pretty sound. Although you treated Linear Weights unfairly. XR is a regression equation. You could derive a Linear Weights equation easily from the Linear Weights you calculated here:
http://www.editgrid.com/user/cwyers/linear_weights_version_0.1
You also did a great job explaining how each run estimator works, especially the pros and cons of each.
Colin,
1. One possibly minor quibble: for Runs Created, the formula in the Bill James Handbook does not use errors in the A or B factors. You say the version of RC you tested is “based on” that formula, but you haven’t changed any of its coefficients. The coefficients in the Handbook version of the formula should be fitted to the absence of information about errors, which promote scoring, and other events, such as pickoffs and outs trying to advance, which diminish scoring. Maybe the references to errors are just typos in your description of the formula you used, but if not, I think the version of the formula you posted would overestimate runs.
2) How exactly did you do this “by innings”? The estimators should predict values like 0.37 or 2.28 runs for any individual inning, but you are reporting R values in the neighborhood of .9, presumably against the actual (integer-value) runs in the inning. My naive intuition is that you couldn’t get correlations that high unless you have a more complicated method for testing the estimates against the actuals.
thanks!
All three of the run estimators were suboptimal constructions given the data set at hand. I will be rolling out more appropriately “tuned” run estimators in part III (and yes, that includes empiric linear weights).
As far as correlation goes, it’s important to note that correlation simply measures how closely related two things are – correlation tells us that a low run-scoring environment gives us a low RC figure and that a high run-scoring environment gives us a high RC. It doesn’t tell us how accurate our run estimators are in the strictest sense. This is how you can run correlations of OPS against run scoring, or weight against home run production – you don’t even have to use the same units.
I turned off the correlation line in the scatterplots because it was very hard to read, but if you look at the general slope of those plots, that’s the correlation. It’s possible to have a high correlation and a very poor RMSE, or to have an excellent RMSE with a poorer R value.
Colin,
Is that BP link in your discussion of RC supposed to go to ESPN.com’s glossary?