Creating a dynamic FIP with BaseRuns
August 15, 2008
If you’re interested in starting a fistfight at the next SABR convention (not that I’m advising this) simply start bringing up DIPS in casual conversation loudly enough and I’m sure you can get something going. Voros McCracken set up the sabermetric version of the “less filling, tastes great” argument when he wrote:
There is little if any difference among major-league pitchers in their ability to prevent hits on balls hit in the field of play.
Suffice it to say that not everyone agrees with this.
But what everyone does agree on is that pitchers have far less control over the outcome of a ball in play than they do over the so-called Three True Outcomes: the walk, the strikeout and the home run.
From this, McCracken constructed dERA, essentially a run estimation model that attempts to isolate a pitcher’s performance from that of his defense.
For those looking for a quick-and-dirty shortcut for dERA, Tom Tango’s FIP is generally relied upon:
(13*HR + 3*BB - 2*K)/IP + 3.2
3.2 is the league factor that puts FIP on the same scale as ERA.
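As a minimal sketch, FIP can be written as a small Python function. The function and parameter names are my own for illustration, and IP is assumed to be in decimal innings (200 1/3 innings is 200.333, not 200.1).

```python
def fip(hr, bb, k, ip, league_factor=3.2):
    """FIP: (13*HR + 3*BB - 2*K)/IP plus a league factor.

    ip is assumed to be decimal innings; league_factor defaults to the
    3.2 quoted in the text and would be re-tuned for other contexts.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * bb - 2 * k) / ip + league_factor


# Made-up pitcher line, for illustration only.
print(round(fip(hr=20, bb=50, k=180, ip=200), 2))
```

For that made-up line, (260 + 150 - 360)/200 + 3.2 works out to 3.45.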
FIP is also often used as a sort of component ERA, to estimate a player’s ERA from his projected component stats. There is, of course, Bill James’ Component ERA for those purposes as well. (Confoundingly enough, Component ERA is traditionally abbreviated ERC. Since "Earned Runs Created" describes what ERC is and does perfectly, that’s what I tell myself ERC stands for.)
So I decided to run a comparison of some of these run estimators.
The formulas I used are available from these sources:
I will explain what BaseRuns is in just a moment. First, the study. I took every pitcher in the Baseball Databank from 1956 on who recorded at least one out; if your favorite run estimator wasn’t included, it’s probably because it used statistics not available in the BDB. I looked at the average error of each metric compared to ERA and RA (which is like ERA, but includes unearned runs) in the same season.
[Average error is essentially the average of the absolute value of the difference between the estimator and the actual. In this case, I’m weighting by the number of innings pitched. A sample formula from Excel – SUMPRODUCT(ABS(RA-BsERA),IP)/SUM(IP)]
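The same weighted average error can be sketched in Python. The function name and the sample numbers are illustrative assumptions, not data from the study.

```python
def weighted_avg_error(estimates, actuals, weights):
    """Weighted mean absolute error, mirroring
    SUMPRODUCT(ABS(actual - estimate), weight) / SUM(weight)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_errors = (
        abs(est - act) * w for est, act, w in zip(estimates, actuals, weights)
    )
    return sum(weighted_errors) / total_weight


# Two made-up pitchers: the 200-inning pitcher dominates the average.
print(weighted_avg_error([3.0, 5.0], [4.0, 4.5], [200, 50]))
```

Here the errors are 1.0 and 0.5 runs, so the weighted result is (1.0*200 + 0.5*50)/250 = 0.9.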
Estimator  RA    ERA
BsRA       0.43  0.65
BsERA      0.46  0.43
RA         0.00  0.44
ERA        0.44  0.00
FIPERA     0.66  0.74
ERC        0.69  0.57
DIPS1      0.72  0.80
Our mystery contestant BaseRuns seems to be walking away with the field. But, to be fair, DIPS and FIP are both constructed in such a way as to better predict future performance. So let’s look at the year-to-year average error.
But what to use as our weights – the innings pitched in Y1, or Y2? How about both – or really the harmonic mean of both (strictly, half of it, but a constant factor makes no difference to a weight). Simply put:
1 / (1/IP1 + 1/IP2)
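As a sketch, the weight is a one-liner in Python. The function name is hypothetical, and returning zero for a season with no innings is my own assumption; the study only included pitchers who recorded at least one out.

```python
def pair_weight(ip1, ip2):
    """Weight for a year-1/year-2 pair: 1 / (1/IP1 + 1/IP2)."""
    if ip1 <= 0 or ip2 <= 0:
        # Assumption: a pitcher with no innings in either season gets no weight.
        return 0.0
    return 1.0 / (1.0 / ip1 + 1.0 / ip2)


print(pair_weight(200, 200))  # equal workloads
print(pair_weight(10, 200))   # a cup of coffee in year 1
```

Two 200-inning seasons give a weight of 100, while 10 and 200 innings give a weight a little under 10, so the smaller season dominates.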
That gives us weights that heavily trend toward the smaller of the two numbers, so that pitchers who pitch very little in one of the two seasons figure less into the totals than pitchers who pitch a lot in both years. So, without further ado, here’s how our various estimators tracked ERA and RA in year 2:
Estimator  RA    ERA
BsRA       1.05  1.07
BsERA      1.06  0.97
RA         1.10  1.08
ERA        1.18  1.03
FIPRA      0.94  0.95
ERC        1.17  1.03
DIPS       0.95  0.97
FIP and DIPS start to earn their keep here. That said, BaseRuns holds its own. "Earned Runs Created" barely rates a mention at this point; rounding to two decimal places erases its advantage over ERA in predicting future performance.
[I know some of you are wondering how on earth ERC is faring so poorly. Real sabermetricians have done real studies showing how well ERC does! The trick is that a lot of sabermetric studies are based upon seemingly arbitrary cutoffs based upon playing time. You have to be careful – what you end up with are results that apply real well to good players, but are questionable for the population as a whole.]
BaseRuns’ power is in its dynamic model of run scoring; FIP and DIPS are both essentially linear models of run scoring. Their power is in their (admittedly crude) regression component. So is there a way for us to combine the awesome powers of BaseRuns with the awesome power of FIP to create the greatest predictor of ERA the world has ever seen?
If you can do some algebra and laugh maniacally, the answer is YES!
…okay, so the answer isn’t yes. But let’s take a look at it anyway.
And here’s that explanation of BaseRuns that was promised earlier. BaseRuns is a dynamic run estimator; it is similar to Runs Created in much the same way that an F-16 is similar to a Piper Cub – both of them do the same thing, but BaseRuns does it more-better.
You will often hear sabermetric horror stories about BaseRuns. There’s the one about how the young couple out on Lover’s Lookout heard the radio report where BaseRuns was on the loose from its Excel spreadsheet, and the lady convinces her beau to drive her home… and when they arrive home, they open the car door only to find a B factor hanging from the doorhandle!
Don’t believe the hype. BaseRuns is entirely safe for children, pregnant women and Murray Chass. The basic form of the BaseRuns equation is:
A*B/(B + C) + D
A represents baserunners, B/(B+C) represents the percentage of baserunners who score, and D is equal to home runs.
Simple BaseRuns expressions for the four factors are:
A = H + W – HR
B = (1.4*TB – .6*H – 3*HR + .1*W)*1.02
C = AB – H
D = HR
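A minimal Python sketch of the simple version follows. The function name and the sample batting line are made up for illustration.

```python
def base_runs(ab, h, tb, hr, w, calibration=1.02):
    """Simple BaseRuns: A*B/(B + C) + D, using the four factors above."""
    a = h + w - hr                                        # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * calibration  # advancement
    c = ab - h                                            # outs
    d = hr                                                # home runs always score
    if b + c == 0:
        # Degenerate line with no advancement and no outs.
        return float(d)
    return a * b / (b + c) + d


# Made-up team-season line, for illustration only.
print(round(base_runs(ab=5500, h=1450, tb=2300, hr=160, w=550), 1))
```

Note that a line made up entirely of home runs scores exactly one run per home run, because A is zero and only D is left.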
The 1.02 is a calibration factor that tunes the equation to a specific run context. You can find many additional derivations of BaseRuns that get more complex as they incorporate more data. There’s also a version of BaseRuns designed to work with official pitching statistics, which is what we’ve been using so far:
A = H + W – HR
B = (1.4*TBe – .6*H – 3*HR + .1*W)*1.1
C = 3*IP
D = HR
Where TBe = 1.12*H + 4*HR
Essentially we’re estimating the number of extra-base hits on balls in play. Everything else is basically the same. To convert to RA, just divide by IP and multiply by 9. To adjust it to ERA’s scale, multiply by .92. (That’s a simplistic model of unearned runs – we could construct a better one, but that’s an exercise for another time.)
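Here is a sketch of the pitching version with the RA and ERA conversions in Python. The function names (pitcher_base_runs, bs_ra, bs_era) and the sample line are my own, and IP is assumed to be in decimal innings.

```python
def pitcher_base_runs(h, w, hr, ip, calibration=1.1):
    """BaseRuns from official pitching stats, with total bases estimated."""
    tbe = 1.12 * h + 4 * hr                               # estimated total bases
    a = h + w - hr
    b = (1.4 * tbe - 0.6 * h - 3 * hr + 0.1 * w) * calibration
    c = 3 * ip                                            # outs recorded
    d = hr
    return a * b / (b + c) + d


def bs_ra(h, w, hr, ip):
    """Estimated runs per nine innings."""
    return 9 * pitcher_base_runs(h, w, hr, ip) / ip


def bs_era(h, w, hr, ip, earned_share=0.92):
    """RA scaled to ERA with the flat .92 multiplier from the text."""
    return earned_share * bs_ra(h, w, hr, ip)


# Made-up pitcher line, for illustration only.
print(round(bs_ra(h=200, w=60, hr=20, ip=210), 2))
print(round(bs_era(h=200, w=60, hr=20, ip=210), 2))
```

For that made-up line the estimate comes out at about 4.00 RA and about 3.68 on the ERA scale.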
If we want to introduce a FIP element, we need to replace hits with an estimate of hits, based on an average defense:
(.290 * (BFP – K – BB – HR))
Where BFP is Batters Faced by Pitcher, the equivalent of plate appearances. We hold the C and D factors the same, and only need to figure A and B factors. Simply substitute the above for H, and do some simplifying, and you get:
A = 0.29 * BFP – 0.29 * K + 0.71 * BB – 1.29 * HR
B = 0.31 * BFP – 0.31 * K + 0.2 * BB + 2.55 * HR
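The substitution can be sketched in Python by plugging the hit estimate into the pitching BaseRuns factors rather than hard-coding the collapsed coefficients. The function names and the sample line are hypothetical. Expanding this by hand reproduces the printed A coefficients. For B, my own expansion under the definitions above gives a BB coefficient of roughly -0.2 (0.11 from the walk term minus about 0.31 from the lost ball in play), so the sign of that term may be worth double-checking.

```python
def fip_bsr_factors(bfp, k, bb, hr, ip, babip=0.290, calibration=1.1):
    """BaseRuns factors with hits replaced by babip * (BFP - K - BB - HR)."""
    h = babip * (bfp - k - bb - hr)          # estimated hits, substituted for H
    tbe = 1.12 * h + 4 * hr
    a = h + bb - hr
    b = (1.4 * tbe - 0.6 * h - 3 * hr + 0.1 * bb) * calibration
    c = 3 * ip
    d = hr
    return a, b, c, d


def fip_bs_ra(bfp, k, bb, hr, ip):
    """Defense-independent BaseRuns estimate, per nine innings."""
    a, b, c, d = fip_bsr_factors(bfp, k, bb, hr, ip)
    return 9 * (a * b / (b + c) + d) / ip


# Made-up pitcher line, for illustration only.
a, b, c, d = fip_bsr_factors(bfp=850, k=180, bb=60, hr=20, ip=210)
print(round(a, 1), round(b, 1))
print(round(fip_bs_ra(850, 180, 60, 20, 210), 2))
```

For that line A is 211.1, which matches 0.29*850 - 0.29*180 + 0.71*60 - 1.29*20, and the estimate is roughly 3.40 runs per nine innings.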
And so how well does FIPBsRA compare to RA?
Estimator  RA    ERA
FBsRA1     0.93  0.96
FBsRA2     0.56  0.68
FBsERA1    0.97  0.88
FBsERA2    0.62  0.55
In each pair, the row ending in 1 is year 1 to year 2, and the row ending in 2 is year 2 to year 2. We sacrifice a fair bit of accuracy in our same-year group by using FIPBsRA as opposed to BsRA, but do a good job of increasing our accuracy in predicting future performance.
Of course, with 2008 we only have a partial season’s worth of data to go with, so future performance is exactly what we’re interested in. And so, I’ve helpfully calculated FIPBsERA for you. (Data via the Hardball Times.) Want to play around with it? Here’s your FIPBsERA calculator.
[I should mention, incidentally, that McCracken already has covered some of this ground before with his DIPS BsR. Discussion can be found here.]
Very cool, Colin, but I’m confused. Is ERC listed in your tables?
Apparently the editing tool that WordPress uses doesn’t like my tables – I went in and fixed a typo last night and it must’ve borked the tables. They should be fixed now. Sorry for the inconvenience.
Great. Thanks Colin. Second dippy question: Is FIPERA in the first table the same as FIPRA in the second? How about DIPS1 vs. DIPS?
Last question: if you’re estimating hits in FIPBsRA, what is the difference between that and FIP?
Yeah, those entries are the same. Don’t know why FIPERA changed to FIPRA in the second table; I changed the name of DIPS because I was using DIPS1 to stand both for DIPS 1.0 and DIPS in Year 1.
The difference between FIP and FIPBsRA is in the nature of the run estimation. FIP is essentially a linear weights formula. If one player has a higher home run rate and a higher walk rate than average, that player’s ERA will be understated by FIP, because it uses a static multiplier for home runs and walks. In reality, a home run “costs” a player with an above-average walk rate more than a player with an average walk rate, because he’s more likely to have runners on base when he allows his home runs. That’s the advantage that a dynamic run estimator like BaseRuns has over linear weights. (The reason we can use linear weights for individual hitters is that they have a much smaller impact over their overall run environment than pitchers do.)
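To make that concrete, here is a small Python sketch that uses the pitching BaseRuns formula from the article to price one extra home run for two otherwise identical pitchers. The function names and pitcher lines are made up, and innings are held fixed when the home run is added, which is a simplification.

```python
def pitcher_base_runs(h, w, hr, ip, calibration=1.1):
    """Pitching BaseRuns from the article: A*B/(B + C) + D."""
    tbe = 1.12 * h + 4 * hr
    a = h + w - hr
    b = (1.4 * tbe - 0.6 * h - 3 * hr + 0.1 * w) * calibration
    c = 3 * ip
    return a * b / (b + c) + hr


def marginal_hr_value(h, w, hr, ip):
    """Runs added by one more home run (one more hit, one more HR, same IP)."""
    return pitcher_base_runs(h + 1, w, hr + 1, ip) - pitcher_base_runs(h, w, hr, ip)


# Same made-up pitcher, two walk totals.
low_walks = marginal_hr_value(h=200, w=40, hr=20, ip=210)
high_walks = marginal_hr_value(h=200, w=100, hr=20, ip=210)
print(round(low_walks, 2), round(high_walks, 2))
```

With these inputs the extra home run costs about 1.66 runs for the 40-walk pitcher and about 1.83 for the 100-walk pitcher, whereas FIP charges both the same 13/IP.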
As for your comments on THT – realistically all of the stats could have been improved with a seasonal tuning adjustment. FIP and BsR both have explicit tuning factors, but ERC and DIPS could both be tuned to the specific dataset as well. I don’t know if that would impact the year-to-year error, though, and that’s probably the most important.
Thanks, Colin. Makes a lot of sense.
Couldn’t we then simply improve FIP by adding in a multiplicative term (BB rate * HR rate) into the regression and seeing what the coefficient is on that one?
I hate to be a bother, but how does tRA do in this analysis?
What you’re proposing, if I’m right, would be something akin to:
A*HR + B*BB – C*K + D(HR*BB)
Of course, then we’d need to sneak K into that multiplier somehow (division?), because a high-K pitcher has a lower run value for his walks, strikeouts and hits. I have no idea how that would impact accuracy. I do think it would make the entire proceedings a lot more awkward.
BaseRuns also does several nice things for us that other run modelers don’t do – the value of the home run is bounded between 1 and 4 runs, for example. Runs Created doesn’t do that.
Most traditional run estimators – dynamic or linear – tend to break down at the extremes. Check out the graphs here for an illustration of how. For hitters, that’s not that big of an issue.
But, for instance, in his career Mariano Rivera has an opposing line of .214/.268/.293. Greg Maddux? .250/.291/.358. Conversely, on average, hitters are batting .307/.403/.527 this season against Miguel Batista. When we want to evaluate pitchers, we have enough aces and gas cans that we run up against the limits of other run estimators reasonably quickly.
Now, there is a lot of value in the sheer simplicity of FIP. This:
(0.29 * BFP – 0.29 * K + 0.71 * BB – 1.29 * HR)*(0.31 * BFP – 0.31 * K + 0.2 * BB + 2.55 * HR)/((0.31 * BFP – 0.31 * K + 0.2 * BB + 2.55 * HR) + (IP*3)) + HR
isn’t something you can carry around in your head and use routinely like FIP is. But you’re trying to balance that simplicity against accuracy. By the time you’ve made a more accurate FIP construct, you might as well be using DIPS or BsR, because you’ve lost that simplicity and haven’t really gained enough accuracy to make it worth the extra trouble.
(Of course, you could argue the same for FIPBsRA. We’re taking a shortcut around using a real regression model like a Marcels projection. Ideally what we want to do is our full, robust component regression, and then apply BaseRuns.)
Dan, several methods were left out due to the decision to use only “official” pitching statistics. Left out – and this is just off the top of my head – were:
tRA
xFIP
DIPS 2.0 & 3.0
QuikERA
I did leave open the possibility of going back and evaluating those methods, though; the years of the study were specifically chosen because I have batted-ball data available via Retrosheet. I’ll have to consider it.
I suppose I’m looking more for a proper regression analysis with interaction terms all over the place. FIP is nice because it is simple, but I prefer to err on the side of more complex but more accurate. It’s a personal preference. I like things a little messy.
We’re trying to walk and chew gum at the same time here, and so we’re taking some shortcuts. The “correct” way to predict future pitching performance is through regressing the component stats (including a weighted average as well), and then to feed those components into a run estimator.
FIP takes two shortcuts – it uses a linear run estimator, and it simply regresses BABIP 100% and doesn’t regress anything else. For all that, it’s pretty powerful.
I will be posting Thursday on enhancing Marcel. So far, I’ve only done batting, but after reading this article I thought that for pitching we could follow the same procedure of weighting each component over the past seasons, regressing to the mean, and then using this formula as the run estimator with which to rank the pitchers.