Improving BABIP Projection by Batted Ball Types
May 10, 2009 5 Comments
In my last article for StatSpeak, I tested the major projection systems’ abilities to accurately project a variety of statistics other than general production measures such as OPS or wOBA. One of the statistics that I tested thoroughly was Batting Average on Balls In Play (BABIP). The projection systems were much worse at projecting BABIP than other statistics, like homeruns, walks, and strikeouts. Presumably, that is because hitters vary far more in their abilities to achieve or avoid those outcomes, while BABIP is based largely on luck. However, hitters do vary in their abilities to hit safely on balls in play, and seeing as about 70% of plate appearances result in a ball in play (and not a homerun), it is important to project BABIP well. As far as I know, none of the major projection systems use BABIP by batted ball type. It is my belief that doing so would greatly improve projection overall.
In this article, I will continue my study of hitters’ BABIP, using more data than I previously had access to. Not only are the results different, but I was able to test more variables than before since I had more observations. In my last article, I used data on hitters who had 300 PA in each season from 2005-2008 to build my dataset. In this article, I was able to incorporate detailed data on hitters who had 300 PA from 2003 to 2008, which allowed me to study far more observations.
When using three years of previous data to predict a hitter’s BABIP, 2005-2008 only allowed me to look at 121 hitters. However, adding in the four-year ranges of 2003-2006 and 2004-2007 allowed me to use 381 hitters (or more accurately, 381 instances of a hitter getting 300 PA in four consecutive seasons, as some hitters were included more than once). I was also able to extend my study of predicting BABIP on groundballs, line drives, and flyballs by using all of this additional data.
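The way those observations are counted can be sketched in Python. This is only a sketch under one reading of the text: an observation is a run of four consecutive seasons (three predictor seasons plus the season being predicted) in which a hitter had at least 300 PA every year. The function name and the PA totals in the example are hypothetical.

```python
def qualifying_windows(pa_by_season, first=2003, last=2008, span=4, min_pa=300):
    """Return (start, end) season pairs where the hitter had at least
    min_pa plate appearances in every season of a span-year window."""
    windows = []
    # Slide a span-year window across the seasons covered by the data.
    for start in range(first, last - span + 2):
        seasons = range(start, start + span)
        if all(pa_by_season.get(season, 0) >= min_pa for season in seasons):
            windows.append((start, start + span - 1))
    return windows


# Hypothetical hitter who cleared 300 PA every year from 2003 to 2008:
# he contributes three observations, one per four-year window.
everyday_player = {2003: 600, 2004: 610, 2005: 590, 2006: 620, 2007: 630, 2008: 640}
print(qualifying_windows(everyday_player))
```

A hitter like this is the reason the 381 figure counts player-windows rather than distinct players.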
The regression lines changed in several ways. For one thing, line drive percentage came out significant in projecting BABIP again. It was surprising that it was so statistically insignificant in the 2005-2008 data, but it now does seem to be a persistent enough skill that it can be used in BABIP projection.
Another particularly nice thing about using more data is that you can minimize the effect of outliers. For instance, the coefficient on the natural log of contact rate (the percentage of pitches a hitter swings at that he either puts in play or fouls off) was cut in half, and the reason is fascinating. Coefficients in a multiple regression equation tell you how to adjust the expected dependent variable (in this case, BABIP) for a change in one variable, given what all the other variables were. In my regressions, I incorporated groundball percentage, BABIP on groundballs, BABIP on flyballs, and correlates of BABIP on line drives. Additionally, I used the natural log of contact rate to help indicate which direction BABIP should be expected to go given a hitter’s ability to make contact, since the other coefficients served to regress BABIP back to the mean as far as historical tendencies indicated it should.
However, the strong negative coefficient on natural log of contact rate indicated that hitters with poor contact skills would see their BABIP fall more quickly. That coefficient remains significant and negative. However, Ryan Howard screwed with my data when I only looked at 121 observations. His BABIP fell dramatically after being strong in 2005-2007 (.358, .363, and .336), down to only .289 in 2008. The regression wants to avoid the large error term that Howard would have left, and the ways that Howard differed from the rest of the league are how low his contact rate is and how high his homerun rate is. Since the natural log formulation served to expand the difference between his contact rate and the rest of the league and to contract the difference between his homerun rate and the rest of the league, the model used contact rate as the cause of low BABIPs. I realized this error when I began doing BABIP projection for many other players and found that guys like Adam Dunn and Mark Reynolds, who have poor contact rates, would have BABIPs around .240 or lower by my equation, which clearly did not sound right.
The reason that Howard’s BABIP fell so much in 2008 is not that he is a poor contact hitter. Being a Phillies fan, I know from observation that defenses became better at shifting against him. While teams began to shift against him towards the end of 2006 and throughout 2007, they had not yet determined where exactly to position their players. They began playing the second baseman even deeper into right field and closer to the line compared with other infield shifts, and this depressed his BABIP a lot. Initially, I only used contact rate in my equation at all because it is correlated with a high BABIP on groundballs, but that is probably because hitters who do not center the bat on the ball often weakly tap balls off the bottom of the bat. That does not describe Howard, who rips groundballs into the same predictable locations. As a result of this and some bad luck, Howard saw his BABIP fall in 2008, and the model tried to correct for that by crediting his low contact rate as the cause. This is a pretty obvious warning to me to use more data for regression equations. After seeing that the coefficient was much smaller when I removed Howard from the equation, I thought it was smart to test how the model did with more data.
HOW DOES THE PREVIOUS MODEL HOLD UP?
The table below describes the coefficients for the regression equation for BABIP using the same independent variables as I did in my previous article. Later on, I will improve this equation, and also provide equations for BABIP by batted ball type and for projecting BABIP using fewer than three years of data.
The values in the table include the coefficients, with the standard deviations in parentheses. Variables with p-values less than 1% (strongly statistically significant) are bolded and italicized, variables with p-values between 1% and 5% (statistically significant) are bolded, and variables with p-values between 5% and 10% (weakly statistically significant) are italicized. The first column is for the old data set and the second column is for the new, larger data set. Note that each variable is the non-weighted average of the previous three years of data for that variable.
| Variable | Small Dataset | Large Dataset |
| GB% | 0.934 (0.333) | 0.319 (0.182) |
| Groundball BABIP (GB-BABIP) | 1.48 (0.577) | 0.515 (0.311) |
| GB-HIT% (GB%*GB-BABIP) | -2.89 (1.32) | -0.596 (0.708) |
| Infield flies/flyball | -0.362 (0.073) | -0.336 (0.418) |
| OFHIT% (FB%*FB-BABIP) | 0.717 (0.264) | 0.243 (0.0135) |
| LN(HR/AB) | 0.0137 (0.00499) | 0.0119 (0.00332) |
| LN(CONTACT%) | 0.0161 (0.0449) | 0.00889 (0.0277) |
| _CONSTANT | -0.0808 (0.146) | 0.178 (0.0802) |
Look how different these coefficients are! It seems that the interaction term between groundball percentage and groundball BABIP is no longer significant at all, and we will see that I remove it from my later equation. Leaving it in the equation actually messes with the coefficients on the two variables themselves. The interaction term still did come out negative, but not nearly as much as before. Infield flyball percentage remained pretty similar, as did natural log of homerun rate. The percentage of balls contacted that were flyballs that fell in the outfield is only weakly statistically significant now, and its coefficient is less than half its previous size, so it is no longer as good a measure as before. As I mentioned, the natural log of contact rate remains strongly statistically significant, but not nearly as much as before.
The R-squared fell from a preposterously high .39 last time down to .28 this time (meaning that actual BABIP correlated with projected BABIP at about .63 last time and only .53 this time). That is certainly still respectable, but it seems that the high correlation before might have come from fitting an equation to match up well with only 121 observations.
Realizing that this equation looked nowhere near as strong as it did before, I set out to construct an improved equation. Last time that I began studying BABIP, I started by looking at team-level data to consider the persistence of BABIP by batted ball type at a team level, and then moved on to work on projecting BABIP on groundballs, flyballs, and line drives here. This provided some insight into BABIP overall as a function of some of the regressors used to project BABIP for batted ball types. I decided to redo this analysis with the larger quantity of data.
FLYBALLS
I began with flyballs. Hitters who do well on flyballs do so primarily by avoiding infield flies. On top of that, they also tend to hit their flyballs over a larger area of the field. This is because they spread the ball around well from left to right and consistently hit the ball deep enough that their flyballs land all over the outfield. In my previous study using only 2005-2008 data, I only found infield fly rate to be statistically significant, but this time, I found that outfield flyball BABIP was strongly statistically significant as well. Additionally, switch hitters did incredibly well on flyball BABIP. As before, each variable referred to below is the non-weighted average of the previous years of data for that variable. The regression for projecting flyball BABIP comes out looking like this:
| Variable | Coefficient (Standard Deviation) |
| Outfield flyball BABIP | 0.196 (0.0629) |
| Infield flies/flyball | -0.331 (0.0554) |
| Switch hitter (dummy variable) | 0.0136 (0.00474) |
| _Constant | 0.130 (0.014) |
Each of these three variables came out strongly statistically significant. It is surprising that switch hitters improve their BABIP on flyballs by 14 points of batting average compared to other hitters with similar histories of flyball BABIP. The current theory I have about why this is true is similar to the theory I had in my last article about why hitters with high groundball percentages tend to see their BABIP on groundballs fall further over time. Scouts can get less information on where hitters are likely to hit the ball if they have hit fewer balls in general. Switch hitters have not hit as many flyballs from each side of the plate, and so positioning fielders is a more difficult task. I do not know if this is why, but it could explain some of the effect.
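To make the switch-hitter effect concrete, here is a minimal sketch that plugs numbers into the flyball equation above. The coefficients come from the table; the function name and the input values are hypothetical, and the rates are assumed to be expressed as fractions (0.10 rather than 10).

```python
def project_flyball_babip(outfield_fb_babip, infield_flies_per_fb, switch_hitter):
    """Apply the flyball BABIP regression from the table above."""
    projection = 0.130                           # _Constant
    projection += 0.196 * outfield_fb_babip      # Outfield flyball BABIP
    projection -= 0.331 * infield_flies_per_fb   # Infield flies/flyball
    if switch_hitter:
        projection += 0.0136                     # Switch hitter dummy
    return projection


# Two hypothetical hitters with identical flyball histories.
one_sided = project_flyball_babip(0.17, 0.10, switch_hitter=False)
switch = project_flyball_babip(0.17, 0.10, switch_hitter=True)
print(round(one_sided, 4), round(switch, 4), round(switch - one_sided, 4))
```

The gap between the two projections is the dummy coefficient itself, which is where the 14 points come from.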
GROUNDBALLS
Last time I tested groundball BABIP (GB-BABIP), I found that it could be predicted well using historical GB-BABIP, infield hit rate, and contact rate. This time, I was able to get many other significant regressors as I had more data to work with. Historical GB-BABIP again explained a large portion of the variance in GB-BABIP in subsequent years. I refined my infield hit rate statistic, as I have in other research, by summing together the infield hits and the number of times a hitter reached on an error on a groundball, as this eliminates some of the noisy measurement error when official scorers decide whether to score something an infield hit or an error. I also included the number of triples per at-bat as a proxy for speed, which came out quite significant as well.
Faster hitters reach base on groundballs for several reasons. One is that they can beat out infield hits, but another is that fielders must play in to give themselves a better chance of throwing the runner out in time, so faster hitters can squeeze more balls through the hole and cause more errors.
Additionally, the contact rate on pitches in the strike zone proved to be a better correlate of GB-BABIP than overall contact rate (though it is now not quite statistically significant), as contact with pitches outside the strike zone often causes groundballs to be weakly hit. I also introduced the rates at which a hitter swings at pitches in and out of the strike zone, which have positive and negative effects, respectively. The coefficient on swinging at pitches out of the strike zone is not statistically significant, but I include it since its correlation with swinging at strikes is high enough that omitting it leaves the effect of swinging at more pitches in the strike zone statistically insignificant. The results yielded an R-squared of .22, which means that these variables explain 22% of the variance in groundball BABIP (a correlation of about .47 between projected and actual values).
Here are the regression coefficients with standard deviations.
| Variable | Coefficient (Standard Deviation) |
| GB% | 0.0739 (0.0357) |
| GB-BABIP | 0.420 (0.0809) |
| Reach safely/Infield GB | 0.194 (0.0936) |
| Z-Contact% | 0.0751 (0.0509) |
| Triples/At-bat | 0.846 (0.471) |
| Z-Swing% | 0.0801 (0.0441) |
| O-Swing% | -0.0546 (0.0450) |
| _Constant | -0.0240 (0.0616) |
It seems that there is an ability to hit balls through the hole, but a larger component seems to be the ability to reach safely on balls hit in the infield (which is pretty much based on speed). Good strike zone judgment seems to help hitters avoid chopping balls into the ground. Hitters who hit more groundballs actually are more likely to keep up their groundball BABIP, surprisingly enough.
LINE DRIVES
As I have mentioned before, BABIP on line drives is mostly related to luck. There is clearly a positive correlation between BABIP on line drives and power. This is especially true for the ability to hit line drive doubles. Hitters who hit a lot of line drive doubles are more likely to hit a lot of homeruns than other hitters with similar homerun tendencies historically, but this effect also works the other way. Hitters who hit a lot of homeruns have far higher BABIP on line drives. In my previous research, hitters who had high BABIPs on line drives in previous seasons were not any more likely than other hitters with similar homerun rates in previous seasons to have high BABIPs on line drives in subsequent seasons, but there now seems to be a positive (though not statistically significant) effect of previously high LD-BABIPs as well as previously high LD%. Switch hitters also did extraordinarily well on line drives. Here is that regression output:
| Variable | Coefficient (Standard Deviation) |
| LN(HR/AB) | 0.0220 (0.00526) |
| LD-BABIP | 0.121 (0.0845) |
| LD% | 0.237 (0.149) |
| Switch hitter | 0.0168 (0.00778) |
| _Constant | 0.658 (0.751) |
PUTTING IT ALL TOGETHER
Using all of this information, I began putting together a new, improved equation for projecting BABIP using three previous years of data. This equation included line drive rate, which now came out as statistically significant thanks to more data, in addition to groundball rate. It included GB-BABIP, but the interaction term between GB% and GB-BABIP was no longer statistically significant at all, so I left it out. The percentage of times a hitter reached safely on groundballs in the infield, (Infield Hits + Groundballs Reached on Error) / (Groundballs – Groundball Hits to the Outfield), was strongly statistically significant as well.
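The reach-safely rate defined above is simple enough to write out directly. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def reach_safely_rate(infield_hits, gb_reached_on_error, groundballs, gb_hits_to_outfield):
    """(Infield Hits + Groundballs Reached on Error) /
    (Groundballs - Groundball Hits to the Outfield)."""
    infield_groundballs = groundballs - gb_hits_to_outfield
    if infield_groundballs <= 0:
        raise ValueError("no groundballs stayed in the infield")
    return (infield_hits + gb_reached_on_error) / infield_groundballs


# Hypothetical season: 200 groundballs, 40 of them hits to the outfield,
# 15 infield hits, and 5 times reaching on an error.
print(reach_safely_rate(15, 5, 200, 40))  # 0.125
```

Folding errors into the numerator means the statistic does not depend on whether the official scorer called a close play a hit or an error.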
Additionally, while infield flies/flyball remained significant, strongly negative, and pretty close to the same value as in the previous equation, it seemed much more helpful to use outfield flyball BABIP (OFFB-BABIP) rather than the percentage of balls in play that were hits to the outfield.
Natural log of homerun rate and natural log of contact rate remained strongly statistically significant, but log of contact rate is now less than half as large due to the neutralization of the Ryan Howard effect.
This equation had an R-squared of .32, which means that it explains 32% of the variance in BABIP (a correlation of about .57 between projected and actual BABIP). As I have mentioned before, the binomial distribution suggests that pure chance should explain away at least a third of the variance, as the average hitter who reaches 300 PA has about 400 balls in play each year.
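The size of that chance component can be roughed out with the binomial standard deviation. This is a minimal sketch, assuming a true BABIP of .300 (a round figure chosen for illustration, not a number from the study) and the roughly 400 balls in play mentioned above.

```python
import math


def binomial_sd(true_rate, trials):
    """Standard deviation of an observed rate over a number of independent trials."""
    return math.sqrt(true_rate * (1 - true_rate) / trials)


# Assumed true BABIP of .300 over about 400 balls in play.
noise_sd = binomial_sd(0.300, 400)
print(round(noise_sd, 4))  # 0.0229
```

Under those assumptions, luck alone moves a single season's BABIP by about 23 points in either direction, which is why no equation can be expected to explain all of the variance.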
Here is the equation to use with three years of 300 PA or more available. Note that all eight of the independent variables were strongly statistically significant, producing p-values under 1%.
| Variable | Coefficient (Standard Deviation) |
| LD% | 0.293 (0.0779) |
| GB% | 0.167 (0.0311) |
| GB-BABIP | 0.183 (0.0556) |
| IFFB% | -0.268 (0.0440) |
| OFFB-BABIP | 0.111 (0.0473) |
| LN(HR/AB) | 0.0164 (0.00337) |
| LN(CONTACT%) | 0.0767 (0.0270) |
| Reach Safely/Infield Groundball | 0.171 (0.0659) |
| _Constant | 0.192 (0.027) |
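As a worked example, the three-year equation can be applied like this. The coefficients are the ones in the table above; everything else is an assumption made for illustration: the dictionary keys and function name are hypothetical, all rates are taken to be fractions rather than percentages, each input is taken to be the unweighted three-year average, and the sample hitter is invented.

```python
import math

# Coefficients from the three-year table; the key names are hypothetical labels.
THREE_YEAR_COEFFICIENTS = {
    "ld_pct": 0.293,
    "gb_pct": 0.167,
    "gb_babip": 0.183,
    "iffb_pct": -0.268,
    "offb_babip": 0.111,
    "ln_hr_per_ab": 0.0164,
    "ln_contact_pct": 0.0767,
    "reach_safely_infield_gb": 0.171,
}
THREE_YEAR_CONSTANT = 0.192


def project_babip_three_year(ld_pct, gb_pct, gb_babip, iffb_pct, offb_babip,
                             hr_per_ab, contact_pct, reach_safely_infield_gb):
    """Apply the three-year equation to unweighted three-year averages."""
    inputs = {
        "ld_pct": ld_pct,
        "gb_pct": gb_pct,
        "gb_babip": gb_babip,
        "iffb_pct": iffb_pct,
        "offb_babip": offb_babip,
        # The equation uses natural logs of homerun rate and contact rate.
        "ln_hr_per_ab": math.log(hr_per_ab),
        "ln_contact_pct": math.log(contact_pct),
        "reach_safely_infield_gb": reach_safely_infield_gb,
    }
    return THREE_YEAR_CONSTANT + sum(
        THREE_YEAR_COEFFICIENTS[name] * value for name, value in inputs.items()
    )


# Invented hitter, not a real player.
projection = project_babip_three_year(
    ld_pct=0.20, gb_pct=0.45, gb_babip=0.24, iffb_pct=0.10, offb_babip=0.17,
    hr_per_ab=0.03, contact_pct=0.80, reach_safely_infield_gb=0.08,
)
print(round(projection, 3))
```

With these made-up inputs the sketch lands at roughly .301, which suggests that treating the rates as fractions is the right scale for these coefficients.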
This would help you project 155 major leaguers for the 2009 season. However, there are many players who will get a lot of major league at-bats in 2009 who failed to get 300 PA in 2006. In the next section, I will provide estimates for the regression equations using two previous years of data, and using one previous year of data.
FEWER YEARS OF DATA TO WORK WITH
Using only two previous years of data left very similar coefficients, but pulled almost all of them closer to zero. They all remain statistically significant, and all but natural log of contact rate remain strongly statistically significant. The coefficient on the percentage of infield groundballs that a hitter reaches safely on actually became larger relative to the others (the only variable to do so). I imagine that this is because the product of speed from three years ago may provide some insight into how speedy a player is now, but less so than more recent data. Players just beginning their careers may see larger effects of this on BABIP. Here is the output using two previous years of data:
| Variable | Coefficient (Standard Deviation) |
| LD% | 0.259 (0.0563) |
| GB% | 0.106 (0.0241) |
| GB-BABIP | 0.113 (0.0410) |
| IFFB% | -0.226 (0.0322) |
| OFFB-BABIP | 0.0880 (0.0332) |
| LN(HR/AB) | 0.00975 (0.00248) |
| LN(CONTACT%) | 0.0399 (0.0201) |
| Reach Safely/Infield Groundball | 0.150 (0.0412) |
| _Constant | 0.211 (0.0201) |
This equation had an R-squared of .24, meaning that 24% of the variance in BABIP can be explained by just two years of data on these variables (a correlation of about .49 between projected and actual BABIP). This equation used 635 observations.
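Since both R-squared values and correlations are quoted throughout this article, it helps to keep the two straight: R-squared is itself the share of variance explained, and its square root is the correlation between projected and actual BABIP. A small sketch of the conversion (the function name is hypothetical):

```python
import math


def correlation_from_r_squared(r_squared):
    """Correlation between fitted and actual values implied by an R-squared."""
    return math.sqrt(r_squared)


# R-squared values reported for the three-year and two-year equations.
for r_squared in (0.32, 0.24):
    print(r_squared, round(correlation_from_r_squared(r_squared), 2))
```

For the three-year and two-year equations this gives correlations of .57 and .49, while the variance explained stays at 32% and 24%.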
Looking at what to do with just one year of data, I made a similar equation again. There were obviously a lot more observations: 1035, to be exact. But it also was much less reliable, since it used only one previous year of data. The natural log of contact rate was no longer statistically significant at all (the p-value when I did include it was something like 56% or something similarly useless). It seems that with one year of data, it is slightly more advantageous to use the percentage of balls in play that were flyballs and went for hits, rather than the percentage of flyballs that went for hits (but the results were similar). I also included triples/at-bat, which captured some of the effect of speed, and therefore canceled out some of the effect of the percentage of infield groundballs that a hitter reached safely on. The equation came out as follows:
| Variable | Coefficient (Standard Deviation) |
| LD% | 0.192 (0.0317) |
| GB% | 0.120 (0.0185) |
| GB-BABIP | 0.0828 (0.0242) |
| IFFB% | -0.166 (0.0216) |
| (FB%)*(FB-BABIP) | 0.127 (0.0535) |
| LN(HR/AB) | 0.00576 (0.00157) |
| Triples/At-bat | 0.527 (0.177) |
| Reach Safely/Infield Groundball | 0.0786 (0.0256) |
| _Constant | 0.209 (0.0140) |
This equation had an R-squared of .19, meaning that 19% of the variance in BABIP can be explained by these variables (a correlation of about .44 between projected and actual BABIP). Most of these coefficients are similar to the ones for the regression equations with more years of historical data, but GB% had a higher coefficient. Presumably, this is due to its correlation with groundball rate. Once again, all of these variables were strongly statistically significant, except for the percentage of balls in play that were flyballs and hits, which was statistically significant but had a p-value of 1.8%.
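The significance labels used in this article can be roughly reproduced from a coefficient and the figure in parentheses. This is a sketch under two assumptions: that the parenthesized value is the coefficient's standard error, and that a normal approximation is close enough with about a thousand observations. The function names are hypothetical.

```python
import math


def two_sided_p_value(coefficient, standard_error):
    """Two-sided p-value for a coefficient, using a normal approximation."""
    z = coefficient / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def significance_label(p_value):
    """Map a p-value onto the thresholds used in the tables."""
    if p_value < 0.01:
        return "strongly statistically significant"
    if p_value < 0.05:
        return "statistically significant"
    if p_value < 0.10:
        return "weakly statistically significant"
    return "not statistically significant"


# (FB%)*(FB-BABIP) from the one-year table: 0.127 (0.0535).
p = two_sided_p_value(0.127, 0.0535)
print(round(p, 3), significance_label(p))
```

For the flyball term this comes out at about 1.8%, in line with the p-value quoted above.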
PLAYERS
Using the equation for three years of historical data when the player had 300 PA in each of 2006-2008, or if not, the equation for two years of historical data when the player had 300 PA in 2007-2008, and if not, the equation for one year of historical data when a player had 300 PA in 2008, I was able to project the BABIP for 277 players. The table further down summarizes what all of those are. Obviously, the 2009 season is already underway, but I have not incorporated any 2009 data into my regression results. As one might expect, guys like Derek Jeter, Matt Kemp, Chipper Jones, and Joe Mauer topped the list of projected BABIP and guys like Mark Ellis, Craig Counsell, Khalil Greene, Jeff Mathis, and Omar Vizquel trailed.
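The rule for picking an equation can be written as a short fallback. A minimal sketch, with a hypothetical function name, where the input is the set of seasons in which a player reached 300 PA:

```python
def choose_equation(qualifying_seasons, latest_season=2008):
    """Return how many years of history to use (3, 2, or 1), or None if
    the player did not reach 300 PA in the most recent season."""
    seasons = set(qualifying_seasons)
    for years in (3, 2, 1):
        # The equation needs the most recent `years` seasons, all qualifying.
        needed = set(range(latest_season - years + 1, latest_season + 1))
        if needed <= seasons:
            return years
    return None


print(choose_equation({2006, 2007, 2008}))  # 3
print(choose_equation({2005, 2007, 2008}))  # 2
print(choose_equation({2006, 2008}))        # 1
print(choose_equation({2006, 2007}))        # None
```

A player who missed 300 PA in 2008 gets no projection at all under this rule, however much earlier history he has.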
Before showing the entire table, here are the top 10 guys
projected to improve their BABIP this year and the top 10 guys projected to
fall:
| name | babip08 | ebabip09 | Diff. |
| Corey | 0.215 | 0.285 | 0.071 |
| Jose | 0.243 | 0.303 | 0.059 |
| Luis | 0.267 | 0.324 | 0.057 |
| Kenji | 0.232 | 0.287 | 0.056 |
| Carlos | 0.237 | 0.288 | 0.051 |
| Austin | 0.250 | 0.299 | 0.049 |
| Paul | 0.244 | 0.290 | 0.046 |
| Geoff | 0.242 | 0.287 | 0.045 |
| Brandon | 0.244 | 0.284 | 0.040 |
| Gary | 0.237 | 0.276 | 0.040 |
| name | babip08 | ebabip09 | Diff. |
| Milton | 0.388 | 0.322 | -0.066 |
| Ian | 0.362 | 0.298 | -0.064 |
| Kelly | 0.357 | 0.301 | -0.056 |
| Nick | 0.335 | 0.281 | -0.054 |
| Mike | 0.357 | 0.305 | -0.052 |
| Ray | 0.345 | 0.294 | -0.050 |
| Manny | 0.370 | 0.322 | -0.048 |
| Reed | 0.360 | 0.313 | -0.047 |
| Shin-Soo | 0.367 | 0.322 | -0.045 |
| Ryan | 0.342 | 0.297 | -0.044 |
Unsurprisingly, most of the guys are ones that you would expect, guys with very low or very high BABIPs in 2008, but there are still a few interesting ones. Luis Castillo should actually be a high BABIP guy (.324, apparently), but hit only .267 in 2008. That is presumably due to the injuries he was suffering from last year. In fact, he apparently has a .362 BABIP so far this year according to Fangraphs. Nick Punto hit .335 on balls in play in 2008, but apparently he projects to be a low BABIP type and should hit around .281. Indeed, his 2009 BABIP thus far is only .234.
Without further ado, the projected BABIPs for 2009:
| name | Ebabip09 |
| Derek | 0.362 |
| Matt Kemp | 0.346 |
| Chipper | 0.346 |
| Joe Mauer | 0.345 |
| Michael | 0.342 |
| Fred | 0.341 |
| Matt | 0.341 |
| Denard | 0.341 |
| Ichiro | 0.338 |
| Yunel | 0.338 |
| Andre | 0.335 |
| Bobby | 0.332 |
| Josh | 0.329 |
| Kevin | 0.329 |
| Edgar | 0.329 |
| Nick | 0.329 |
| Jayson | 0.329 |
| Jeff | 0.328 |
| Joe | 0.328 |
| Carl | 0.327 |
| Howie | 0.327 |
| Marlon | 0.327 |
| David | 0.326 |
| Fernando | 0.326 |
| Magglio | 0.325 |
| Placido | 0.324 |
| Luis | 0.324 |
| Miguel | 0.324 |
| Hanley | 0.324 |
| Jamey | 0.324 |
| Ivan | 0.323 |
| Skip | 0.323 |
| Orlando Hudson | 0.323 |
| Manny | 0.322 |
| Chase | 0.322 |
| Milton | 0.322 |
| Shin-Soo | 0.322 |
| Robinson | 0.321 |
| Jhonny | 0.321 |
| Joey | 0.320 |
| Mark | 0.320 |
| Curtis | 0.320 |
| Kelly | 0.320 |
| Hunter | 0.320 |
| Mark | 0.320 |
| Felipe | 0.320 |
| Aaron | 0.319 |
| Mark | 0.319 |
| B.J. | 0.319 |
| Paul Bako | 0.318 |
| Randy | 0.318 |
| Gary | 0.318 |
| Brian | 0.318 |
| Ryan | 0.317 |
| Corey | 0.317 |
| Miguel | 0.316 |
| Kosuke | 0.316 |
| Elijah | 0.316 |
| Vladimir | 0.315 |
| Alex Rios | 0.315 |
| Freddy | 0.315 |
| Casey | 0.314 |
| Ryan | 0.314 |
| Cristian | 0.314 |
| Lyle Overbay | 0.314 |
| Maicer | 0.314 |
| Jose | 0.314 |
| Carlos | 0.314 |
| Mark | 0.313 |
| Darin | 0.313 |
| Justin | 0.313 |
| Albert | 0.313 |
| Delmon | 0.313 |
| Jody | 0.313 |
| Reed | 0.313 |
| Jason | 0.313 |
| Edgar | 0.313 |
| Garrett | 0.312 |
| Ramon | 0.312 |
| Jeremy | 0.312 |
| Michael | 0.311 |
| Chone | 0.311 |
| Jeff Kent | 0.311 |
| Jose | 0.311 |
| Juan | 0.311 |
| Chris | 0.311 |
| David DeJesus | 0.310 |
| Johnny | 0.310 |
| Ryan | 0.310 |
| J.D. Drew | 0.310 |
| Ryan | 0.310 |
| Omar | 0.310 |
| Jose | 0.309 |
| Brandon | 0.309 |
| Torii | 0.309 |
| Ryan | 0.309 |
| James | 0.309 |
| Dustin | 0.309 |
| Brad | 0.309 |
| Alex | 0.309 |
| Ryan | 0.308 |
| Chase | 0.308 |
| Julio | 0.308 |
| Kaz | 0.308 |
| Jack Cust | 0.308 |
| Justin | 0.308 |
| Jimmy | 0.308 |
| Ronnie | 0.308 |
| Joey | 0.307 |
| Brendan | 0.307 |
| Adrian | 0.307 |
| Mark | 0.307 |
| Adam | 0.306 |
| Geovany | 0.306 |
| Dan Uggla | 0.306 |
| Xavier | 0.305 |
| Josh | 0.305 |
| Todd | 0.305 |
| Mike | 0.305 |
| Erick | 0.305 |
| Gabe | 0.304 |
| Ty | 0.304 |
| Geoff | 0.304 |
| Conor | 0.303 |
| Shane | 0.303 |
| Jed | 0.303 |
| Chris | 0.303 |
| Grady | 0.303 |
| Jose | 0.303 |
| Lance | 0.302 |
| Rickie | 0.302 |
| Damion | 0.302 |
| Willie | 0.302 |
| Jason | 0.302 |
| Evan | 0.302 |
| Mark | 0.302 |
| Mike | 0.302 |
| Alfonso | 0.302 |
| Akinori | 0.302 |
| Russell | 0.301 |
| Adam | 0.301 |
| Ryan | 0.301 |
| Aaron | 0.301 |
| Franklin | 0.301 |
| Marco | 0.301 |
| Kelly | 0.301 |
| Kevin | 0.300 |
| Aramis | 0.300 |
| Hideki | 0.300 |
| Carlos | 0.300 |
| Raul | 0.300 |
| Mike | 0.300 |
| Jacoby | 0.300 |
| Ross | 0.300 |
| Coco Crisp | 0.300 |
| Brandon | 0.299 |
| Austin | 0.299 |
| Aubrey | 0.299 |
| Alexei | 0.299 |
| Clint | 0.299 |
| Adam Lind | 0.298 |
| Cesar | 0.298 |
| Cody Ross | 0.298 |
| Ian | 0.298 |
| Melvin | 0.298 |
| Ryan | 0.298 |
| Adam | 0.298 |
| Ryan | 0.297 |
| Brian | 0.297 |
| Brian | 0.297 |
| Lastings | 0.297 |
| Garret Anderson | 0.297 |
| Willy | 0.296 |
| Adrian | 0.296 |
| Blake DeWitt | 0.296 |
| Jeff | 0.296 |
| Kurt | 0.296 |
| David | 0.296 |
| Jose | 0.295 |
| Jay | 0.295 |
| Troy | 0.295 |
| Matt | 0.295 |
| Carlos | 0.295 |
| Vernon Wells | 0.295 |
| Billy | 0.295 |
| Gregor | 0.295 |
| J.J. | 0.295 |
| Casey | 0.295 |
| Jay Bruce | 0.295 |
| Bengie | 0.294 |
| Ray | 0.294 |
| Asdrubal | 0.294 |
| Chris | 0.294 |
| David | 0.294 |
| Alex | 0.293 |
| David | 0.293 |
| Ian | 0.293 |
| David | 0.293 |
| Ben | 0.293 |
| Rich | 0.293 |
| Jeremy | 0.293 |
| Tadahito | 0.293 |
| Melky | 0.292 |
| John | 0.292 |
| Nate | 0.292 |
| Rick | 0.292 |
| Carlos | 0.291 |
| Eric | 0.291 |
| Doug | 0.291 |
| Ramon | 0.291 |
| Carlos | 0.290 |
| Orlando Cabrera | 0.290 |
| Paul | 0.290 |
| Yuniesky | 0.290 |
| Jim Thome | 0.290 |
| Stephen | 0.289 |
| Daric | 0.289 |
| Bill Hall | 0.288 |
| Carlos | 0.288 |
| Jason | 0.288 |
| Kenji | 0.287 |
| Scott | 0.287 |
| Geoff | 0.287 |
| Alexi | 0.287 |
| A.J. | 0.287 |
| Jorge | 0.287 |
| Richie | 0.286 |
| Jeff | 0.286 |
| Prince | 0.286 |
| Luke | 0.286 |
| Jack | 0.286 |
| Jim | 0.286 |
| Jason | 0.285 |
| Corey | 0.285 |
| Carlos | 0.285 |
| Gerald | 0.285 |
| Alfredo | 0.285 |
| Scott | 0.285 |
| Brandon | 0.284 |
| Ken | 0.283 |
| Willy | 0.283 |
| Edwin | 0.283 |
| Carlos | 0.283 |
| Jack | 0.282 |
| Yadier | 0.282 |
| Jose | 0.282 |
| Troy | 0.282 |
| Chris | 0.282 |
| Nick | 0.281 |
| Jason | 0.281 |
| Brian | 0.280 |
| Rod | 0.279 |
| Emil | 0.279 |
| Jesus | 0.278 |
| Luis | 0.278 |
| Adam Dunn | 0.278 |
| Dioner | 0.278 |
| Juan | 0.277 |
| Chris | 0.277 |
| Gary | 0.276 |
| Mike | 0.276 |
| Miguel | 0.275 |
| Marcus | 0.275 |
| John Buck | 0.275 |
| Pat | 0.274 |
| Bobby | 0.273 |
| Brad | 0.272 |
| Pedro | 0.271 |
| Jason | 0.271 |
| Nick | 0.271 |
| Joe Crede | 0.270 |
| Kevin | 0.270 |
| Khalil | 0.269 |
| Mark | 0.265 |
| Craig | 0.263 |
| Jeff | 0.256 |
| Omar | 0.253 |
In doing my previous article for StatSpeak on projection systems and their ability to project various statistics, I realized that many of them were not especially good at projecting BABIP. ZiPS has a tendency to project hitters to have very extreme BABIPs that are unlikely to occur. PECOTA has a tendency to project speedy hitters to have the same high BABIPs that speedy hitters used to have before scouting data became so advanced (as PECOTA uses historical comparables), and CHONE safely projects hitters towards the mean, but all of the systems I studied had correlations with true BABIP of about .40-.44. Even for those players with one year of data, the correlation I found was around .44, and for those hitters with three or more years of historical data, my projected BABIP had a correlation with true BABIP of .57. These systems do incredibly well at projecting the three true outcomes, but 70.3% of plate appearances in 2008 resulted in an outcome other than a walk, strikeout, homerun, or hit by pitch. As far as I know, none of the major projection systems use batted ball data for hitters to project statistics. These systems are getting very good, and as Tom Tango has pointed out multiple times, the best systems only do slightly better than Marcel the Monkey. I strongly believe that the way to improve projection is to incorporate the variables that I have used above to project BABIP in isolation.
Couldn’t agree with you more. I am trying to use GB/FB/LD data in predicting BABIP for my projection system, but it’s really rudimentary right now:
http://fantasyscope.wordpress.com/2008/11/13/2009-fantasyscope-early-projections/
As an approximation, I average out historical BABIP with projected based on that data. Please see my post on BABIP estimation:
http://saberrattling.wordpress.com/2008/12/03/working-the-numbers-on-babip-estimation/
You’ve seen PrOPS, right? It wasn’t a projection system, per se (no attempt to incorporate age, no regression to the mean, etc), but rather a system to try to identify lucky/unlucky batters in a given year based on their batted ball statistics. It seemed to me to be a big step forward, and I still use it as a diagnostic for hitters. But as you said, “no one” has gone the next step and included batted ball data into a sophisticated projection system. -j
I’ve seen PrOPS before, but it seems like more of a “postdictive” rather than predictive system, right? It more or less predicts what BABIP should have been in the past year based on GB/FB and LD% that occurred in that year, right?
That’s a little different only since this model was more or less predicting not only what GB%, FB%, and LD% were likely to be next year based on what they were the past few years, but also what BABIP on each of those batted balls. I think that people tend to ignore that power hitters have better averages on line drives because they hit them further, guys who pop up a lot tend to have lower averages on flyballs because infield flies are easier to catch, and faster players have better batting average on groundballs.
Your point still holds, and I probably should have held back the statement a little bit. What I meant was that you can improve projections of those systems like CHONE, PECOTA, OLIVER, MARCEL, and ZIPS by looking at BABIP projection using batted ball rates and BABIP by batted ball.
This is some heavy work, Matt, and I’m glad you put the hours into this valuable topic. Good job overall.
There is, however, one problem here. In a regression, you cannot include an interaction term like “(FB%)*(FB-BABIP)” without also including the individual components. You would need to add FB-BABIP.
Also, as a side note, when you report an r^2 of .19, it means the model can explain 19% of the variance, not 44% [the sqrt(.19)] as you have said here.
Finally, have you seen the work of Chris Dutton and Peter Bendix? You can find their work here:
http://www.hardballtimes.com/main/article/batters-and-babip/
Or, for what looks to be the full academic paper, here:
http://tangotiger.net/tufts/understandingBABIP.pdf
It would be interesting to see how your two (similar) approaches measure up to one another -and to other models – at the end of the year. I hope you will run an update for us!
I used the FB%*FB-BABIP term okay, I think, for the ones where I did use it. It is basically the percentage of balls in play on which a hitter gets a hit on a fly ball. The implication is that perhaps for smaller sample sizes like one year of data, it is best to directly consider how often that happens than to focus on what percent of balls in play are flyballs and what percent of those are hits.
You’re right about r^2. Just sloppy on my part. Thank you for pointing that out.
I have read that article. That article was retrospectively predicting BABIP, kind of like the LD%+.120 model that Studenmund developed a while back. It was not intended to predict BABIP in the future. It also had a few structural issues like regressing BABIP on pitches per extra base hit, which obviously is going to be negatively correlated because you’re using part of the numerator of the dependent variable in one of the denominators of an independent variable. It was a good start though.
I developed a model of predicting BABIP in the future on individual batted ball types over at another blog shortly after that:
http://www.thegoodphight.com/2009/1/16/726379/babip-projection-and-new-s
Dutton later tried one of those on his own, but used a lot of the same variables, as was explained in an article Derek Carty wrote over at THT.
http://www.hardballtimes.com/main/fantasy/article/whats-the-best-babip-estimator/
I later developed a couple other articles which topped that r^2 though. Here is the first in that series:
http://www.thegoodphight.com/2009/2/2/743228/improving-babip-estimation
I’m having trouble finding the second in that series; it was here on StatSpeak, but I can’t find the link. The article above was the third in that series, and it did not include some of the variables for the larger dataset I used here, so my r^2 fell a little for the set with fewer years, but the r^2 in general is going to be higher this way since there are more direct historical BABIPs rather than a focus on correlates. Theirs is useful if you don’t have many years of data but you have a lot of data about one year, but I still think it’s best to use BABIPs in previous years directly.