Improving BABIP Projection by Batted Ball Types


In my last article for StatSpeak, I tested the major projection systems’ abilities to accurately project a variety of statistics other than general production measures such as OPS or wOBA.  One of the statistics that I tested thoroughly was Batting Average on Balls In Play (BABIP).  The projection systems were much worse at projecting BABIP than other statistics, like homeruns, walks, and strikeouts.  Presumably, that is because hitters vary far more in their abilities to achieve or avoid those outcomes, while BABIP is based largely on luck.  However, hitters do vary in their abilities to hit safely on balls in play, and seeing as about 70% of plate appearances result in a ball in play (and not a homerun), it is important to project these.  As far as I know, none of the major projection systems use BABIP by batted ball type.  It is my belief that doing so would greatly improve projection overall.

In this article, I will continue my study of hitters’ BABIP, using more data than I previously had access to.  Not only are the results different, but I was able to test more variables than before since I had more observations.  In my last article, I used data on hitters who had 300 PA from 2005-2008 to build my data set.  In this article, I was able to incorporate detailed data on hitters who had 300 PA from 2003 to 2008, which allowed me to study far more observations.  In using three years of previous data to predict a hitter’s BABIP, 2005-2008 only allowed me to look at 121 hitters.  However, adding in the four-year ranges of 2003-2006 and 2004-2007 allowed me to use 381 hitters (or more accurately, 381 instances of a hitter getting 300 PA in four consecutive seasons, as some hitters were included more than once).  I was also able to extend my study of predicting BABIP on groundballs, line drives, and flyballs by using all of this additional data.

 

The regression lines changed in several ways.  For one thing, line drive percentage came out significant in projecting BABIP again.  It was surprising that it was so statistically insignificant in the 2005-2008 data, but it now seems that this is a persistent enough skill to be used in BABIP projection.

 

Another particularly nice thing about using more data is that you can minimize the effect of outliers.  For instance, the coefficient on the natural log of contact rate (the percentage of pitches a hitter swings at that he either puts in play or fouls off) was cut in half, and the reason is fascinating.  Coefficients in a multiple regression equation tell you how to adjust the expected dependent variable (in this case, BABIP) for one variable, given what all the other variables were.  In my regressions, I incorporated groundball percentage, BABIP on groundballs, BABIP on flyballs, and correlates of BABIP on line drives.  Additionally, I used the natural log of contact rate to help indicate which direction BABIP should be expected to go given a hitter’s ability to make contact, since the other coefficients served to regress BABIP back to the mean as far as historical tendencies indicated it should.

 

However, the strong negative coefficient on natural log of contact rate indicated that hitters with poor contact skills would see their BABIP fall more quickly.  That coefficient remains significant and negative.  But Ryan Howard screwed with my data when I only looked at 121 observations.  His BABIP fell dramatically after being strong in 2005-2007 (.358, .363, and .336), down to only .289 in 2008.  The regression wants to avoid the nasty error term that Howard would have left, and the ways that Howard differed from the rest of the league are how low his contact rate is and how high his homerun rate is.  Since the natural log formulation served to expand the difference between his contact rate and the rest of the league and to contract the difference between his homerun rate and the rest of the league, the model used contact rate as the cause of low BABIPs.  I realized this error when I began doing BABIP projection for many other players and found that guys like Adam Dunn and Mark Reynolds with poor contact rates would have BABIPs around .240 or lower by my equation, which clearly did not sound right.

 

The reason that Howard’s BABIP fell so much in 2008 is not that he is a poor contact hitter.  Being a Phillies fan, I know from observation that defenses became better at shifting against him.  While teams began to shift against him towards the end of 2006 and throughout 2007, they had not yet determined exactly where to position their players.  They then began playing the second baseman even deeper into right field and closer to the line than in other infield shifts, and this depressed his BABIP a lot.  Initially, I only used contact rate in my equation at all because it is correlated with a high BABIP on groundballs, but that is probably because hitters who do not center the bat on the ball often weakly tap balls off the bottom of the bat.  That does not describe Howard, who rips groundballs into the same predictable locations.  As a result of this and some bad luck, Howard saw his BABIP fall in 2008, and the model tried to account for that by treating his low contact rate as the cause.  This is a pretty obvious warning to me to use more data for regression equations.  After seeing that the coefficient was much smaller when I removed Howard from the regression, I thought it was smart to test how the model did with more data.



HOW DOES THE PREVIOUS MODEL HOLD UP?

 

The table below describes the coefficients for the regression equation for BABIP using the same independent variables as I did in my previous article.  Later on, I will improve this equation, and also provide equations for BABIP by batted ball type and for projecting BABIP using fewer than three years of data.

 

The values in the table include the coefficients, with the standard deviations in parentheses.  Variables with p-values less than 1% (strongly statistically significant) are bolded and italicized, variables with p-values between 1% and 5% (statistically significant) are bolded, and variables with p-values between 5% and 10% (weakly statistically significant) are italicized.  The first column is for the old data set and the second column is for the new, larger data set.  Note that each variable is the unweighted average of the previous three years of data for that variable.

 

 

| Variable | Small Dataset: Coefficient (Standard Deviation) | Large Dataset: Coefficient (Standard Deviation) |
|---|---|---|
| GB% | 0.934 (0.333) | 0.319 (0.182) |
| Groundball BABIP (GB-BABIP) | 1.48 (0.577) | 0.515 (0.311) |
| GB-HIT% (GB%*GB-BABIP) | -2.89 (1.32) | -0.596 (0.708) |
| Infield flies/flyball | -0.362 (0.073) | -0.336 (0.418) |
| OFHIT% (FB%*FB-BABIP) | 0.717 (0.264) | 0.243 (0.0135) |
| LN(HR/AB) | 0.0137 (0.00499) | 0.0119 (0.00332) |
| LN(CONTACT%) | 0.0161 (0.0449) | 0.00889 (0.0277) |
| _CONSTANT | -0.0808 (0.146) | 0.178 (0.0802) |
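As a rough illustration of the significance labels used for this table, the sketch below turns a coefficient and the value in parentheses into a two-sided p-value. It assumes the parenthesized values act as standard errors and uses a normal approximation rather than whatever test the original regression software applied; the function names are hypothetical.

```python
import math


def two_sided_p_value(coefficient, std_error):
    """Two-sided p-value for H0: coefficient == 0 (normal approximation)."""
    z = abs(coefficient / std_error)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def significance_label(p_value):
    """Map a p-value to the three significance tiers described in the text."""
    if p_value < 0.01:
        return "strongly significant"
    if p_value < 0.05:
        return "significant"
    if p_value < 0.10:
        return "weakly significant"
    return "not significant"


# GB% in the large data set: coefficient 0.319, parenthesized value 0.182.
p = two_sided_p_value(0.319, 0.182)
print(round(p, 3), significance_label(p))
```

With several hundred observations the normal approximation should sit very close to a t-test, but borderline cases could still land in a different tier.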

 

Look how different these coefficients are!  It seems that the interaction term between groundball percentage and groundball BABIP is no longer significant at all, and we will see that I remove it from my later equation.  Leaving it in the equation actually messes with the coefficients on the two variables themselves.  The interaction term still came out negative, but not nearly as much as before.  The coefficients on infield flyball percentage and on natural log of homerun rate remained pretty similar.  The percentage of balls contacted that were flyballs that fell in for hits in the outfield is only weakly statistically significant now, and it turns out that it is no longer as good a measure as before, with a coefficient less than half its previous size.  As I mentioned, the natural log of contact rate remains strongly statistically significant, but not nearly as much as before.

 

The R-squared fell from a preposterously high .39 last time
down to .28 this time (meaning that actual BABIP correlated with projected
BABIP by 63% last time and only 53% this time). 
That is certainly still respectable, but it seems that the high
correlation before might have come from fitting an equation to match up well
with only 121 observations.
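The conversion between R-squared and correlation used here is a square root: for a least-squares fit with an intercept, the correlation between fitted and actual values equals the square root of R-squared. A minimal check, with a hypothetical helper name (the figures in the text are rounded, so the last digit may differ slightly):

```python
import math


def correlation_from_r_squared(r_squared):
    """Correlation between fitted and actual values implied by an OLS R-squared."""
    return math.sqrt(r_squared)


for label, r2 in [("small data set", 0.39), ("large data set", 0.28)]:
    r = correlation_from_r_squared(r2)
    print(label, "R-squared =", r2, "correlation =", round(r, 2))
```

The distinction matters when reading the numbers: R-squared is the share of variance explained, while its square root is the correlation.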

 

Realizing that this equation looked nowhere near as strong as it did before, I set out to construct an improved equation.  When I first began studying BABIP, I started by looking at team-level data to consider the persistence of BABIP by batted ball type, and then moved on to projecting BABIP on groundballs, flyballs, and line drives.  That provided some insight into BABIP overall as a function of some of the regressors used to project BABIP for batted ball types.  I decided to redo this analysis with the larger quantity of data.

 

FLYBALLS

 

I began with flyballs.  Hitters who do well on flyballs do so primarily by avoiding infield flies.  On top of that, they also tend to hit their flyballs over a larger area of the field: they spread the ball around well from left to right and consistently hit it deep enough that their flyballs land all over the outfield.  In my previous study using only 2005-2008 data, I only found infield fly rate to be statistically significant, but this time, I found that outfield flyball BABIP was strongly statistically significant as well.  Additionally, switch hitters did incredibly well on flyball BABIP.  As before, the variables referred to below are the unweighted averages of the previous years of data for each variable.  The regression for projecting flyball BABIP comes out looking like this:

 

| Variable | Coefficient (Standard Deviation) |
|---|---|
| Outfield flyball BABIP | 0.196 (0.0629) |
| Infield flies/flyball | -0.331 (0.0554) |
| Switch hitter (dummy variable) | 0.0136 (0.00474) |
| _Constant | 0.130 (0.014) |

 

Each of these three variables came out strongly statistically significant.  It is surprising that switch hitters improve their BABIP on flyballs by 14 points of batting average compared to other hitters with similar histories of flyball BABIP.  My current theory about why this is true is similar to the theory I had in my last article about why hitters with high groundball percentages tend to see their BABIP on groundballs fall further over time.  Scouts can get less information on where hitters are likely to hit the ball if those hitters have hit fewer balls in general.  Switch hitters have not hit as many flyballs from each side of the plate, and so positioning fielders against them is a more difficult task.  I do not know if this is why, but it could explain some of the effect.
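To make the switch-hitter effect concrete, here is a small sketch that applies the flyball regression coefficients to two otherwise identical hitters. The function name and the input values are made up for illustration, and it assumes the rate variables are expressed as fractions (0.10 rather than 10), which is not stated explicitly in the article.

```python
def project_flyball_babip(offb_babip, iffb_per_fb, switch_hitter):
    """Projected BABIP on flyballs from the flyball regression coefficients."""
    return (
        0.130                                        # _Constant
        + 0.196 * offb_babip                         # Outfield flyball BABIP
        - 0.331 * iffb_per_fb                        # Infield flies/flyball
        + 0.0136 * (1.0 if switch_hitter else 0.0)   # Switch hitter dummy
    )


# Two hypothetical hitters with identical histories, differing only in the dummy.
base = project_flyball_babip(0.18, 0.10, switch_hitter=False)
switch = project_flyball_babip(0.18, 0.10, switch_hitter=True)
print(round(base, 4), round(switch, 4), round(switch - base, 4))
```

Because the dummy enters additively, the gap between the two projections is the dummy coefficient itself, whatever the other inputs are.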

 

GROUNDBALLS

 

Last time I tested groundball BABIP (GB-BABIP), I found that it could be predicted well using historical GB-BABIP, infield hit rate, and contact rate.  This time, I was able to find many other significant regressors, as I had more data to work with.  Historical GB-BABIP again explained a large portion of the variance in GB-BABIP in subsequent years.  I refined my infield hit rate statistic, as I have in other research, by summing infield hits and the number of times a hitter reached on an error on a groundball, as this eliminates some of the noisy measurement error that arises when official scorers decide whether to score a play an infield hit or an error.  I also included the number of triples per at-bat as a proxy for speed, which came out quite significant as well.

 

Faster hitters reach base on groundballs for several
reasons.  One is that they can beat out
infield hits, but another is that fielders must play in to give themselves a
better chance of throwing the runner out in time, so they can squeeze more
balls in the hole and cause more errors.

 

Additionally, the contact rate on pitches in the strike zone proved to be a better correlate of GB-BABIP (though it is now not quite statistically significant), as contact with pitches outside the strike zone often causes groundballs to be weakly hit.  I also introduced the rates at which a hitter swings at pitches in and out of the strike zone, which have positive and negative effects, respectively.  The coefficient on swinging at pitches out of the strike zone is not statistically significant, but I include it because its correlation with swinging at strikes is high enough that leaving it out makes the effect of swinging at more pitches in the strike zone statistically insignificant.  The regression yielded an R-squared of .22, which means that these variables explain 22% of the variance in GB-BABIP (a correlation of about .47 between projected and actual GB-BABIP).

 

Here are the regression coefficients with standard
deviations.

 

| Variable | Coefficient (Standard Deviation) |
|---|---|
| GB% | 0.0739 (0.0357) |
| GB-BABIP | 0.420 (0.0809) |
| Reach safely/Infield GB | 0.194 (0.0936) |
| Z-Contact% | 0.0751 (0.0509) |
| Triples/At-bat | 0.846 (0.471) |
| Z-Swing% | 0.0801 (0.0441) |
| O-Swing% | -0.0546 (0.0450) |
| _Constant | -0.0240 (0.0616) |

 

It seems that there is an ability to hit balls through the
hole, but a larger component seems to be ability to reach safely on balls hit
in the infield (which is pretty much based on speed).  Good strike zone judgment seems to avoid
chopping balls into the ground.  Hitters
who hit more groundballs actually are more likely to keep up their groundball
BABIP, surprisingly enough.

 

LINE DRIVES

 

As I have mentioned before, BABIP on line drives is mostly related to luck.  There is clearly a positive correlation between BABIP on line drives and power.  This is especially true for the ability to hit line drive doubles.  Hitters who hit a lot of line drive doubles are more likely to hit a lot of homeruns than other hitters with similar homerun tendencies historically, but this effect also works the other way.  Hitters who hit a lot of homeruns have far higher BABIP on line drives.  In my previous research, hitters who had high BABIPs on line drives in previous seasons were not any more likely than other hitters with similar homerun rates in previous seasons to have high BABIPs on line drives in subsequent seasons, but with more data there seems to be a positive (though not statistically significant) effect of previously high LD-BABIPs as well as previously high LD%.  Switch hitters also did extraordinarily well on line drives.  Here is that regression output:

 

| Variable | Coefficient (Standard Deviation) |
|---|---|
| LN(HR/AB) | 0.0220 (0.00526) |
| LD-BABIP | 0.121 (0.0845) |
| LD% | 0.237 (0.149) |
| Switch hitter | 0.0168 (0.00778) |
| _Constant | 0.658 (0.751) |

 

 

PUTTING IT ALL TOGETHER

 

Using all of this information, I began putting together a new, improved equation for projecting BABIP using three previous years of data.  This equation included line drive rate, in addition to groundball rate, which now came out as statistically significant thanks to more data.  It included GB-BABIP, but the interaction term between GB% and GB-BABIP was no longer statistically significant at all, so I left it out.  The percentage of times reached safely on groundballs in the infield, (Infield Hits + Groundball Reach on Errors) / (Groundballs – Groundball Hits to the Outfield), was strongly statistically significant as well.
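The reach-safely rate defined above is simple to compute from counting stats. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def reach_safely_rate(infield_hits, gb_reached_on_error, groundballs, gb_hits_to_outfield):
    """(Infield hits + groundball reach-on-errors) / (groundballs - groundball hits to the outfield)."""
    infield_groundballs = groundballs - gb_hits_to_outfield
    if infield_groundballs <= 0:
        raise ValueError("hitter has no groundballs that stayed in the infield")
    return (infield_hits + gb_reached_on_error) / infield_groundballs


# Made-up season line: 250 groundballs, 40 of them hits through to the outfield,
# 20 infield hits and 5 times reaching on an error.
rate = reach_safely_rate(infield_hits=20, gb_reached_on_error=5,
                         groundballs=250, gb_hits_to_outfield=40)
print(round(rate, 3))
```

Counting errors alongside infield hits means the rate does not depend on how an official scorer ruled on a close play.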

 

Additionally, while infield flies/flyball remained
significant, strongly negative, and pretty close to the same value as the
previous equation, it seemed much more helpful to use outfield flyball BABIP
(OFFB-BABIP) rather than the percentage of balls in play that were hits to the
outfield.  

 

Natural log of homerun rate and natural log of contact rate
remained strongly statistically significant, but log of contact rate is now
less than half as large due to the neutralization of the Ryan Howard effect.

 

This equation had an R-squared of .32, which means that it explains 32% of the variance in BABIP (a correlation of about .57 between projected and actual BABIP).  As I have mentioned before, the binomial distribution suggests that pure chance should account for at least a third of the variance, as the average hitter with 300 PA has about 400 balls in play each year.
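The binomial argument can be sketched as follows. The .300 true BABIP and the observed spread across hitters are placeholders I have assumed for illustration, not figures from the article; only the 400 balls in play comes from the text.

```python
import math


def binomial_babip_sd(true_babip, balls_in_play):
    """Standard deviation of observed BABIP from chance alone (binomial model)."""
    return math.sqrt(true_babip * (1.0 - true_babip) / balls_in_play)


def share_of_variance_from_chance(true_babip, balls_in_play, observed_sd):
    """Fraction of the observed variance in BABIP attributable to binomial noise."""
    noise_variance = binomial_babip_sd(true_babip, balls_in_play) ** 2
    return noise_variance / observed_sd ** 2


# 400 balls in play follows the text; 0.300 and observed_sd=0.035 are assumed.
noise_sd = binomial_babip_sd(0.300, 400)
share = share_of_variance_from_chance(0.300, 400, observed_sd=0.035)
print(round(noise_sd, 4), round(share, 2))
```

Whatever share of the variance is pure chance caps the R-squared any projection could reach, so the actual observed spread in the data determines how close .32 is to that ceiling.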

 

Here is the equation to use with three years of 300 PA or
more available.  Note that all eight of
the independent variables were strongly statistically significant, producing
p-values under 1%.

 

| Variable | Coefficient (Standard Deviation) |
|---|---|
| LD% | 0.293 (0.0779) |
| GB% | 0.167 (0.0311) |
| GB-BABIP | 0.183 (0.0556) |
| IFFB% | -0.268 (0.0440) |
| OFFB-BABIP | 0.111 (0.0473) |
| LN(HR/AB) | 0.0164 (0.00337) |
| LN(CONTACT%) | 0.0767 (0.0270) |
| Reach Safely/Infield Groundball | 0.171 (0.0659) |
| _Constant | 0.192 (0.027) |
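As a sketch of how the three-year equation could be applied, the function below plugs a hitter's averages into the coefficients from the table. Several things here are assumptions rather than details confirmed by the article: that all rates are fractions (0.20 rather than 20), that IFFB% means infield flies per flyball, and that the logs are taken of the three-year average rates. The function name and the example hitter are made up.

```python
import math

# Coefficients copied from the three-year table.
THREE_YEAR_COEFFICIENTS = {
    "ld_pct": 0.293,
    "gb_pct": 0.167,
    "gb_babip": 0.183,
    "iffb_pct": -0.268,
    "offb_babip": 0.111,
    "ln_hr_per_ab": 0.0164,
    "ln_contact_pct": 0.0767,
    "reach_safely": 0.171,
}
THREE_YEAR_CONSTANT = 0.192


def project_babip_three_year(ld_pct, gb_pct, gb_babip, iffb_pct, offb_babip,
                             hr_per_ab, contact_pct, reach_safely):
    """Projected BABIP from unweighted three-year averages, all given as fractions."""
    # math.log raises ValueError for a zero rate, so hr_per_ab must be positive.
    inputs = {
        "ld_pct": ld_pct,
        "gb_pct": gb_pct,
        "gb_babip": gb_babip,
        "iffb_pct": iffb_pct,
        "offb_babip": offb_babip,
        "ln_hr_per_ab": math.log(hr_per_ab),
        "ln_contact_pct": math.log(contact_pct),
        "reach_safely": reach_safely,
    }
    return THREE_YEAR_CONSTANT + sum(
        THREE_YEAR_COEFFICIENTS[name] * value for name, value in inputs.items()
    )


# A made-up hitter, not a real player line.
example = project_babip_three_year(
    ld_pct=0.20, gb_pct=0.44, gb_babip=0.24, iffb_pct=0.10,
    offb_babip=0.18, hr_per_ab=0.03, contact_pct=0.80, reach_safely=0.10,
)
print(round(example, 3))
```

With inputs in this range the projection comes out near .300, which suggests (without proving) that the fraction scaling is the intended one.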

 

 

This would help you project 155 major leaguers for the 2009 season.  However, many players who will get a lot of major league at-bats in 2009 failed to get 300 PA in 2006.  In the next section, I will provide estimates for the regression equations using two previous years of data, and using one previous year of data.

 

FEWER YEARS OF DATA TO WORK WITH

 

Using only two previous years of data left very similar coefficients, but regressed almost all of the coefficients more towards zero.  They all remain statistically significant, and all but natural log of contact rate remain strongly statistically significant.  The coefficient on the percentage of infield groundballs that a hitter reaches safely on actually became larger (the only variable to do so).  I imagine that this is because the product of speed from three years ago may provide some insight into how speedy a player is now, but less so than more recent data.  Players just beginning their careers may see larger effects of this on BABIP.  Here is the output using two previous years of data:

 

 

| Variable | Coefficient (Standard Deviation) |
|---|---|
| LD% | 0.259 (0.0563) |
| GB% | 0.106 (0.0241) |
| GB-BABIP | 0.113 (0.0410) |
| IFFB% | -0.226 (0.0322) |
| OFFB-BABIP | 0.0880 (0.0332) |
| LN(HR/AB) | 0.00975 (0.00248) |
| LN(CONTACT%) | 0.0399 (0.0201) |
| Reach Safely/Infield Groundball | 0.150 (0.0412) |
| _Constant | 0.211 (0.0201) |

 

This equation had an R-squared of .24, meaning that 24% of the variance in BABIP can be explained by just two years of data on these variables (a correlation of about .49 between projected and actual BABIP).  This equation used 635 observations.

 

Looking at what to do with just one year of data, I made a similar equation again.  There were obviously a lot more observations: 1035, to be exact.  But the equation was also much less reliable, since it used only one previous year of data.  The natural log of contact rate was no longer statistically significant at all (the p-value when I did include it was something like 56%, or something similarly useless).  It seems that with one year of data, it is slightly more advantageous to use the percentage of balls in play that were flyballs and went for hits, rather than the percentage of flyballs that went for hits (but the results were similar).  I also included triples per at-bat, which captured some of the effect of speed, and therefore canceled out some of the effect of the percentage of infield groundballs that a hitter reached safely on.  The equation came out as follows:

 

 

| Variable | Coefficient (Standard Deviation) |
|---|---|
| LD% | 0.192 (0.0317) |
| GB% | 0.120 (0.0185) |
| GB-BABIP | 0.0828 (0.0242) |
| IFFB% | -0.166 (0.0216) |
| (FB%)*(FB-BABIP) | 0.127 (0.0535) |
| LN(HR/AB) | 0.00576 (0.00157) |
| Triples/At-bat | 0.527 (0.177) |
| Reach Safely/Infield Groundball | 0.0786 (0.0256) |
| _Constant | 0.209 (0.0140) |

 

This equation had an R-squared of .19, meaning that 19% of the variance in BABIP can be explained by these variables (a correlation of about .44 between projected and actual BABIP).  Most of the coefficients are similar to the ones for the regression equations with more years of historical data, but GB% had a higher coefficient.  Presumably, this is due to its correlation with groundball rate.  Once again, all of these variables were strongly statistically significant, except for the percentage of balls in play that were flyballs and hits, which was statistically significant but had a p-value of 1.8%.

 

PLAYERS

 

Using the equation for three years of historical data when the player had 300 PA in 2006-2008, or if not, the equation for two years of historical data when the player had 300 PA in 2007-2008, and if not, the equation for one year of historical data when a player had 300 PA in 2008, I was able to project the BABIP for 277 players.  The table further down lists all of them.  The 2009 season is already underway, but I have not incorporated any 2009 data into my regression results.  As one might expect, guys like Derek Jeter, Matt Kemp, Chipper Jones, and Joe Mauer topped the list of projected BABIP, and guys like Mark Ellis, Craig Counsell, Khalil Greene, Jeff Mathis, and Omar Vizquel trailed.
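The rule for choosing an equation can be sketched as a count of consecutive qualifying seasons, working backwards from 2008. The function name and the dictionary layout are hypothetical, and treating "had 300 PA" as "at least 300 PA" is my reading of the text.

```python
def years_of_history(pa_by_season, min_pa=300, latest_season=2008):
    """Return 3, 2, 1, or 0: consecutive qualifying seasons ending at latest_season."""
    years = 0
    for season in (latest_season, latest_season - 1, latest_season - 2):
        if pa_by_season.get(season, 0) >= min_pa:
            years += 1
        else:
            break  # a gap ends the run, so older seasons are ignored
    return years


# Made-up plate appearance totals keyed by season.
print(years_of_history({2006: 650, 2007: 640, 2008: 700}))  # three-year equation
print(years_of_history({2006: 500, 2007: 120, 2008: 610}))  # one-year equation
print(years_of_history({2007: 450}))                        # no projection
```

A player who returns 0 has no qualifying 2008 season and falls outside the 277 projected players.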

 

Before showing the entire table, here are the top 10 guys
projected to improve their BABIP this year and the top 10 guys projected to
fall:

 

| name | babip08 | ebabip09 | Diff. |
|---|---|---|---|
| Corey Patterson | 0.215 | 0.285 | 0.071 |
| Jose Vidro | 0.243 | 0.303 | 0.059 |
| Luis Castillo | 0.267 | 0.324 | 0.057 |
| Kenji Johjima | 0.232 | 0.287 | 0.056 |
| Carlos Ruiz | 0.237 | 0.288 | 0.051 |
| Austin Kearns | 0.250 | 0.299 | 0.049 |
| Paul Konerko | 0.244 | 0.290 | 0.046 |
| Geoff Blum | 0.242 | 0.287 | 0.045 |
| Brandon Inge | 0.244 | 0.284 | 0.040 |
| Gary Sheffield | 0.237 | 0.276 | 0.040 |

 

 

| name | babip08 | ebabip09 | Diff. |
|---|---|---|---|
| Milton Bradley | 0.388 | 0.322 | -0.066 |
| Ian Stewart | 0.362 | 0.298 | -0.064 |
| Kelly Shoppach | 0.357 | 0.301 | -0.056 |
| Nick Punto | 0.335 | 0.281 | -0.054 |
| Mike Aviles | 0.357 | 0.305 | -0.052 |
| Ray Durham | 0.345 | 0.294 | -0.050 |
| Manny Ramirez | 0.370 | 0.322 | -0.048 |
| Reed Johnson | 0.360 | 0.313 | -0.047 |
| Shin-Soo Choo | 0.367 | 0.322 | -0.045 |
| Ryan Ludwick | 0.342 | 0.297 | -0.044 |

 

Unsurprisingly, most of the guys are ones that you would expect, guys with very low or very high BABIPs in 2008, but there are still a few interesting ones.  Luis Castillo should actually be a high BABIP guy (.324, apparently), but he hit only .267 on balls in play in 2008.  That is presumably due to the injuries he was suffering from last year.  In fact, he has a .362 BABIP so far this year according to Fangraphs.  Nick Punto hit .335 on balls in play in 2008, but he projects to be a low BABIP type and should hit around .281.  Indeed, his 2009 BABIP thus far is only .234.

 

Without further ado, the projected BABIPs for 2009:

 

| name | ebabip09 |
|---|---|
| Derek Jeter | 0.362 |
| Matt Kemp | 0.346 |
| Chipper Jones | 0.346 |
| Joe Mauer | 0.345 |
| Michael Young | 0.342 |
| Fred Lewis | 0.341 |
| Matt Holliday | 0.341 |
| Denard Span | 0.341 |
| Ichiro Suzuki | 0.338 |
| Yunel Escobar | 0.338 |
| Andre Ethier | 0.335 |
| Bobby Abreu | 0.332 |
| Josh Hamilton | 0.329 |
| Kevin Youkilis | 0.329 |
| Edgar Renteria | 0.329 |
| Nick Markakis | 0.329 |
| Jayson Werth | 0.329 |
| Jeff Baker | 0.328 |
| Joe Inglett | 0.328 |
| Carl Crawford | 0.327 |
| Howie Kendrick | 0.327 |
| Marlon Byrd | 0.327 |
| David Wright | 0.326 |
| Fernando Tatis | 0.326 |
| Magglio Ordonez | 0.325 |
| Placido Polanco | 0.324 |
| Luis Castillo | 0.324 |
| Miguel Tejada | 0.324 |
| Hanley Ramirez | 0.324 |
| Jamey Carroll | 0.324 |
| Ivan Rodriguez | 0.323 |
| Skip Schumaker | 0.323 |
| Orlando Hudson | 0.323 |
| Manny Ramirez | 0.322 |
| Chase Headley | 0.322 |
| Milton Bradley | 0.322 |
| Shin-Soo Choo | 0.322 |
| Robinson Cano | 0.321 |
| Jhonny Peralta | 0.321 |
| Joey Votto | 0.320 |
| Mark DeRosa | 0.320 |
| Curtis Granderson | 0.320 |
| Kelly Johnson | 0.320 |
| Hunter Pence | 0.320 |
| Mark Teahen | 0.320 |
| Felipe Lopez | 0.320 |
| Aaron Rowand | 0.319 |
| Mark Grudzielanek | 0.319 |
| B.J. Upton | 0.319 |
| Paul Bako | 0.318 |
| Randy Winn | 0.318 |
| Gary Matthews Jr. | 0.318 |
| Brian Roberts | 0.318 |
| Ryan Braun | 0.317 |
| Corey Hart | 0.317 |
| Miguel Cabrera | 0.316 |
| Kosuke Fukudome | 0.316 |
| Elijah Dukes | 0.316 |
| Vladimir Guerrero | 0.315 |
| Alex Rios | 0.315 |
| Freddy Sanchez | 0.315 |
| Casey Blake | 0.314 |
| Ryan Zimmerman | 0.314 |
| Cristian Guzman | 0.314 |
| Lyle Overbay | 0.314 |
| Maicer Izturis | 0.314 |
| Jose Reyes | 0.314 |
| Carlos Guillen | 0.314 |
| Mark Teixeira | 0.313 |
| Darin Erstad | 0.313 |
| Justin Upton | 0.313 |
| Albert Pujols | 0.313 |
| Delmon Young | 0.313 |
| Jody Gerut | 0.313 |
| Reed Johnson | 0.313 |
| Jason Kubel | 0.313 |
| Edgar Gonzalez | 0.313 |
| Garrett Atkins | 0.312 |
| Ramon Vazquez | 0.312 |
| Jeremy Hermida | 0.312 |
| Michael Bourn | 0.311 |
| Chone Figgins | 0.311 |
| Jeff Kent | 0.311 |
| Jose Guillen | 0.311 |
| Juan Pierre | 0.311 |
| Chris Davis | 0.311 |
| David DeJesus | 0.310 |
| Johnny Damon | 0.310 |
| Ryan Theriot | 0.310 |
| J.D. Drew | 0.310 |
| Ryan Howard | 0.310 |
| Omar Infante | 0.310 |
| Jose Castillo | 0.309 |
| Brandon Phillips | 0.309 |
| Torii Hunter | 0.309 |
| Ryan Sweeney | 0.309 |
| James Loney | 0.309 |
| Dustin Pedroia | 0.309 |
| Brad Hawpe | 0.309 |
| Alex Rodriguez | 0.309 |
| Ryan Church | 0.308 |
| Chase Utley | 0.308 |
| Julio Lugo | 0.308 |
| Kaz Matsui | 0.308 |
| Jack Cust | 0.308 |
| Justin Morneau | 0.308 |
| Jimmy Rollins | 0.308 |
| Ronnie Belliard | 0.308 |
| Joey Gathright | 0.307 |
| Brendan Harris | 0.307 |
| Adrian Gonzalez | 0.307 |
| Mark Kotsay | 0.307 |
| Adam Jones | 0.306 |
| Geovany Soto | 0.306 |
| Dan Uggla | 0.306 |
| Xavier Nady | 0.305 |
| Josh Willingham | 0.305 |
| Todd Helton | 0.305 |
| Mike Aviles | 0.305 |
| Erick Aybar | 0.305 |
| Gabe Gross | 0.304 |
| Ty Wigginton | 0.304 |
| Geoff Jenkins | 0.304 |
| Conor Jackson | 0.303 |
| Shane Victorino | 0.303 |
| Jed Lowrie | 0.303 |
| Chris Iannetta | 0.303 |
| Grady Sizemore | 0.303 |
| Jose Vidro | 0.303 |
| Lance Berkman | 0.302 |
| Rickie Weeks | 0.302 |
| Damion Easley | 0.302 |
| Willie Harris | 0.302 |
| Jason Bartlett | 0.302 |
| Evan Longoria | 0.302 |
| Mark Reynolds | 0.302 |
| Mike Cameron | 0.302 |
| Alfonso Soriano | 0.302 |
| Akinori Iwamura | 0.302 |
| Russell Martin | 0.301 |
| Adam LaRoche | 0.301 |
| Ryan Doumit | 0.301 |
| Aaron Miles | 0.301 |
| Franklin Gutierrez | 0.301 |
| Marco Scutaro | 0.301 |
| Kelly Shoppach | 0.301 |
| Kevin Kouzmanoff | 0.300 |
| Aramis Ramirez | 0.300 |
| Hideki Matsui | 0.300 |
| Carlos Gomez | 0.300 |
| Raul Ibanez | 0.300 |
| Mike Lowell | 0.300 |
| Jacoby Ellsbury | 0.300 |
| Ross Gload | 0.300 |
| Coco Crisp | 0.300 |
| Brandon Boggs | 0.299 |
| Austin Kearns | 0.299 |
| Aubrey Huff | 0.299 |
| Alexei Ramirez | 0.299 |
| Clint Barmes | 0.299 |
| Adam Lind | 0.298 |
| Cesar Izturis | 0.298 |
| Cody Ross | 0.298 |
| Ian Stewart | 0.298 |
| Melvin Mora | 0.298 |
| Ryan Garko | 0.298 |
| Adam Kennedy | 0.298 |
| Ryan Ludwick | 0.297 |
| Brian Giles | 0.297 |
| Brian McCann | 0.297 |
| Lastings Milledge | 0.297 |
| Garret Anderson | 0.297 |
| Willy Taveras | 0.296 |
| Adrian Beltre | 0.296 |
| Blake DeWitt | 0.296 |
| Jeff Keppinger | 0.296 |
| Kurt Suzuki | 0.296 |
| David Ortiz | 0.296 |
| Jose Lopez | 0.295 |
| Jay Payton | 0.295 |
| Troy Tulowitzki | 0.295 |
| Matt Stairs | 0.295 |
| Carlos Beltran | 0.295 |
| Vernon Wells | 0.295 |
| Billy Butler | 0.295 |
| Gregor Blanco | 0.295 |
| J.J. Hardy | 0.295 |
| Casey Kotchman | 0.295 |
| Jay Bruce | 0.295 |
| Bengie Molina | 0.294 |
| Ray Durham | 0.294 |
| Asdrubal Cabrera | 0.294 |
| Chris Coste | 0.294 |
| David Murphy | 0.294 |
| Alex Gordon | 0.293 |
| David Eckstein | 0.293 |
| Ian Kinsler | 0.293 |
| David Dellucci | 0.293 |
| Ben Francisco | 0.293 |
| Rich Aurilia | 0.293 |
| Jeremy Reed | 0.293 |
| Tadahito Iguchi | 0.293 |
| Melky Cabrera | 0.292 |
| John Bowker | 0.292 |
| Nate McLouth | 0.292 |
| Rick Ankiel | 0.292 |
| Carlos Delgado | 0.291 |
| Eric Hinske | 0.291 |
| Doug Mientkiewicz | 0.291 |
| Ramon Hernandez | 0.291 |
| Carlos Quentin | 0.290 |
| Orlando Cabrera | 0.290 |
| Paul Konerko | 0.290 |
| Yuniesky Betancourt | 0.290 |
| Jim Thome | 0.290 |
| Stephen Drew | 0.289 |
| Daric Barton | 0.289 |
| Bill Hall | 0.288 |
| Carlos Ruiz | 0.288 |
| Jason Kendall | 0.288 |
| Kenji Johjima | 0.287 |
| Scott Rolen | 0.287 |
| Geoff Blum | 0.287 |
| Alexi Casilla | 0.287 |
| A.J. Pierzynski | 0.287 |
| Jorge Cantu | 0.287 |
| Richie Sexson | 0.286 |
| Jeff Francoeur | 0.286 |
| Prince Fielder | 0.286 |
| Luke Scott | 0.286 |
| Jack Wilson | 0.286 |
| Jim Edmonds | 0.286 |
| Jason Michaels | 0.285 |
| Corey Patterson | 0.285 |
| Carlos Gonzalez | 0.285 |
| Gerald Laird | 0.285 |
| Alfredo Amezaga | 0.285 |
| Scott Hairston | 0.285 |
| Brandon Inge | 0.284 |
| Ken Griffey Jr. | 0.283 |
| Willy Aybar | 0.283 |
| Edwin Encarnacion | 0.283 |
| Carlos Pena | 0.283 |
| Jack Hannahan | 0.282 |
| Yadier Molina | 0.282 |
| Jose Bautista | 0.282 |
| Troy Glaus | 0.282 |
| Chris Young | 0.282 |
| Nick Punto | 0.281 |
| Jason Varitek | 0.281 |
| Brian Schneider | 0.280 |
| Rod Barajas | 0.279 |
| Emil Brown | 0.279 |
| Jesus Flores | 0.278 |
| Luis Gonzalez | 0.278 |
| Adam Dunn | 0.278 |
| Dioner Navarro | 0.278 |
| Juan Uribe | 0.277 |
| Chris Snyder | 0.277 |
| Gary Sheffield | 0.276 |
| Mike Jacobs | 0.276 |
| Miguel Olivo | 0.275 |
| Marcus Thames | 0.275 |
| John Buck | 0.275 |
| Pat Burrell | 0.274 |
| Bobby Crosby | 0.273 |
| Brad Wilkerson | 0.272 |
| Pedro Feliz | 0.271 |
| Jason Giambi | 0.271 |
| Nick Swisher | 0.271 |
| Joe Crede | 0.270 |
| Kevin Millar | 0.270 |
| Khalil Greene | 0.269 |
| Mark Ellis | 0.265 |
| Craig Counsell | 0.263 |
| Jeff Mathis | 0.256 |
| Omar Vizquel | 0.253 |

 

 

In doing my previous article for StatSpeak on projection systems and their ability to project various statistics, I realized that many of them were not especially good at projecting BABIP.  ZiPS has a tendency to project hitters to have very extreme BABIPs that are unlikely to occur.  PECOTA has a tendency to project speedy hitters to have the same high BABIPs that speedy hitters used to have before scouting data became so advanced (as PECOTA uses historical comparables), and CHONE safely projects hitters towards the mean, but all of the systems I studied had correlations with true BABIP of about .40-.44.  Even for those players with one year of data, the correlation I found was around .44, and for those hitters with three or more years of historical data, my projected BABIP had a correlation with true BABIP of .57.  These systems do incredibly well at projecting the three true outcomes, but 70.3% of plate appearances in 2008 resulted in an outcome other than a walk, strikeout, homerun, or hit by pitch.  As far as I know, none of the major projection systems use batted ball data for hitters to project statistics.  These systems are getting very good, though as Tom Tango has pointed out multiple times, the best systems only do slightly better than Marcel the Monkey.  I strongly believe that the way to improve projection is to incorporate the variables that I have used above to project BABIP in isolation.


5 Responses to Improving BABIP Projection by Batted Ball Types

  1. Red Sox Talk says:

    Couldn’t agree with you more. I am trying to use GB/FB/LD data in predicting BABIP for my projection system, but it’s really rudimentary right now:
    http://fantasyscope.wordpress.com/2008/11/13/2009-fantasyscope-early-projections/
    As an approximation, I average out historical BABIP with projected based on that data. Please see my post on BABIP estimation:
    http://saberrattling.wordpress.com/2008/12/03/working-the-numbers-on-babip-estimation/

  2. jinaz says:

    You’ve seen PrOPS, right? It wasn’t a projection system, per se (no attempt to incorporate age, no regression to the mean, etc), but rather a system to try to identify lucky/unlucky batters in a given year based on their batted ball statistics. It seemed to me to be a big step forward, and I still use it as a diagnostic for hitters. But as you said, “no one” has gone the next step and included batted ball data into a sophisticated projection system. -j

  3. Matt Swartz says:

    I’ve seen PrOPS before, but it seems like more of a “postdictive” rather than predictive system, right? It more or less predicts what BABIP should have been in the past year based on GB/FB and LD% that occurred in that year, right?
    That’s a little different, since this model was more or less predicting not only what GB%, FB%, and LD% were likely to be next year based on what they were the past few years, but also what BABIP on each of those batted ball types was likely to be. I think that people tend to ignore that power hitters have better averages on line drives because they hit them farther, guys who pop up a lot tend to have lower averages on flyballs because infield flies are easier to catch, and faster players have better batting averages on groundballs.
    Your point still holds, and I probably should have held back the statement a little bit. What I meant was that you can improve projections of those systems like CHONE, PECOTA, OLIVER, MARCEL, and ZIPS by looking at BABIP projection using batted ball rates and BABIP by batted ball.

  4. Mike says:

    This is some heavy work, Matt, and I’m glad you put the hours into this valuable topic. Good job overall.
    There is, however, one problem here. In a regression, you cannot include an interaction term like “(FB%)*(FB-BABIP)” without also including the individual components. You would need to add FB-BABIP.
    Also, as a side note, when you report an r^2 of .19, it means the model can explain 19% of the variance, not 44% [the sqrt(.19)] as you have said here.
    Finally, have you seen the work of Chris Dutton and Peter Bendix? You can find their work here:
    http://www.hardballtimes.com/main/article/batters-and-babip/
    Or, for what looks to be the full academic paper, here:
    http://tangotiger.net/tufts/understandingBABIP.pdf
    It would be interesting to see how your two (similar) approaches measure up to one another (and to other models) at the end of the year. I hope you will run an update for us!

  5. Matt Swartz says:

    I used the FB%*FB-BABIP term okay, I think, for the ones I did use. It is basically the percentage of balls in play on which a hitter gets a hit on a fly ball. The implication is that perhaps for smaller sample sizes like one year of data, it is better to directly consider how often that happens than to focus on what percent of balls in play are flyballs and what percent of those are hits.
    You’re right about r^2. Just sloppy on my part. Thank you for pointing that out.
    I have read that article. That article was retrospectively predicting BABIP, kind of like the LD%+.120 model that Studeman developed a while back. It was not intended to predict BABIP in the future. It also had a few structural issues like regressing BABIP on pitches per extra base hit, which obviously is going to be negatively correlated because you’re using part of the numerator of the dependent variable in one of the denominators of an independent variable. It was a good start though.
    I developed a model of predicting BABIP in the future on individual batted ball types over at another blog shortly after that:
    http://www.thegoodphight.com/2009/1/16/726379/babip-projection-and-new-s
    Dutton later tried one of those on his own, but used a lot of the same variables that were explained in an article Derek Carty wrote over at THT.
    http://www.hardballtimes.com/main/fantasy/article/whats-the-best-babip-estimator/
    I later developed a couple other articles which topped that r^2 though. Here is the first in that series:
    http://www.thegoodphight.com/2009/2/2/743228/improving-babip-estimation
    I’m having trouble finding the second in that series; it was here on StatSpeak, but I can’t find the link. The article above was the 3rd in that series. It did not include some of the variables for the larger dataset I used here, so my r^2 fell a little for the set with fewer years, but the r^2 in general is going to be higher this way since there are more direct historical BABIPs rather than a focus on correlates. Theirs is useful if you don’t have many years of data but you have a lot of data about one year, but I still think it’s best to use BABIPs in previous years directly.
