Skills, Repeatability, and Peripherals

With Pitch FX a few years old and Hit FX around the corner for next year, I thought it was important to figure out what exactly the sabermetric community should use this new information for when it arrives. The argument frequently made for looking at peripheral statistics is that they are more repeatable. For instance, strikeout rate, walk rate, and home run rate are more repeatable for both pitchers and hitters than batting average on balls in play. As a result, researchers have started studying pitcher and hitter performance by relying on more repeatable skills. In fact, newer statistics such as contact rate, swing rate, and many others are being used to help determine how reliable a player’s strikeout, walk, and home run rates are.

I do not mean to critique this form of research; rather, I intend to specify its focus. I believe that peripheral statistics are most useful when dealing with less data. Many statistics, even those with relatively high autocorrelation, suffer from small sample size when you have only one year of data to work with. They are imperfect approximations of the player’s true skill level. Hence, sabermetric researchers look for more reliable statistics to determine what that true skill level is.

In this article, I will show several regression analyses for a few different statistics, and a clear pattern will emerge: peripheral statistics are only useful when you have insufficient data on the statistic you seek to predict.

First, consider strikeouts per at-bat for hitters with more than 300 PA in each year of any consecutive pair of years within 2005-2008.

Source      SS          df     MS
Model       1.754011    4      0.438503
Residual    0.530821    620    0.000856
Total       2.284831    624    0.003662

#Obs = 625, F(4,620) = 512.17, Prob>F = 0, R-squared = 0.7677, Adj R^2 = 0.7662, RMSE = 0.02926

K%2            Coef.       Std.Err.    t        P>|t|    95%CI min    95%CI max
K%1            0.710873    0.047914    14.84    0        0.61678      0.804965
O-Contact%1    -0.07572    0.01615     -4.69    0        -0.10744     -0.04401
Z-Contact%1    -0.10515    0.061988    -1.7     0.09     -0.22689     0.016578
Swing%1        -0.04951    0.025225    -1.96    0.05     -0.09905     2.36E-05
_cons          0.212677    0.067814    3.14     0.002    0.079503     0.34585

Here, K%2 is K/AB in year 2 and K%1 is K/AB in year 1; O-Contact%1 is the contact rate per swing on pitches outside the strike zone in year 1; Z-Contact%1 is the contact rate per swing on pitches inside the strike zone in year 1; and Swing%1 is the percentage of pitches swung at in year 1. Note that these peripheral statistics carry predictive weight when only one year of data is available: O-Contact%1 is clearly significant, Swing%1 sits right at the 5% level, and Z-Contact%1 is significant only at the 10% level.
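As a quick check on how to read the coefficient table, the t column is simply each coefficient divided by its standard error. The sketch below recomputes it from the reported values and attaches a two-sided p-value. The p-values use a normal approximation, which is an assumption of this sketch (reasonable with 620 residual degrees of freedom, but not identical to the t-based values in the regression output).

```python
from statistics import NormalDist

# (coefficient, standard error) pairs copied from the one-year K% regression.
estimates = {
    "K%1": (0.710873, 0.047914),
    "O-Contact%1": (-0.07572, 0.01615),
    "Z-Contact%1": (-0.10515, 0.061988),
    "Swing%1": (-0.04951, 0.025225),
}


def t_and_p(coef, se):
    """Return the t-statistic and a two-sided p-value.

    The p-value uses the standard normal distribution as an approximation
    to the t distribution (an assumption made to stay within the stdlib).
    """
    t = coef / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


results = {name: t_and_p(coef, se) for name, (coef, se) in estimates.items()}

for name, (t, p) in results.items():
    print(f"{name:12s} t = {t:6.2f}   p = {p:.3f}")
```

Run as written, the recomputed t-statistics should agree with the table to two decimals, which is a useful sanity check when transcribing regression output.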

 

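However, increase the amount of data on each hitter a little by adding in another year, and these statistics are no longer useful. Consider the following regression output, which predicts 2008 strikeout rate from the 2006 and 2007 strikeout rates plus the 2007 peripherals: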

Source      SS          df     MS
Model       0.379447    5      0.075889
Residual    0.09156     131    0.000699
Total       0.471007    136    0.003463

#Obs = 137, F(5,131) = 108.58, Prob>F = 0, R^2 = 0.8056, Adj R^2 = 0.7982, RMSE = 0.02644

K%08            Coef.       Std.Err.    t        P>|t|    95%CI min    95%CI max
K%06            0.269959    0.083118    3.25     0.001    0.105532     0.434386
K%07            0.548566    0.115553    4.75     0        0.319975     0.777158
O-Contact%07    -0.01525    0.041816    -0.36    0.716    -0.09797     0.067474
Z-Contact%07    -0.09129    0.11999     -0.76    0.448    -0.32865     0.146084
Swing%07        -0.02695    0.05308     -0.51    0.613    -0.13195     0.078059
_cons           0.133399    0.142735    0.93     0.352    -0.14896     0.415762

While the peripheral statistics keep their original signs, none of them is remotely statistically significant any longer. Adding in a third year only further strengthens this case. In fact, if you had to choose between the two, the previous year’s strikeout rate is itself more relevant than the previous year’s peripheral statistics, as the next two regressions show.

Source      SS          df     MS
Model       1.565549    3      0.52185
Residual    0.719282    621    0.001158
Total       2.284831    624    0.003662

#Obs = 625, F(3,621) = 450.24, Prob>F = 0, R^2 = 0.6852, Adj R^2 = 0.6837, RMSE = 0.03403

K%2            Coef.       Std.Err.    t         P>|t|    95%CI min    95%CI max
O-Contact%1    -0.1576     0.017654    -8.93     0        -0.19226     -0.12293
Z-Contact%1    -0.82171    0.045198    -18.18    0        -0.91046     -0.73295
Swing%1        -0.20744    0.0266      -7.8      0        -0.25968     -0.15521
_cons          1.096603    0.037677    29.11     0        1.022614     1.170592

Source      SS          df     MS
Model       1.729715    1      1.729715
Residual    0.555116    623    0.000891
Total       2.284831    624    0.003662

#Obs = 625, F(1,623) = 1941.24, Prob>F = 0, R^2 = 0.757, Adj R^2 = 0.7567, RMSE = 0.02985

K%2      Coef.       Std.Err.    t        P>|t|    95%CI min    95%CI max
K%1      0.888271    0.020161    44.06    0        0.84868      0.927862
_cons    0.020907    0.003741    5.59     0        0.013561     0.028252

The R^2 is noticeably larger when regressing K% in the second year on K% in the first year (0.757) than when predicting K% in the second year from contact rate on pitches in and out of the strike zone and swing rate (0.685).
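The comparison between these regressions can be made a little more formal with the sums of squares already reported. The sketch below recomputes each R^2 as 1 - Residual SS / Total SS and then forms partial F-statistics for the nested models. It assumes all three one-year regressions were fit on the same 625 observations, which the identical Total SS (2.284831) suggests; the dictionary and function names are illustrative only.

```python
from math import sqrt

# Residual SS and residual df copied from the three one-year K% regressions.
TOTAL_SS = 2.284831
full = {"rss": 0.530821, "df": 620}              # K%1 plus three peripherals
k_only = {"rss": 0.555116, "df": 623}            # K%1 alone
peripherals_only = {"rss": 0.719282, "df": 621}  # three peripherals alone


def r_squared(model):
    """Share of the total variation explained by the model."""
    return 1.0 - model["rss"] / TOTAL_SS


def partial_f(restricted, unrestricted):
    """F-statistic for the predictors dropped from the unrestricted model."""
    dropped = restricted["df"] - unrestricted["df"]
    extra_ss = restricted["rss"] - unrestricted["rss"]
    return (extra_ss / dropped) / (unrestricted["rss"] / unrestricted["df"])


# Do the three peripherals add anything once K%1 is in the model?
f_peripherals = partial_f(k_only, full)
# Does K%1 add anything once the three peripherals are in the model?
f_k1 = partial_f(peripherals_only, full)

print(f"R^2 K%1 only:         {r_squared(k_only):.4f}")
print(f"R^2 peripherals only: {r_squared(peripherals_only):.4f}")
print(f"R^2 full model:       {r_squared(full):.4f}")
print(f"F for peripherals given K%1: {f_peripherals:.2f}")
print(f"F for K%1 given peripherals: {f_k1:.2f} (sqrt = {sqrt(f_k1):.2f})")
```

With one year of data, the F-statistic for the peripherals comes out at roughly 9.5 on (3, 620) degrees of freedom, while the F-statistic for K%1 is far larger; its square root reproduces the K%1 t-statistic of about 14.84 from the full regression.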

 

Walk rate is similar. Initially, adding in a peripheral statistic helps predict walk rate. Note the statistical significance of O-Swing%1, the rate of swings at pitches outside the strike zone in year 1:

Source      SS          df     MS
Model       0.455716    2      0.227858
Residual    0.29592     622    0.000476
Total       0.751636    624    0.001205

#Obs = 625, F(2,622) = 478.94, Prob>F = 0, R^2 = 0.6063, Adj R^2 = 0.605, RMSE = 0.02181

BB%2         Coef.       Std.Err.    t        P>|t|    95%CI min    95%CI max
BB%1         0.719233    0.032692    22       0        0.655034     0.783432
O-Swing%1    -0.05972    0.018512    -3.23    0.001    -0.09607     -0.02337
_cons        0.040446    0.006507    6.22     0        0.027668     0.053224

Once again, add in a second year of data, and it is no longer useful to include O-Swing% from the previous year; its coefficient is no longer statistically significant (p = 0.144).

Source      SS          df     MS
Model       0.110236    3      0.036745
Residual    0.053671    133    0.000404
Total       0.163907    136    0.001205

#Obs = 137, F(3,133) = 91.06, Prob>F = 0, R^2 = 0.6726, Adj R^2 = 0.6652, RMSE = 0.02009

BB%08         Coef.       Std.Err.    t        P>|t|    95%CI min    95%CI max
BB%06         0.256429    0.079107    3.24     0.002    0.099959     0.412899
BB%07         0.535592    0.088404    6.06     0        0.360732     0.710451
O-Swing%07    -0.05531    0.037621    -1.47    0.144    -0.12972     0.019107
_cons         0.034832    0.01453     2.4      0.018    0.006093     0.063572

In the interest of space, I will leave out some other regressions I ran, but the same phenomenon occurred for log home run rate for hitters, strikeout rate for pitchers, and walk rate for pitchers, and several other statistics exhibit similar patterns as well.

The general point that I am making is that as Hit FX and more statistics become available, the skills that these statistics represent (contact rate, swing rate, groundball rate, etc.) are used differently by different hitters to yield different results. As we try to predict those results, the most useful predictors are often the historical records of those very statistics. In other words, these new statistics are going to be most useful when predicting second-year players, and are not going to add much insight into predicting the performance of veterans.
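To make "predicting a statistic from its own history" concrete, here is a minimal sketch that applies the K%1-only regression reported earlier (intercept 0.020907, slope 0.888271) to a few hypothetical hitters. The example strikeout rates are made up for illustration, and the function name is not from the original analysis.

```python
# Coefficients copied from the regression of K%2 on K%1 alone.
INTERCEPT = 0.020907
SLOPE = 0.888271


def predict_k2(k1):
    """Predicted year-2 strikeouts per at-bat given the year-1 rate."""
    return INTERCEPT + SLOPE * k1


# Rate at which the fitted line predicts no change from year 1 to year 2.
fixed_point = INTERCEPT / (1.0 - SLOPE)

# Hypothetical year-1 strikeout rates, chosen only to illustrate the line.
for k1 in (0.10, 0.20, 0.30):
    print(f"K%1 = {k1:.2f} -> predicted K%2 = {predict_k2(k1):.4f}")

print(f"no-change point: {fixed_point:.4f}")
```

Because the slope is below one, the fitted line pulls extreme strikeout rates back toward the middle: hitters above roughly 18.7% strikeouts per at-bat are predicted to strike out slightly less the next year, and hitters below that rate slightly more.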

One Response to Skills, Repeatability, and Peripherals

  1. AustinMatherne says:

    I just skimmed through the article, but it’s a very interesting read, and I’ll be subscribing to the RSS feed.
    ~ Austin
