Skills, Repeatability, and Peripherals
March 2, 2009 1 Comment
With Pitch FX a few years old and Hit FX around the corner
for next year, I thought it was important to figure out what exactly the
sabermetric community should use this new information for when it arrives.
The argument frequently made for looking at peripheral statistics is that
they are more repeatable. For instance, strikeout rate, walk rate, and home
run rate are more repeatable for both pitchers and hitters than batting
average on balls in play. As a result, researchers have started studying
pitcher and hitter performance by relying on more repeatable skills. In fact,
newer statistics such as contact rate, swing rate, and many others are being
used to help determine the reliability of those statistics.
I do not mean to critique this form of research; rather,
I intend to specify its focus. I believe that peripheral statistics are most
useful when dealing with less data. Many statistics, even those with
relatively high autocorrelation, suffer from small sample size when you have
only one year of data to draw on. They are imperfect approximations of the
player’s true skill level. Hence, sabermetric researchers look for more
reliable statistics to determine what that true skill level is.
In this article, I will show regression analyses for a few different
statistics, and a clear pattern will emerge: peripheral statistics are only
useful when you have insufficient data on the statistic you seek to predict.
First, consider strikeouts per at-bat for hitters with more
than 300 PA in any pair of consecutive years within 2005-2008.
| Source   | SS       | df  | MS       |  | #Obs     | 625     |
| Model    | 1.754011 | 4   | 0.438503 |  | F(4,620) | 512.17  |
| Residual | 0.530821 | 620 | 0.000856 |  | Prob>F   | 0       |
| Total    | 2.284831 | 624 | 0.003662 |  | R^2      | 0.7677  |
|          |          |     |          |  | Adj R^2  | 0.7662  |
|          |          |     |          |  | RMSE     | 0.02926 |

| K%2         | Coef.    | Std.Err. | t     | P>|t| | 95% CI lower | 95% CI upper |
| K%1         | 0.710873 | 0.047914 | 14.84 | 0     | 0.61678      | 0.804965     |
| O-Contact%1 | -0.07572 | 0.01615  | -4.69 | 0     | -0.10744     | -0.04401     |
| Z-Contact%1 | -0.10515 | 0.061988 | -1.7  | 0.09  | -0.22689     | 0.016578     |
| Swing%1     | -0.04951 | 0.025225 | -1.96 | 0.05  | -0.09905     | 2.36E-05     |
| _cons       | 0.212677 | 0.067814 | 3.14  | 0.002 | 0.079503     | 0.34585      |
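Here, K%2 is K/AB in year 2 and K%1 is K/AB in year 1.
O-Contact%1 is the contact rate per swing on pitches outside the strike zone
in year 1, Z-Contact%1 is the contact rate per swing on pitches inside the
strike zone in year 1, and Swing%1 is the percentage of pitches swung at in
year 1. Note that these peripheral statistics carry predictive weight with
only one year of data available: O-Contact%1 is clearly significant, Swing%1
sits right at the 5% level (p = 0.05), and Z-Contact%1 is significant only at
the 10% level (p = 0.09).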
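To make the year-1 and year-2 setup concrete, here is a minimal sketch of how such a sample of consecutive-year pairs could be assembled. It is not the code used for this article: the `Season` record, the `consecutive_pairs` helper, and the made-up seasons are all hypothetical, and it assumes the 300 PA cutoff applies to each year of the pair.

```python
from typing import NamedTuple


class Season(NamedTuple):
    player: str  # hypothetical player identifier
    year: int
    pa: int      # plate appearances
    ab: int      # at-bats
    so: int      # strikeouts


def consecutive_pairs(seasons, min_pa=300, first_year=2005, last_year=2008):
    """Return (year-1, year-2) season pairs for the same hitter.

    Assumption: a pair qualifies only if BOTH seasons exceed min_pa.
    """
    by_key = {(s.player, s.year): s for s in seasons}
    pairs = []
    for s in seasons:
        nxt = by_key.get((s.player, s.year + 1))
        if nxt is None:
            continue  # no season in the following year
        in_window = first_year <= s.year and nxt.year <= last_year
        if in_window and s.pa > min_pa and nxt.pa > min_pa:
            pairs.append((s, nxt))
    return pairs


def k_rate(season):
    """Strikeouts per at-bat, the K% definition used in the tables."""
    return season.so / season.ab


# Made-up seasons purely for illustration.
seasons = [
    Season("A", 2005, 600, 540, 108),
    Season("A", 2006, 620, 550, 99),
    Season("A", 2007, 580, 520, 104),
    Season("B", 2005, 500, 450, 90),
    Season("B", 2007, 510, 460, 92),  # gap year: no consecutive pair
    Season("C", 2006, 350, 320, 80),
    Season("C", 2007, 200, 180, 50),  # too few PA in year 2
]

pairs = consecutive_pairs(seasons)
# One regression row per pair: (player, year 1, K%1, K%2)
rows = [(a.player, a.year, k_rate(a), k_rate(b)) for a, b in pairs]
for row in rows:
    print(row)
```

Note that a hitter with three qualifying seasons contributes two rows, which is one way a 2005-2008 window can yield several hundred observations.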
However, when you add a little more history for each player by including
another year of the statistic itself, these peripheral statistics are no
longer useful. Consider the following regression output:
| Source   | SS       | df  | MS       |  | #Obs     | 137     |
| Model    | 0.379447 | 5   | 0.075889 |  | F(5,131) | 108.58  |
| Residual | 0.09156  | 131 | 0.000699 |  | Prob>F   | 0       |
| Total    | 0.471007 | 136 | 0.003463 |  | R^2      | 0.8056  |
|          |          |     |          |  | Adj R^2  | 0.7982  |
|          |          |     |          |  | RMSE     | 0.02644 |

| K%08         | Coef.    | Std.Err. | t     | P>|t| | 95% CI lower | 95% CI upper |
| K%06         | 0.269959 | 0.083118 | 3.25  | 0.001 | 0.105532     | 0.434386     |
| K%07         | 0.548566 | 0.115553 | 4.75  | 0     | 0.319975     | 0.777158     |
| O-Contact%07 | -0.01525 | 0.041816 | -0.36 | 0.716 | -0.09797     | 0.067474     |
| Z-Contact%07 | -0.09129 | 0.11999  | -0.76 | 0.448 | -0.32865     | 0.146084     |
| Swing%07     | -0.02695 | 0.05308  | -0.51 | 0.613 | -0.13195     | 0.078059     |
| _cons        | 0.133399 | 0.142735 | 0.93  | 0.352 | -0.14896     | 0.415762     |
While the peripheral statistics keep their original signs, they are
no longer anywhere near statistically significant (all p-values are above
0.4). Adding a third year only strengthens this case. In fact, if you had to
choose between one or the other, the previous year’s strikeout rate is more
relevant than the previous year’s peripheral statistics.
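The loss of significance can be checked directly from the coefficients and standard errors reported in the two-year regression. The sketch below recomputes each t statistic and an approximate 95% interval. It uses a normal approximation (about 1.96 standard errors) rather than the exact t distribution behind the table, so its intervals come out slightly narrower than the published ones; the helper name `t_and_interval` is my own.

```python
from statistics import NormalDist


def t_and_interval(coef, std_err, level=0.95):
    """t statistic and an approximate (normal-based) confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return coef / std_err, (coef - z * std_err, coef + z * std_err)


# (coefficient, standard error) copied from the two-year K% regression.
two_year = {
    "K%07": (0.548566, 0.115553),
    "O-Contact%07": (-0.01525, 0.041816),
    "Z-Contact%07": (-0.09129, 0.11999),
    "Swing%07": (-0.02695, 0.05308),
}

for name, (coef, se) in two_year.items():
    t, (lo, hi) = t_and_interval(coef, se)
    includes_zero = lo < 0 < hi
    print(f"{name:13s} t={t:6.2f}  CI=({lo:.3f}, {hi:.3f})  includes 0: {includes_zero}")
```

The interval for K%07 stays well above zero, while each peripheral's interval straddles zero, which is what "not significant" means here.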
| Source   | SS       | df  | MS       |  | #Obs     | 625     |
| Model    | 1.565549 | 3   | 0.52185  |  | F(3,621) | 450.24  |
| Residual | 0.719282 | 621 | 0.001158 |  | Prob>F   | 0       |
| Total    | 2.284831 | 624 | 0.003662 |  | R^2      | 0.6852  |
|          |          |     |          |  | Adj R^2  | 0.6837  |
|          |          |     |          |  | RMSE     | 0.03403 |

| K%2         | Coef.    | Std.Err. | t      | P>|t| | 95% CI lower | 95% CI upper |
| O-Contact%1 | -0.1576  | 0.017654 | -8.93  | 0     | -0.19226     | -0.12293     |
| Z-Contact%1 | -0.82171 | 0.045198 | -18.18 | 0     | -0.91046     | -0.73295     |
| Swing%1     | -0.20744 | 0.0266   | -7.8   | 0     | -0.25968     | -0.15521     |
| _cons       | 1.096603 | 0.037677 | 29.11  | 0     | 1.022614     | 1.170592     |

| Source   | SS       | df  | MS       |  | #Obs     | 625     |
| Model    | 1.729715 | 1   | 1.729715 |  | F(1,623) | 1941.24 |
| Residual | 0.555116 | 623 | 0.000891 |  | Prob>F   | 0       |
| Total    | 2.284831 | 624 | 0.003662 |  | R^2      | 0.757   |
|          |          |     |          |  | Adj R^2  | 0.7567  |
|          |          |     |          |  | RMSE     | 0.02985 |

| K%2   | Coef.    | Std.Err. | t     | P>|t| | 95% CI lower | 95% CI upper |
| K%1   | 0.888271 | 0.020161 | 44.06 | 0     | 0.84868      | 0.927862     |
| _cons | 0.020907 | 0.003741 | 5.59  | 0     | 0.013561     | 0.028252     |
The R^2 statistic is larger when regressing K% in the second year on K% in
the first year (0.757) than when trying to predict K% in the second year as a
function of contact rate on pitches in and out of the strike zone and swing
rate (0.685).
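The three one-year K% regressions share the same 625 observations and the same total sum of squares, so their published sums of squares can be compared directly. The sketch below recomputes each R^2 and adds a nested-model F statistic for what the three peripherals contribute on top of K%1. That F test does not appear in the regression output above; it is my own arithmetic on the reported residual sums of squares, and the function names are illustrative.

```python
def r_squared(model_ss, total_ss):
    """Share of total variation explained by the model."""
    return model_ss / total_ss


def nested_f(rss_restricted, rss_full, extra_terms, df_resid_full):
    """F statistic for adding extra_terms regressors to a restricted model."""
    gain_per_term = (rss_restricted - rss_full) / extra_terms
    return gain_per_term / (rss_full / df_resid_full)


TOTAL_SS = 2.284831  # identical in all three one-year K% regressions

r2_peripherals = r_squared(1.565549, TOTAL_SS)  # O-Contact%, Z-Contact%, Swing%
r2_k_only = r_squared(1.729715, TOTAL_SS)       # K%1 alone
r2_combined = r_squared(1.754011, TOTAL_SS)     # K%1 plus the peripherals

# Restricted model: K%1 only (residual SS 0.555116).
# Full model: K%1 plus three peripherals (residual SS 0.530821, 620 residual df).
f_stat = nested_f(0.555116, 0.530821, extra_terms=3, df_resid_full=620)

print(f"R^2 peripherals only: {r2_peripherals:.4f}")
print(f"R^2 K%1 only:         {r2_k_only:.4f}")
print(f"R^2 combined:         {r2_combined:.4f}")
print(f"F for adding peripherals to K%1: {f_stat:.2f}")
```

Read this way, the peripherals raise R^2 only from about 0.757 to 0.768 once K%1 is in the model. The gain is small but still detectable with one year of data, which fits the pattern described in this article.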
Walk rate is similar.
Initially, adding a peripheral statistic helps predict walk rate. Note the
statistical significance of O-Swing%, the percentage of pitches outside the
strike zone that the hitter swings at:
| Source   | SS       | df  | MS       |  | #Obs     | 625     |
| Model    | 0.455716 | 2   | 0.227858 |  | F(2,622) | 478.94  |
| Residual | 0.29592  | 622 | 0.000476 |  | Prob>F   | 0       |
| Total    | 0.751636 | 624 | 0.001205 |  | R^2      | 0.6063  |
|          |          |     |          |  | Adj R^2  | 0.605   |
|          |          |     |          |  | RMSE     | 0.02181 |

| BB%2      | Coef.    | Std.Err. | t     | P>|t| | 95% CI lower | 95% CI upper |
| BB%1      | 0.719233 | 0.032692 | 22    | 0     | 0.655034     | 0.783432     |
| O-Swing%1 | -0.05972 | 0.018512 | -3.23 | 0.001 | -0.09607     | -0.02337     |
| _cons     | 0.040446 | 0.006507 | 6.22  | 0     | 0.027668     | 0.053224     |
Of course, add a second year of walk rate data, and it is no longer
useful to include O-Swing% from the previous year (p = 0.144).
| Source   | SS       | df  | MS       |  | #Obs     | 137     |
| Model    | 0.110236 | 3   | 0.036745 |  | F(3,133) | 91.06   |
| Residual | 0.053671 | 133 | 0.000404 |  | Prob>F   | 0       |
| Total    | 0.163907 | 136 | 0.001205 |  | R^2      | 0.6726  |
|          |          |     |          |  | Adj R^2  | 0.6652  |
|          |          |     |          |  | RMSE     | 0.02009 |

| BB%08      | Coef.    | Std.Err. | t     | P>|t| | 95% CI lower | 95% CI upper |
| BB%06      | 0.256429 | 0.079107 | 3.24  | 0.002 | 0.099959     | 0.412899     |
| BB%07      | 0.535592 | 0.088404 | 6.06  | 0     | 0.360732     | 0.710451     |
| O-Swing%07 | -0.05531 | 0.037621 | -1.47 | 0.144 | -0.12972     | 0.019107     |
| _cons      | 0.034832 | 0.01453  | 2.4   | 0.018 | 0.006093     | 0.063572     |
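To show how little O-Swing% moves the forecast once two years of walk rate are in hand, the sketch below plugs the coefficients from the two-year walk rate regression into a prediction function. The hitter's inputs are made up, and it assumes that all rates are expressed as fractions (0.10 rather than 10), which is what the size of the coefficients and the RMSE suggest.

```python
# Coefficients copied from the two-year BB% regression.
BB_TWO_YEAR = {
    "const": 0.034832,
    "bb_06": 0.256429,
    "bb_07": 0.535592,
    "o_swing_07": -0.05531,
}


def predict_bb08(bb_06, bb_07, o_swing_07, coefs=BB_TWO_YEAR):
    """Fitted 2008 walk rate from 2006-2007 walk rates and 2007 O-Swing%."""
    return (
        coefs["const"]
        + coefs["bb_06"] * bb_06
        + coefs["bb_07"] * bb_07
        + coefs["o_swing_07"] * o_swing_07
    )


# Hypothetical hitter: walk rates of 0.10 and 0.12, chasing 25% of pitches
# outside the zone in 2007.
baseline = predict_bb08(bb_06=0.10, bb_07=0.12, o_swing_07=0.25)

# Same walk history, but a hitter who chases ten percentage points more often.
free_swinger = predict_bb08(bb_06=0.10, bb_07=0.12, o_swing_07=0.35)

print(f"baseline forecast:     {baseline:.4f}")
print(f"free swinger forecast: {free_swinger:.4f}")
print(f"difference:            {baseline - free_swinger:.4f}")
```

A ten-point swing in O-Swing% shifts the forecast by roughly half a percentage point of walk rate, far less than the regression's RMSE of about 0.02.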
In the interest of space, I will leave out the other
regressions I ran, but the same phenomenon occurred for log home run rate for
hitters, strikeout rate for pitchers, and walk rate for pitchers, and several
other statistics exhibit similar patterns as well.
The general point that I am making is that, as Hit FX and
more statistics become available, the skills that these statistics better
represent (contact rate, swing rate, groundball rate, etc.) are used
differently by different hitters to yield different results. As we try to
predict those results, the most useful predictors are often the historical
records of those very statistics themselves. In other words, these new
statistics are going to be most useful when trying to predict second-year
players, and are not going to add much insight into predicting the
performance of veterans.
I just skimmed through the article, but it’s a very interesting read, and I’ll be subscribing to the RSS feed.
~ Austin