Skills, Repeatability, and Peripherals
March 2, 2009
With Pitch FX a few years old and Hit FX around the corner for next year, I thought it was important to figure out what exactly the sabermetric community should use this new information for when it comes out. The argument frequently made for looking at peripheral statistics is that they are more repeatable. For instance, strikeout rate, walk rate, and home run rate are more repeatable for both pitchers and hitters than batting average on balls in play. As a result, researchers have started studying pitcher and hitter performance by relying on more repeatable skills. In fact, newer statistics such as contact rate, swing rate, and many others are being used to help determine the reliability of those statistics.
I do not mean to critique this form of research; rather, I intend to specify its focus. I believe that peripheral statistics are most useful when dealing with less data. Many statistics, even those with relatively high autocorrelation, suffer from small sample size when you only have one year of data to draw on. They are imperfect approximations of the player’s true skill level. Hence, sabermetric researchers look for more reliable statistics to determine what the player’s true skill level is.
In this article, I will show several regression analyses for a few different statistics, and a clear pattern will emerge: peripheral statistics are only useful when you have insufficient data on the statistic you seek to predict.
First, consider strikeouts per at-bat for hitters with more than 300 PA in both years of any consecutive pair of years within 2005-2008.
Source      SS         df    MS           #Obs       625
Model       1.754011   4     0.438503     F(4,620)   512.17
Residual    0.530821   620   0.000856     Prob>F     0
Total       2.284831   624   0.003662     R^2        0.7677
                                          Adj R^2    0.7662
                                          RMSE       0.02926

K%2          Coef.      Std.Err.   t       P>t     95% CI low   95% CI high
K%1          0.710873   0.047914   14.84   0       0.61678      0.804965
OContact%1   -0.07572   0.01615    -4.69   0       -0.10744     -0.04401
ZContact%1   -0.10515   0.061988   -1.7    0.09    -0.22689     0.016578
Swing%1      -0.04951   0.025225   -1.96   0.05    -0.09905     2.36E-05
_cons        0.212677   0.067814   3.14    0.002   0.079503     0.34585
Here, K%2 is K/AB in year 2 and K%1 is K/AB in year 1; OContact%1 is the contact rate per swing on pitches outside the strike zone in year 1; ZContact%1 is the contact rate per swing on pitches in the strike zone in year 1; and Swing%1 is the percentage of pitches swung at in year 1. Note that the peripheral statistics carry weight with only one year of data available: OContact%1 is strongly significant, Swing%1 sits right at the 5% level, and ZContact%1 is significant at the 10% level.
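To make the one-year regression concrete, here is a minimal sketch that plugs year-1 numbers into the fitted coefficients. It assumes the three peripheral coefficients are negative, as their confidence intervals imply. The function name and the sample hitter are made up purely for illustration.

```python
# Coefficients from the one-year regression of K%2 on year-1 statistics.
# Assumption: the peripheral coefficients are negative, as their
# confidence intervals imply.
COEF = {
    "k_rate": 0.710873,
    "o_contact": -0.07572,
    "z_contact": -0.10515,
    "swing": -0.04951,
}
INTERCEPT = 0.212677


def predict_next_k_rate(k_rate, o_contact, z_contact, swing):
    """Return predicted year-2 K/AB from year-1 inputs (all as fractions)."""
    year1 = {"k_rate": k_rate, "o_contact": o_contact,
             "z_contact": z_contact, "swing": swing}
    return INTERCEPT + sum(COEF[name] * value for name, value in year1.items())


# Hypothetical hitter (made-up inputs, not a real player).
baseline = predict_next_k_rate(0.20, 0.60, 0.88, 0.45)
# Same hitter with ten more points of contact on pitches out of the zone.
better_contact = predict_next_k_rate(0.20, 0.70, 0.88, 0.45)
print(round(baseline, 4), round(better_contact, 4))
```

Holding the prior strikeout rate fixed, the extra contact on pitches out of the zone lowers the predicted strikeout rate by a bit under one percentage point. That is the size of adjustment the peripherals contribute when only one year is available.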
However, when you increase the sample size a little by adding in another year, these statistics are no longer useful. Consider the following regression output, which predicts the 2008 strikeout rate from the 2006 and 2007 rates and the 2007 peripherals:
Source      SS         df    MS           #Obs       137
Model       0.379447   5     0.075889     F(5,131)   108.58
Residual    0.09156    131   0.000699     Prob>F     0
Total       0.471007   136   0.003463     R^2        0.8056
                                          Adj R^2    0.7982
                                          RMSE       0.02644

K%08          Coef.      Std.Err.   t       P>t     95% CI low   95% CI high
K%06          0.269959   0.083118   3.25    0.001   0.105532     0.434386
K%07          0.548566   0.115553   4.75    0       0.319975     0.777158
OContact%07   -0.01525   0.041816   -0.36   0.716   -0.09797     0.067474
ZContact%07   -0.09129   0.11999    -0.76   0.448   -0.32865     0.146084
Swing%07      -0.02695   0.05308    -0.51   0.613   -0.13195     0.078059
_cons         0.133399   0.142735   0.93    0.352   -0.14896     0.415762
While the peripheral coefficients keep their original signs, they are no longer remotely statistically significant. Adding in a third year only strengthens this case. In fact, if you had to choose between one or the other, the previous year’s strikeout rate is itself more relevant than the previous year’s peripheral statistics, as the following two regressions show:
Source      SS         df    MS           #Obs       625
Model       1.565549   3     0.52185      F(3,621)   450.24
Residual    0.719282   621   0.001158     Prob>F     0
Total       2.284831   624   0.003662     R^2        0.6852
                                          Adj R^2    0.6837
                                          RMSE       0.03403

K%2          Coef.      Std.Err.   t        P>t   95% CI low   95% CI high
OContact%1   -0.1576    0.017654   -8.93    0     -0.19226     -0.12293
ZContact%1   -0.82171   0.045198   -18.18   0     -0.91046     -0.73295
Swing%1      -0.20744   0.0266     -7.8     0     -0.25968     -0.15521
_cons        1.096603   0.037677   29.11    0     1.022614     1.170592

Source      SS         df    MS           #Obs       625
Model       1.729715   1     1.729715     F(1,623)   1941.24
Residual    0.555116   623   0.000891     Prob>F     0
Total       2.284831   624   0.003662     R^2        0.757
                                          Adj R^2    0.7567
                                          RMSE       0.02985

K%2     Coef.      Std.Err.   t       P>t   95% CI low   95% CI high
K%1     0.888271   0.020161   44.06   0     0.84868      0.927862
_cons   0.020907   0.003741   5.59    0     0.013561     0.028252
The R^2 statistic is noticeably larger when regressing K% in the second year on K% in the first year (0.757) than when predicting K% in the second year from contact rate on pitches in and out of the strike zone and swing rate (0.685).
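The same comparison can be made with a partial F-test built from the residual sums of squares reported in the three one-year K%2 regressions (all on the same 625 observations). This is a rough sketch rather than part of the original analysis, and the helper name `partial_f` is my own.

```python
def partial_f(rss_restricted, rss_full, extra_terms, df_resid_full):
    """F statistic for the terms in the full model that the restricted model omits."""
    numerator = (rss_restricted - rss_full) / extra_terms
    denominator = rss_full / df_resid_full
    return numerator / denominator


# Residual sums of squares as reported in the regression output.
RSS_FULL = 0.530821          # K%1 + OContact%1 + ZContact%1 + Swing%1 (620 residual df)
RSS_K_ONLY = 0.555116        # K%1 only
RSS_PERIPHERALS = 0.719282   # OContact%1 + ZContact%1 + Swing%1 only

# What do the three peripherals add once K%1 is already in the model?
f_add_peripherals = partial_f(RSS_K_ONLY, RSS_FULL, 3, 620)
# What does K%1 add once the three peripherals are already in the model?
f_add_k_rate = partial_f(RSS_PERIPHERALS, RSS_FULL, 1, 620)

print(f"adding peripherals to K%1: F(3, 620) = {f_add_peripherals:.2f}")
print(f"adding K%1 to peripherals: F(1, 620) = {f_add_k_rate:.2f}")
```

The first statistic comes out near 9.5 and the second near 220. Both clear conventional critical values, but the prior strikeout rate explains far more of what the peripherals leave unexplained than the reverse.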
Walk rate is similar. Initially, adding in a peripheral statistic (here OSwing%, the swing rate on pitches outside the strike zone) helps predict walk rate. Note the statistical significance:
Source      SS         df    MS           #Obs       625
Model       0.455716   2     0.227858     F(2,622)   478.94
Residual    0.29592    622   0.000476     Prob>F     0
Total       0.751636   624   0.001205     R^2        0.6063
                                          Adj R^2    0.605
                                          RMSE       0.02181

BB%2       Coef.      Std.Err.   t       P>t     95% CI low   95% CI high
BB%1       0.719233   0.032692   22      0       0.655034     0.783432
OSwing%1   -0.05972   0.018512   -3.23   0.001   -0.09607     -0.02337
_cons      0.040446   0.006507   6.22    0       0.027668     0.053224
Of course, add in a second year of data, and it is no longer
useful to include OSwing% from the previous year.
Source      SS         df    MS           #Obs       137
Model       0.110236   3     0.036745     F(3,133)   91.06
Residual    0.053671   133   0.000404     Prob>F     0
Total       0.163907   136   0.001205     R^2        0.6726
                                          Adj R^2    0.6652
                                          RMSE       0.02009

BB%08       Coef.      Std.Err.   t       P>t     95% CI low   95% CI high
BB%06       0.256429   0.079107   3.24    0.002   0.099959     0.412899
BB%07       0.535592   0.088404   6.06    0       0.360732     0.710451
OSwing%07   -0.05531   0.037621   -1.47   0.144   -0.12972     0.019107
_cons       0.034832   0.01453    2.4     0.018   0.006093     0.063572
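A quick way to see why OSwing% drops out is to rebuild its t-statistic and an approximate 95% interval from the reported coefficient and standard error in each walk-rate regression. The sketch below uses a normal approximation for the critical value, so its intervals differ slightly from the ones in the output. It also assumes the OSwing% coefficients are negative, as their confidence intervals imply. The `summarize` helper is hypothetical.

```python
from statistics import NormalDist

# About 1.96; a normal approximation to the t critical value (an assumption).
Z_975 = NormalDist().inv_cdf(0.975)


def summarize(name, coef, std_err):
    """Return the t-statistic and approximate 95% interval for one coefficient."""
    t_stat = coef / std_err
    low = coef - Z_975 * std_err
    high = coef + Z_975 * std_err
    return {"name": name, "t": t_stat, "ci": (low, high),
            "excludes_zero": low > 0 or high < 0}


# Assumption: OSwing% coefficients are negative, as their intervals imply.
one_year = summarize("OSwing%1", -0.05972, 0.018512)    # alongside BB%1
two_year = summarize("OSwing%07", -0.05531, 0.037621)   # alongside BB%06 and BB%07

for row in (one_year, two_year):
    low, high = row["ci"]
    print(f"{row['name']}: t = {row['t']:.2f}, "
          f"95% CI = ({low:.4f}, {high:.4f}), excludes zero: {row['excludes_zero']}")
```

With one prior year of walk rate in the model, the OSwing% interval sits entirely below zero. With two prior years it straddles zero, which matches the reported p-values of 0.001 and 0.144.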
In the interest of space, I will leave out some other regressions I ran, but the same phenomenon occurred for log home run rate for hitters, strikeout rate for pitchers, and walk rate for pitchers, and several other statistics exhibit similar patterns as well.
The general point that I am making is that, as Hit FX and more statistics become available, the skills that these statistics capture (contact rate, swing rate, groundball rate, etc.) are used differently by different hitters to yield different results. As we try to predict those results, the most useful statistics are often the historical records of those very results themselves. In other words, these new statistics are going to be most useful when trying to predict second-year players, and are not going to add much insight into predicting the performance of veterans.