Does PECOTA overestimate the batting averages for fast players?
March 15, 2009 5 Comments
There are a number of projection systems out there for predicting player performance. All of them are pretty good. They all make claims of superiority from time to time, but the clear consensus is that there is no consensus. In some ways, PECOTA could be considered the best, but CHONE, ZiPS, Marcel, and many others have their strengths. As I was looking through the projections for this year, I also wondered what the systems’ weaknesses were. One thing that I noticed was how high some of PECOTA’s batting averages were for speedy players. This year, PECOTA projects batting averages for Jose Reyes, Jimmy Rollins, and Hanley Ramirez that are more than ten points higher than those of ZiPS and CHONE.
I decided to look at this in a more scientific way. I went through the PECOTA system’s projections for 2006-2008 for the 832 players who managed 300 PA during those years. I calculated how far each player’s actual batting average exceeded his PECOTA projection. I wanted to compare this to the Speed Score that Baseball Prospectus lists with each projection. I figured that if I simply ran this regression without a control for PECOTA overestimating a player’s skill, there would be a bias (players whose speed PECOTA overestimated would have averages below their PECOTA projection). So I developed a control: the difference between a player’s actual stolen base total and PECOTA’s stolen base estimate. This should allow me to isolate whether PECOTA overestimates batting averages for speedsters, controlling for whether it accurately estimates the players’ speeds. Here are the results:
Source      SS          df     MS
Model       0.02476     4      0.00619
Residual    0.537231    827    0.00065
Total       0.56199     831    0.000676

Obs = 832, F(4,827) = 9.53, Prob>F = 0, Rsq = 0.0441, Adj Rsq = 0.0394, RMSE = 0.02549

avgPECavg     Coef.       Std.        t        P>t      95%CImin    95%CImax
sbPECsb       0.000597    0.000131    4.57     0        0.00034     0.000854
pspdtop4th    -0.00364    0.002009    -1.81    0.07     -0.00758    0.000302
yr06          0.005326    0.002177    2.45     0.015    0.001053    0.0096
yr07          -0.0028     0.002154    -1.3     0.195    -0.00702    0.001433
_cons         0.002151    0.001604    1.34     0.18     -0.001      0.005299

(avgPECavg): actual batting average minus PECOTA’s projected batting average
(sbPECsb): actual stolen bases minus PECOTA’s projected stolen bases
(pspdtop4th): indicator equal to 1 if the player’s speed score was in the top quarter of speed scores in that year (speed scores are measured on a different scale each year)
(yr06, yr07): indicators equal to 1 if the year was 2006 or 2007, respectively, to control for measurement bias by year.
The pspdtop4th coefficient is weakly statistically significant (p = 0.07), and indicates that PECOTA does in fact overrate speedsters.
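For readers who want to try this kind of specification on their own data, here is a minimal sketch in Python of an ordinary least squares fit with the same regressors. The data below are randomly generated stand-ins (the PECOTA data are proprietary), the helper name `ols` is hypothetical, and the built-in effect sizes are assumptions chosen only to resemble the magnitudes discussed here. It is not the code used for the results above.

```python
import numpy as np

def ols(y, X, names):
    """Ordinary least squares with a constant; returns {name: (coef, std_err, t)}."""
    X = np.column_stack([X, np.ones(len(y))])        # append the constant (_cons)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]                        # observations minus parameters
    sigma2 = resid @ resid / dof                     # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return {n: (b, s, b / s) for n, b, s in zip(names + ["_cons"], beta, se)}

# Synthetic stand-in data (NOT the PECOTA data, which are proprietary).
rng = np.random.default_rng(0)
n = 832
year = rng.choice([2006, 2007, 2008], size=n)
sb_minus_proj = rng.normal(0, 8, size=n)             # plays the role of sbPECsb
fast = (rng.random(n) < 0.25).astype(float)          # plays the role of pspdtop4th
yr06 = (year == 2006).astype(float)
yr07 = (year == 2007).astype(float)
# Simulated avgPECavg with an assumed 4-point shortfall for fast players.
avg_minus_proj = (0.0006 * sb_minus_proj - 0.004 * fast
                  + rng.normal(0, 0.025, size=n))

X = np.column_stack([sb_minus_proj, fast, yr06, yr07])
results = ols(avg_minus_proj, X, ["sbPECsb", "pspdtop4th", "yr06", "yr07"])
for name, (coef, se, t) in results.items():
    print(f"{name:12s} {coef: .6f} {se:.6f} {t: .2f}")
```

The 2008 seasons serve as the baseline year, which is why only yr06 and yr07 appear as indicators.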
I did specifically pick the regression that looked best to show, but for the sake of completeness, here is the regression that instead uses “pspdz”, the number of standard deviations a player’s speed score was above the mean, as a regressor. This is less significant, since it seems that PECOTA does not do a better job of projecting slow players than players with average speed.
Source      SS          df     MS
Model       0.024347    4      0.006087
Residual    0.537643    827    0.00065
Total       0.56199     831    0.000676

Obs = 832, F(4,827) = 9.36, Prob>F = 0, Rsq = 0.0433, Adj Rsq = 0.0387, RMSE = 0.0255

avgPECavg     Coef.       Std.        t        P>t      95%CImin    95%CImax
sbPECsb       0.000599    0.000131    4.57     0        0.000342    0.000856
pspdz         -0.00145    0.000891    -1.63    0.104    -0.0032     0.000299
yr06          0.005156    0.002177    2.37     0.018    0.000883    0.009428
yr07          -0.00279    0.002155    -1.29    0.196    -0.00702    0.001443
_cons         0.001231    0.001521    0.81     0.418    -0.00175    0.004217

Here, “pspdz” is not quite significant (p = 0.104), but is not far off. Since “pspdz” (the number of standard deviations the speed score is above the mean for that year) is not distributed the same way in each year, it is likely not a perfect measurement, and perhaps this is why.
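As an illustration of how the two speed regressors can be built when speed scores sit on a different scale each year, here is a short Python sketch. The function name `speed_regressors`, the use of the 75th percentile as the cutoff, and the use of the sample standard deviation are all assumptions on my part, and the speed scores are made up.

```python
import numpy as np

def speed_regressors(speed, year):
    """Return (top-quarter indicator, z-score), each computed within a year."""
    speed = np.asarray(speed, dtype=float)
    year = np.asarray(year)
    top4th = np.zeros(len(speed))
    z = np.zeros(len(speed))
    for y in np.unique(year):
        mask = year == y
        s = speed[mask]
        cutoff = np.quantile(s, 0.75)                 # 75th percentile for that year
        top4th[mask] = (s > cutoff).astype(float)     # 1 if above the cutoff
        z[mask] = (s - s.mean()) / s.std(ddof=1)      # sample standard deviation
    return top4th, z

# Hypothetical speed scores on two different yearly scales.
speed = [2.0, 4.0, 6.0, 8.0, 20.0, 40.0, 60.0, 80.0]
year = [2006] * 4 + [2007] * 4
top4th, z = speed_regressors(speed, year)
print(top4th)
print(np.round(z, 2))
```

Standardizing within a year puts the mean and spread on a common footing, but it does not make the shape of the distribution the same from year to year, which is the concern raised above.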
Clearly, model specification is an issue, but I am afraid to distribute my data since PECOTA projections are proprietary (and I assume historical ones are as well). For the sake of transparency, however, I will run alternative models on the PECOTA data for anyone who posts or emails me a request.
Moving on to 2009, I decided to compare how the top 26 basestealers as projected by PECOTA (speed score is not listed for 2009 PECOTA projections) looked compared to the CHONE, ZiPS, and Marcel projections. I dropped the players who did not have any significant amount of major league experience. Then I did the same thing for the top 26 home run hitters as projected by PECOTA, again comparing those to the CHONE, ZiPS, and Marcel projections. Sure enough, PECOTA projected the batting averages for the speedy players higher than CHONE, ZiPS, and Marcel did, but not for the home run hitters.
I would paste in the table here, but again, since PECOTA’s projections are proprietary, I will only summarize the results.
For the 26 speedsters, PECOTA was the highest of the four
systems for 14 of them. It was the
second highest for 2 of them, third highest for 2 of them, and the lowest for 8
of them. For the 26 sluggers, PECOTA was
the highest for 7, tied for the highest for 4 of them, second highest for 1 of
them, third highest for 5 of them, and the lowest for 9 of them. It estimated a batting average ten points
higher than the average of CHONE, ZiPS, and Marcel for 8 speedsters, but for
only 5 sluggers (2 of whom were Beltran and Hanley Ramirez, also speedsters).
The 8 speedsters it was ten points higher on were: Jose Reyes, Jimmy Rollins, Hanley Ramirez, Michael Bourn, Carlos Gomez, Brandon Phillips, Rickie Weeks, and Nate McLouth. It was also pretty high on Willy Taveras, Shane Victorino, Juan Pierre, and Corey Hart.
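The tallies above (how often PECOTA ranks first through fourth, and how often it sits ten points above the average of the other three) can be computed along these lines. This is only a sketch in Python: the function name `pecota_summary` is hypothetical, the three rows of projections are made up rather than real, and ties are simply given the better rank instead of being counted separately as I did for the sluggers.

```python
import numpy as np

SYSTEMS = ["PECOTA", "CHONE", "ZiPS", "Marcel"]

def pecota_summary(proj, gap=0.010):
    """proj: array of shape (players, 4), columns ordered as in SYSTEMS."""
    proj = np.asarray(proj, dtype=float)
    pecota, others = proj[:, 0], proj[:, 1:]
    # Rank of PECOTA among the four systems: 1 = highest, 4 = lowest.
    rank = 1 + (others > pecota[:, None]).sum(axis=1)
    rank_counts = {r: int((rank == r).sum()) for r in (1, 2, 3, 4)}
    # Players where PECOTA exceeds the mean of the other three by at least `gap`.
    n_gap = int((pecota - others.mean(axis=1) >= gap).sum())
    return rank_counts, n_gap

# Made-up projections for three hypothetical players (not real projections).
proj = [
    [0.300, 0.288, 0.290, 0.286],   # PECOTA highest, 12 points above the others' mean
    [0.270, 0.275, 0.268, 0.272],   # PECOTA third
    [0.250, 0.255, 0.260, 0.252],   # PECOTA lowest
]
rank_counts, n_gap = pecota_summary(proj)
print(rank_counts, n_gap)
```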
I would be cautious about trusting PECOTA on these guys. It does seem that PECOTA overestimates these hitters by a bit. By the regression estimate, it looks like fast players may get an exaggerated batting average boost of about 4 points. I would guess that each of the projection systems has its weaknesses on certain players. If it were possible to determine which types of hitters were better projected by different systems, I think that would be extremely useful to know.
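As a rough sanity check on the 4-point figure, here is a back-of-the-envelope calculation in Python. It treats the comparison as a simple difference between two group means with a common error spread, ignores the other regressors, and assumes exactly one quarter of the 832 observations were flagged as fast. Both simplifications are assumptions, so the result should only land near the regression’s reported standard error of 0.002009, not match it.

```python
import math

rmse = 0.02549          # residual RMSE reported in the first regression
n_total = 832           # observations in the sample
n_fast = n_total // 4   # assumption: exactly one quarter flagged as fast
n_rest = n_total - n_fast

# Standard error of a difference in group means with a common error SD.
se_diff = rmse * math.sqrt(1 / n_fast + 1 / n_rest)
effect = 0.00364        # magnitude of the pspdtop4th coefficient
print(f"se of difference: {se_diff:.5f}")
print(f"effect / se:      {effect / se_diff:.2f}")
```

The point of the exercise is that any one player’s average misses his projection by about 25 points, yet a gap of roughly 4 points between groups is still close to detectable once a couple hundred fast players are pooled.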
Very interesting analysis and probably something to keep in mind when using PECOTA projections.
However, being only 4 points higher is probably well within the expected precision of projections. For instance, a .300 hitter who has 550 at bats and gets 3 extra hits over the entire season will hit .305 (168 hits instead of 165).
Thanks, and that is a good point. The thing is that it’s 4 points higher across the board, and that is weakly statistically significant. I guess it’s equivalent to saying Marcel will project the .300 hitter to hit about .300 with a confidence interval centered around that, and PECOTA will project the .300 hitter to hit about .305 with a confidence interval centered around that, so even though the confidence intervals overlap, one is better.
Also, I would guess that it’s probably not 4 points for all speedy players. Maybe it’s 10 points for 40% of the speedy players, and I just haven’t played around with the data enough to figure out who.
You’re right that 4 points doesn’t sound like much though on its own.
Aren’t we falling into the old government spending aphorism that a billion here and a billion there and soon you’re talking about real money? The issue here isn’t precision (all projections have to deal with imprecision), it’s bias in the measure (or the projection, I suppose). Good work uncovering it.
If my posts over on Primer aren’t clear on the point, I wholeheartedly concur with PC’s last sentence. My kvetching is well-intentioned.
Jeff, I appreciate the criticism. I’m working on a follow-up article, and it helps me do it. I know I’m being hard on PECOTA, because it is a very good system, but I do think that systematic biases are a huge deal. I’m in the process of checking them all though.