Hit ‘Em Where They Ain’t
September 10, 2008
More than 25 years ago, Bill James introduced Defense Efficiency Rating: the percentage of batted balls in play that a team turns into outs. It rested on a couple of simple concepts. Every ball put into play (except for those hit too high on the fence, or over it) is an opportunity for an out, and the rating divides the outs actually made by those opportunities. It did not matter whether a batter reached by a hit or an error, because either way the defense had failed to convert the out. There was no judgement involved, as the batter was either on base or put out.
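As a minimal sketch of the ratio described above, the calculation might look like the following. The function name and the season totals are hypothetical, chosen only for illustration, and the published formula estimates balls in play from standard box-score totals rather than taking them as direct counts.

```python
def defense_efficiency(outs_on_balls_in_play: int, balls_in_play: int) -> float:
    """Share of balls in play that the defense converted into outs."""
    if balls_in_play <= 0:
        raise ValueError("balls_in_play must be positive")
    return outs_on_balls_in_play / balls_in_play


# Hypothetical team totals (not real data).
balls_in_play = 4400
hits_in_play = 1290      # hits on balls in play, home runs excluded
reached_on_error = 60    # counted as failures, exactly like hits

# Every ball in play that did not put the batter on base was an out.
outs = balls_in_play - hits_in_play - reached_on_error
der = defense_efficiency(outs, balls_in_play)
print(round(der, 3))  # 0.693
```

Note that hits and errors are subtracted together: the scorer's judgement about which was which never enters the result.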
I wondered whether this principle could be applied to individual fielding. Of all those balls in play, the first question should be “Who?”: simply identify which defensive player had the best chance of making the out on each of them. The only judgement involved is which fielder the ball was closest to. The next question is “Where?”, which helps determine the degree of difficulty. However, this could not be done at the time, as play-by-play data was not yet publicly available. Many play-by-play defensive metrics have been developed over the past few years, but they are still limited by the level of detail present in the data. When deciding which data to record, we must keep in mind what we are trying to accomplish. Any needed data that goes unrecorded has to be estimated later.
When designing a data collection system, it’s important to keep it simple. Maximize the utility while minimizing the complexity. Even today, the who and where are not always explicitly recorded for each and every play in the data that’s available to the public. Here are the first items I would recommend for improvement.
Eliminate Split Zones
A hit recorded as a ground single to left field, between short and third (zone 56), is an example of missing data, referred to as a “split zone”. Because the responsible fielder was not recorded, we have to construct an estimate to split the responsibility between the shortstop and the third baseman. This would not be necessary, and we would have a more accurate measure, if the responsible fielder were recorded at the time of the play.
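To illustrate the kind of estimate a split zone forces, here is a rough sketch that charges each hit in a split zone evenly to the two adjacent fielders. The even 50/50 split, the zone table, and the function name are all assumptions made for the example; they are not taken from any particular metric, and a real system might weight the split differently.

```python
from collections import defaultdict

# Hypothetical zone table: one-digit zones belong to a single fielder,
# while the two-digit "split" zone 56 lies between third base and shortstop.
ZONE_FIELDERS = {
    "5": ["3B"],
    "6": ["SS"],
    "56": ["3B", "SS"],
}


def charge_hits(hit_zones):
    """Charge each hit to the fielders for its zone, split evenly (an assumption)."""
    charged = defaultdict(float)
    for zone in hit_zones:
        fielders = ZONE_FIELDERS[zone]
        share = 1.0 / len(fielders)
        for fielder in fielders:
            charged[fielder] += share
    return dict(charged)


# Three singles through the 5-6 hole, plus one clearly at each fielder.
print(charge_hits(["56", "56", "6", "5", "56"]))  # {'3B': 2.5, 'SS': 2.5}
```

The fractional charges are the estimate; if the responsible fielder had been recorded on each play, every hit would be charged in full to one player.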
Where Did It Land?
GameDay records the location where the ball was retrieved, and which fielder retrieved it. If the spot where the ball landed is not also recorded, there can be distortions. Did a double land on the warning track, or did it roll there? We know how far the fly ball outs went, but not the fly ball hits. This has a great impact on determining which flies would be home runs in different ballparks.
How to Fix It
It would be great if the folks at MLB could make changes to the data that is recorded in GameDay. If not, an alternative could be a small-scale “Project Scoresheet”. mlb.com makes its video package available for $15 a month, and $15 for the off-season. The Condensed Games feature shows most if not all of the batted ball hits in 10 to 12 minutes of video. After the play-by-play has been processed into an easy-to-read form, the missing data could be entered by a fan. At 4 games per hour, one team’s home season could be completed in approximately 20 hours, short enough for one person to process.
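The 20-hour figure can be checked with a quick calculation. This sketch assumes the standard 81 home games in a 162-game regular season and uses the rate of 4 games per hour stated above; the variable names are illustrative.

```python
HOME_GAMES = 81       # home games in a 162-game regular season
GAMES_PER_HOUR = 4    # processing rate stated in the text

hours_per_team = HOME_GAMES / GAMES_PER_HOUR
print(hours_per_team)  # 20.25
```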
Suggested List of Batted Ball Data Fields
The first two are not explicitly recorded by GameDay or coded as fields in Retrosheet