Hit 'Em Where They Ain't

More than 25 years ago Bill James introduced Defense Efficiency Rating as the percentage of batted balls each team turns into an out. There were just a couple of simple concepts. Every ball put into play (except for those hit too high on or over the fence) is an opportunity for an out, and the number of outs made is divided by the number of opportunities. It did not matter whether a batter reached by a hit or an error, because either way the defense had failed to convert the out. There was no judgement involved, as the batter was either on base or put out.
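The team-level calculation described above can be sketched in a few lines. This is a minimal illustration, not James's published formula: the result labels ("out", "single", "error", "home_run") and the sample list are hypothetical, and home runs stand in for the balls "over the fence" that are not counted as opportunities.

```python
def defense_efficiency(results):
    """Share of balls in play that the defense converted into outs.

    `results` is a list of hypothetical result labels, one per batted ball.
    Home runs are dropped because the defense had no chance at them; hits
    and errors both count as failures to convert the out.
    """
    in_play = [r for r in results if r != "home_run"]
    if not in_play:
        raise ValueError("no balls in play")
    outs = sum(1 for r in in_play if r == "out")
    return outs / len(in_play)


# Hypothetical sample: 10 batted balls, one of them a home run.
sample = ["out", "single", "out", "error", "out",
          "home_run", "double", "out", "out", "out"]
der = defense_efficiency(sample)
print(f"DER = {der:.3f}")  # 6 outs on 9 balls in play -> DER = 0.667
```

Note that the error and the single are treated identically, which is the point of the measure: no scorer's judgement is needed.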
I wondered whether this principle could be applied to individual fielding. Of all those balls in play, the first question should be “Who?” – just identify which defensive player had the best chance of making the out on each of them. The only judgement involved was which fielder the ball was closest to. The next question is “Where?”, to help determine the degree of difficulty. However, this could not be done at the time, as play-by-play data was not yet publicly available. Many play-by-play defensive metrics have been developed over the past few years, but they are still limited by the level of detail present. When deciding which data to record, we must keep in mind what we are trying to accomplish. When needed data is not recorded, we later have to estimate what was skipped.
When designing a data collection system, it’s important to keep it simple. Maximize the utility, while minimizing the complexity. Even today, the who and where are not always explicitly recorded for each and every play in the data that’s available to the public. Here are the first items I would recommend for improvement.
Eliminate Split Zones
A hit recorded as a ground single to left field, between short and third (zone 56), is an example of missing data that is referred to as a “split zone”. Because the responsible fielder was not recorded, we have to construct an estimate to split the responsibility between the shortstop and the third baseman. This would not be necessary, and we would have a more accurate measure, if the responsible fielder were recorded at the time of the play.
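To make the cost of a split zone concrete, here is one way such an estimate might be built. The article does not say how the split is constructed, so the rule used here (charging hits in proportion to each fielder's share of the outs made in that zone) is purely an assumption, as are the function name and all of the counts.

```python
def split_zone_shares(outs_by_fielder):
    """Estimate each fielder's share of responsibility for a split zone.

    Assumption: responsibility is proportional to the outs each fielder
    recorded in the zone. This is one possible rule, not an established one.
    """
    total = sum(outs_by_fielder.values())
    if total == 0:
        raise ValueError("no outs recorded in this zone")
    return {fielder: n / total for fielder, n in outs_by_fielder.items()}


# Hypothetical counts for zone 56 (between short and third).
zone_56_outs = {"SS": 30, "3B": 10}
zone_56_hits = 8

shares = split_zone_shares(zone_56_outs)
charged_hits = {f: zone_56_hits * s for f, s in shares.items()}
print(shares)        # {'SS': 0.75, '3B': 0.25}
print(charged_hits)  # {'SS': 6.0, '3B': 2.0}
```

Every hit ends up charged fractionally to two players, which is exactly the inaccuracy that recording the responsible fielder at the time of the play would remove.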
Where Did It Land?
GameDay records the location where the ball was retrieved, and which fielder retrieved it. If you don’t also record where the ball landed, there can be distortions. Did a double land on the warning track, or did it roll there? We know how far the fly ball outs went, but not the fly ball hits. This has a great impact on determining which flies would be home runs in different ballparks.
How to Fix It
It would be great if the folks at MLB could make changes in the data that is recorded in GameDay. If not, an alternative could be a small-scale “Project Scoresheet”. mlb.com makes its video package available for $15 a month, and $15 for the off-season. The Condensed Games feature shows most if not all of the batted ball hits in 10 to 12 minutes of video. After the play by play has been processed into an easy-to-read form, the missing data could be entered by a fan. At 4 games per hour, one team’s home season could be completed in approximately 20 hours, short enough for one person to be able to process.
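The 20-hour figure can be checked with a short calculation. The article gives the rate of 4 games per hour; the 81-game home schedule is an assumption added here, based on the standard 162-game season.

```python
HOME_GAMES = 81      # assumption: half of a standard 162-game season
GAMES_PER_HOUR = 4   # rate given in the article

minutes_per_game = 60 / GAMES_PER_HOUR
hours_per_home_season = HOME_GAMES / GAMES_PER_HOUR

print(minutes_per_game)       # 15.0
print(hours_per_home_season)  # 20.25
```

Fifteen minutes per game leaves a few minutes for data entry on top of the 10 to 12 minutes of condensed video.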
Suggested List of Batted Ball Data Fields
The first two are not explicitly recorded by GameDay or coded as fields in Retrosheet:

  • Responsible fielder – can be the same as the retrieving fielder
  • Location where the ball could be fielded for an out – for flyball outs or infield grounders, is the same as retrieved locations
  • Type of batted ball (ground, fly, pop, etc)
  • How hard hit – currently described as sharp, normal or soft, but could be expressed as time to fielder or speed off bat
  • Retrieving fielder
  • Location where ball was retrieved
  • Result of batted ball (out, error, single, double, triple, home run)
  • Level of difficulty – optional, such as in or out of zone, but can be calculated as mean rate of success of all balls with same parameters
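The suggested fields map naturally onto a simple record type, and the optional level of difficulty can then be derived rather than entered by hand. The sketch below is only an illustration: the class name, field names, location codes and sample plays are all hypothetical, and the choice of which fields count as the "same parameters" is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BattedBall:
    responsible_fielder: str  # who had the best chance at the out
    fieldable_location: str   # where the ball could be fielded for an out
    ball_type: str            # ground, fly, pop, ...
    hardness: str             # sharp, normal or soft
    retrieving_fielder: str
    retrieved_location: str
    result: str               # out, error, single, double, triple, home run


def success_rates(balls):
    """Mean out rate for each group of balls sharing the same parameters."""
    tally = defaultdict(lambda: [0, 0])  # key -> [outs, chances]
    for b in balls:
        key = (b.responsible_fielder, b.fieldable_location,
               b.ball_type, b.hardness)
        tally[key][1] += 1
        if b.result == "out":
            tally[key][0] += 1
    return {key: outs / chances for key, (outs, chances) in tally.items()}


# Hypothetical plays; the location codes are made up for illustration.
plays = [
    BattedBall("SS", "6", "ground", "normal", "SS", "6", "out"),
    BattedBall("SS", "6", "ground", "normal", "SS", "6", "out"),
    BattedBall("SS", "6", "ground", "normal", "LF", "7", "single"),
    BattedBall("LF", "7D", "fly", "sharp", "LF", "7D", "double"),
]
rates = success_rates(plays)
```

Here two of the three normal grounders hit to the shortstop were converted, so that group has a success rate of about 0.667, while the lone sharp fly to deep left has a rate of 0.0. Keeping the fieldable location separate from the retrieved location is what lets the hit to left field still be grouped with the shortstop's chances.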

    5 Responses to Hit 'Em Where They Ain't

    1. Brian Cartwright says:

      I would hope that if there was at least one person per major league team, there would not be very much time per person. I’m also hopeful that mlb.com might be convinced to let the GameDay operator do it.
      I purposely avoided mentioning any particular fielding system in this article. My point was, and I can show later, that any system is hampered when it lacks data. These were the items that I thought were the most important, but also the easiest to include.
      We need to aim for the maximum detail, while still being simple and efficient.
      Precise, granular data can always be merged when that level of precision is not required, but splitting less precise, coarse data is always less accurate and more time consuming when the detail is required.

    2. Pizza Cutter says:

      So, I could spend my off-season watching a bunch of Indians games… for science sweetie, for science.

    3. dan says:

      500 hours of data calculation? If one person did it each day during the baseball season, it would be about 2 hours and 45 minutes per day (181 or 182 days, I forget).
      Let’s say we searched the country for every person who created their own fielding system… how many would that be? 15? 30? If there are 15 people willing to spend 10 to 12 minutes a day of volunteer work (on average), then every game would be covered.

    4. Colin Wyers says:

      I would still submit that the best piece of data to collect is the location of the fielder when the ball was hit.
      Now, the camera angles on Condensed Games won’t show us that. But I’m not entirely certain they’ll always tell us who the responsible fielder should have been – if you don’t know where the third baseman and shortstop were positioned, how do you know which one was responsible? Whether or not the third baseman is playing in against the bunt definitely changes what balls he should and shouldn’t be expected to field for an out.

    5. Brian Cartwright says:

      Colin, I agree that is something worthwhile, but it’s not as important as knowing who the ball was hit to.
      DER, as a team stat, counts all balls on the field. If we want to extend it to the individual level, the first question should be Who?, and then Where? Our data currently tells us where but not always who.
      Once we know who and where, then the fielder’s position becomes important to know how far he had to go to get to where. We could establish a bubble around each fielder, with different sizes and shapes for each individual.
