Baserunning and Its Dependence on Environment
April 20, 2007
First of all, my apologies for the long hiatus since my last post. I was busy writing what turned out to be a 29-page senior thesis on the lack of skill of current US numerical forecast models in predicting the path and weather impacts of Nor’easters, and then presenting my findings to a general scientific audience. Not an easy couple of weeks, but things get a little easier from here, so I should have some time to comment more regularly again.
On the matter proposed in the title of this post, I’ve been working for some time on refining a method for using play-by-play (PBP) event data to rate every aspect of baseball performance, and one of the most difficult areas to assess is baserunning. It’s difficult because there are frequently multiple baserunners, and because the result of a baserunning play depends heavily on the batted ball trajectory or direction on the field, the skill of the other runners, and the skill of the fielders. In general, however, I plan to rate baserunning with the same method that Tom Ruane pioneered several years ago using a smaller data set (1973-1992 only), with a few important changes. The method goes something like this:
- Given a starting base/out state, a batted ball trajectory, and a basic event type, find the average resulting run expectancy after all similar plays conclude.
- Figure out the run expectancy after this particular play.
- Charge differences to the runners based on repeatable methods of distributing those differences.
For a single baserunner and a typical ball in play, this is fairly straightforward. Say the guy on first gets to third 25% of the time on average on a single, and on this play he got to third. You would find the run expectancy with him on third and the run expectancy with him only on second, weight the two by how often each happens to get the average final run expectancy, subtract that average from the run expectancy of runners at the corners, and there you go.
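As a rough sketch of the three steps for the single-runner case, something like the following Python could work. The run expectancy numbers, the record fields (`start`, `trajectory`, `event`, `end`, `runs`), and the function names are all made up for illustration; this is not Tom Ruane’s code and these are not real run expectancy values. Counting runs that score on the play as part of the play’s value is also an assumption of the sketch.

```python
from collections import defaultdict

# Illustrative run expectancy by (runners, outs). These numbers are invented.
RUN_EXP = {
    ("1-3", 0): 1.80,  # runners at the corners, no outs
    ("12-", 0): 1.45,  # runners at first and second, no outs
}

def end_value(play):
    """Run expectancy after the play plus any runs scored on it (assumption)."""
    return RUN_EXP[play["end"]] + play["runs"]

def group_key(play):
    """Plays are 'similar' if they share start state, trajectory, and event."""
    return (play["start"], play["trajectory"], play["event"])

def baselines(plays):
    """Step 1: average end value over all similar plays."""
    totals = defaultdict(lambda: [0.0, 0])
    for play in plays:
        bucket = totals[group_key(play)]
        bucket[0] += end_value(play)
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def runner_credit(play, baseline):
    """Steps 2 and 3 with one runner: this play's value minus the average."""
    return end_value(play) - baseline[group_key(play)]

def make_play(end):
    """Hypothetical record: runner on first, no outs, ground ball single."""
    return {"start": ("1--", 0), "trajectory": "G", "event": "1B",
            "end": end, "runs": 0}

# Runner reaches third on one single in four (25%), stops at second otherwise.
sample = [make_play(("1-3", 0))] + [make_play(("12-", 0))] * 3
avg = baselines(sample)
print(runner_credit(sample[0], avg))  # credit for taking third
print(runner_credit(sample[1], avg))  # debit for stopping at second
```

With only one runner, the whole difference goes to him. The hard part, splitting that difference among several runners, is not attempted here.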
For multiple runners, it starts getting complex. If there are runners at first and second and a single is hit, the runner from first can only go to third if the runner from second tries to score. In short, the lead runner who can be forced sets the tone for the baserunners behind him, and among the runners who cannot be forced, the lead runner sets the tone for the followers. With runners at second and third, for example, the runner at second can only tag and go to third on a fly ball if the runner on third tagged. Or more generally, the most advanced baserunner is more important than the one behind him, who is more important than the one behind him, who is more important than the batter.
As such, I believe the best way to rate baserunning depends on something called conditional probability. You would phrase a question like this: “Given that the runner on second scored on this single, what is the probability that the batter reached second on a fielder’s choice throw home?”
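To make that kind of question concrete, here is a minimal sketch of estimating a conditional probability by counting. The records and field names (`runner_dest`, `batter_dest`) are invented for illustration and do not come from any real PBP file.

```python
def conditional_probability(plays, given, outcome):
    """Estimate P(outcome | given) from play records.

    Returns (probability, sample_size); probability is None when no play
    satisfies the condition.
    """
    matching = [play for play in plays if given(play)]
    if not matching:
        return None, 0
    hits = sum(1 for play in matching if outcome(play))
    return hits / len(matching), len(matching)

# Invented records for singles hit with a runner on second.
# runner_dest: where the runner from second ended up.
# batter_dest: where the batter ended up.
plays = [
    {"runner_dest": "home", "batter_dest": "2B"},
    {"runner_dest": "home", "batter_dest": "1B"},
    {"runner_dest": "home", "batter_dest": "1B"},
    {"runner_dest": "home", "batter_dest": "2B"},
    {"runner_dest": "3B", "batter_dest": "1B"},
]

prob, n = conditional_probability(
    plays,
    given=lambda p: p["runner_dest"] == "home",
    outcome=lambda p: p["batter_dest"] == "2B",
)
print(prob, n)
```

Returning the sample size along with the probability makes it easy to see when a condition is so specific that only a handful of plays match it.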
This approach comes with problems though. For rare events (for example, bases loaded, one out, a ground ball single is hit, the first two runners score, the runner from first is thrown out at third, the third baseman then tries to throw out the batter who is advancing to second on the throw to third and lobs the ball into right field, allowing the batter to score the third run of the play), conditional probabilities get all blowed up, as you can imagine. How many times does the runner from first get thrown out trying for third from a bases loaded/one out starting state on a ground ball single…let alone all of the other crazy stuff I mentioned happening after that? In all 49 years of PBP availability, the exact play I just described has happened 11 times.
To combat this problem of small sample sizes without giving up on conditional probability, I thought I could assume that, while the rate at which batting events occurred changed from league to league, thus affecting the run scoring environment, the state-to-state probabilities probably didn’t change much for any given event. You’re just as likely to go from first to third on a single now as you were in 1968.
I thought wrong.
I tested that assumption using a very simple condition: fewer than two outs, runner at first (no other runners), and the batter hits a ground ball single. That’s it. What I found was documented in this article over at detectovision.com: http://detectovision.com/?p=1027
Suffice it to say, I am now convinced that I need to account for the linear correlation between run scoring rate and baserunning probabilities if I want to keep using the entire PBP database as my sample, rather than individual leagues (to keep sample sizes fairly big), without losing accuracy. I’d be interested in some of your thoughts as to what the best approach to this problem might be.
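For what it’s worth, one simple version of that idea might look like the sketch below: fit a straight line of a baserunning rate against league runs per game across league-seasons, then read off the expected rate for any given run environment. This is only one possible approach under stated assumptions. The league-season numbers are invented, the function names are hypothetical, and nothing here implies the direction or size of the real relationship.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def expected_rate(runs_per_game, intercept, slope):
    """Predicted probability for a run environment, clamped to [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * runs_per_game))

# Invented league-season data: runs per game and the first-to-third rate
# on ground ball singles. Not real numbers.
runs_per_game = [3.4, 3.9, 4.3, 4.7, 5.1]
first_to_third = [0.31, 0.29, 0.28, 0.25, 0.24]

intercept, slope = fit_line(runs_per_game, first_to_third)
print(expected_rate(4.0, intercept, slope))
```

The fitted line would then stand in for the single pooled probability, so the full database still supplies the sample while each league gets a baseline suited to its own scoring level. A weighted fit that accounts for the number of plays in each league-season would probably be a sensible refinement.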