# Who gets the credit/blame for that home run?

February 6, 2008

Do hitters hit home runs, or do pitchers give them up? Of course, the answer to that question runs both ways, but who is more to blame/credit? Pitchers occasionally throw such beautifully tantalizing hanging curveballs that even Rafael Belliard hit the occasional home run, and some hitters are so strong that they can punish even the best-placed pitch. But it brings up an interesting question: who is more in control of how far the batter hits the ball? After all, a home run is simply a fly ball that went a long way and crossed over a fence.

Here’s how I (sorta) answered the question. For 1993-1998, Retrosheet’s data files contain pretty good data on hit locations, primarily because those years were compiled by Project Scoresheet and licensed to Retrosheet. Recent Retrosheet files are much more scant in this data. Project Scoresheet recorded hit locations using a standardized rough map of zones on a ball diamond. It’s rather coarse-grained, but it takes us from being able to say that Jones flew out to center field to saying that Jones flew out to shallow (or deep) center. Once we know which zone a fly ball went to (and I selected out all balls from 1993-1998 which Retrosheet said were either pop ups or fly balls), we can get a decent approximation of how far away that is from home plate.

I assumed that all balls attributed to a zone were hit to the exact center of that zone. Of course, that’s not true, but it’s close enough for government work (some were hit a little beyond, some a little in front… it evens out). Since the Project Scoresheet grid is meant to scale to the outfield dimensions of a park, we need to know the outfield dimensions of the park in use. (The infield dimensions of all parks are set by the official rule book.) If one knows a little bit about trigonometry, it’s easy enough to get a decent guess of where the ball was hit to, if it was on the field of play. For home runs, I gave the hitter 105% of the distance to the wall over which the ball crossed. (So, a HR hit to a 360 foot power alley was estimated at 378 feet.) The 105% was nothing more than my guess.
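As a rough illustration of the estimate described above, here is a minimal sketch. The 105% home run rule and the 360-foot example come from the post; the function names, the `depth_fraction` parameter, and the idea of expressing a zone center as (x, y) feet from home plate are assumptions made for the example, not the actual Project Scoresheet grid geometry.

```python
import math

HR_BONUS = 1.05  # the post's guess: a home run is credited with 105% of the wall distance


def distance_from_xy(x_ft, y_ft):
    """Straight-line distance from home plate to a point (x, y), in feet.

    Hypothetical helper: assumes the zone center has already been placed
    on a coordinate grid with home plate at the origin.
    """
    return math.hypot(x_ft, y_ft)


def estimate_distance(wall_ft, depth_fraction=None, is_home_run=False):
    """Estimate how far a fly ball traveled, in feet.

    wall_ft: distance from home plate to the fence in the ball's direction.
    depth_fraction: assumed position of the zone center as a fraction of
        wall_ft (an illustrative stand-in for the scaled zone grid).
    """
    if is_home_run:
        return HR_BONUS * wall_ft
    if depth_fraction is None:
        raise ValueError("depth_fraction is required for balls in play")
    return depth_fraction * wall_ft


# The example from the post: a home run over a 360-foot power alley.
print(round(estimate_distance(360, is_home_run=True), 1))  # 378.0
```

Any real version would need the actual zone map and each park's dimensions; the fractions here are placeholders.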

I computed the mean estimated distance for all fly balls and pop ups hit in a season by each batter, and then turned around and did the same by pitcher. I selected out only those with at least 25 fly balls in the season in question that they either hit or had hit off of them. I subjected them to an AR(1) intra-class correlation to look at the year-to-year correlations over the six years in the data set to see if the mean distance was more consistent for pitchers or for hitters.
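The aggregation step might look something like the sketch below. The record layout (dicts with `batter`, `pitcher`, `season`, and `distance` keys) is hypothetical, and the threshold is read as a minimum of 25 fly balls, which is an assumption about the inclusion rule.

```python
from collections import defaultdict


def mean_distance_by(fly_balls, key, min_fly_balls=25):
    """Mean estimated fly-ball distance per (player, season).

    fly_balls: iterable of dicts with hypothetical keys
        'batter', 'pitcher', 'season', 'distance'.
    key: 'batter' to group by hitter, 'pitcher' to group by pitcher.
    Player-seasons with fewer than min_fly_balls fly balls are dropped.
    """
    groups = defaultdict(list)
    for fb in fly_balls:
        groups[(fb[key], fb["season"])].append(fb["distance"])
    return {
        player_season: sum(dists) / len(dists)
        for player_season, dists in groups.items()
        if len(dists) >= min_fly_balls
    }
```

Calling it once with `key="batter"` and once with `key="pitcher"` on the same list gives the two sets of player-season means that then go into the correlation.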

ICC for pitchers = .312

ICC for batters = .612

Batters are fairly consistent from year to year in how far their average fly ball travels. Pitchers are less so, but still have some level of consistency from year to year. It seems that both share some blame/credit for the distance on a fly ball. This might explain why batters’ seasonal rates of HR/FB were more stable than pitchers’ rates. For those unfamiliar with this methodology, you can interpret those numbers in much the same way as a year-to-year correlation coefficient (although this method is better, as it allows for multiple data points). There *are* some batters who are powerful (i.e., they hit the ball a long way) and some who are not, and that power level is pretty consistent from year to year. Pitchers who give up fly balls (and all of them, save Fausto Carmona, occasionally give up a fly ball) do have some (not a lot, but it’s there) repeatable skill in whether they tend to give up short fly balls or long fly balls. For those GMs nervous about signing that fly ball pitcher because he might give up a bunch of home runs, you can check his average fly ball distance (and perhaps his standard deviation), perhaps look at it by field, and plug in a few numbers to at least give you a *little* better projection for how many HR he might give up next year, although the error of prediction is still likely to be rather high.

Let’s play around with this a bit more from the batter’s perspective. I looked at the average distances for balls hit to the batter’s pull field, opposite field, and center field. I upped the inclusion criteria to 50 FB in the season in question. Again, I looked at ICC over the six seasons in the data set. (Anything in the grid with an “8” in it was “center field”, so that includes the power alleys.)
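The pull/center/opposite split can be sketched as follows. The "anything with an 8 is center field" rule is from the paragraph above; the zone strings such as `"78D"` or `"9S"` are illustrative stand-ins for grid codes, and `bats` is assumed to be the side the batter hit from in that plate appearance.

```python
def field_of(zone):
    """Map a zone code to 'left', 'center', or 'right' (None if not outfield).

    Any code containing '8' counts as center, so the power alleys
    (codes like '78' or '89') land in center field.
    """
    if "8" in zone:
        return "center"
    if "7" in zone:
        return "left"
    if "9" in zone:
        return "right"
    return None


def pull_center_opposite(zone, bats):
    """Classify a fly ball as 'pull', 'center', or 'opposite'.

    bats: 'R' or 'L'. A right-handed batter pulls the ball to left field;
    a left-handed batter pulls it to right field.
    """
    side = field_of(zone)
    if side is None or side == "center":
        return side
    pull_side = "left" if bats == "R" else "right"
    return "pull" if side == pull_side else "opposite"
```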

ICC for pull field = .239

ICC for center field = .591

ICC for opposite field = .359

Batters are much more consistent in how far they hit the ball to center field (and the power alleys), and are actually more consistent in how they hit the ball to the opposite field than to their pull field. So if you want to get a good idea of how a player will hit for power, take a look at what he does gap to gap. That’s going to be the most consistent measure.

Something that could be useful for a GM is looking at Hit Tracker’s data for the “just enough” and “no doubt” home runs by the pitcher. A pitcher who gives up a disproportionate number of lucky or just enough home runs will probably regress to the mean and give up fewer home runs the next season. The opposite obviously applies to lucky pitchers. This doesn’t solve the question of “who has the control?” but it would be useful nevertheless.

PC,

You had a little bit of a tutorial on ICC linked to above, but I have a couple questions regarding it:

One, is there a (simple) statistical package that easily computes it? Can it be done in Excel (not that I use Excel)?

Can you give an example of using the log of the odds ratio of rate variables? So rather than using HR per BIP, for example, you would compute the odds ratio of the HR and the BIP, then take the log of that (base 10 or natural or does it matter?) and use that for your variable? Is there a simple answer why that is better? Is it much better?

When I do y-t-y correlations (using simple linear regressions of course), I can take a large data set, like 99-07, and run y-t-y for each year pair, 99/00, 00/01, 01/02, etc., although I usually do 99/00 and then 01/02, 03/04, etc. It’s not like you HAVE TO use only one pair of years (I don’t know why you suggest that you can’t use an entire data set!). Is doing that (using all years) so much worse than doing an ICC? It seems like it is almost the same thing. You could also do what BP likes to do, which is to use odd years versus even years. Again, is that much worse than ICC?

Thanks in advance for the response!

BTW, I don’t like using words like “consistency” when describing correlations, be it y-t-y or ICC. I mean, if there is absolutely no spread of talent in a population (all variance is on account of luck), then players would still be “consistent” from year to year. If you had large enough samples in each year, players would bunch around the mean (everyone has the same true mean) in year 1 and then bunch around the mean in year 2 also. They would be quite “consistent.” It would just be that there would be no relationship between any of the variance in year 1 and year 2 (and deviation from the mean in year 1 would have no predictive value).

I prefer to say that the correlations tell us the likely spread of talent within the population, not whether a player is “consistent” or not from year to year with respect to that stat, although we are saying the same thing. If you want to talk about “consistency,” it should be mentioned/explained that what you mean is how well any deviation from the mean in one year predicts the next year. Again, if there is little deviation from the mean in both years, it suggests two things: One, there is almost no skill in the stat. Two, the sample sizes in each year are large enough to keep the random spreads down to a minimum. But there still might be lots of “consistency” even though there is no spread of skill in the population.

The converse holds true. There might be a large spread of skill in the population, but if the sample sizes are small enough in each year, there will be little “consistency” and it will look like there is a lot of luck.

We have discussed this before on The Book blog, but we can NEVER talk about the ratio of skill to luck unless we specify the sample sizes we are talking about. At large enough sample sizes, almost EVERYTHING (including BABIP for pitchers) is ALL skill and no luck, and with small enough sample sizes, everything (even BB and K rates) is ALL luck and no skill. There is NO inherent luck/skill ratio for ANY stat without defining the sample size!
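The sample-size point can be made concrete with a small sketch. It assumes a deliberately simple model that is not taken from the discussion itself: each player's observed rate is true talent plus binomial noise, so the expected correlation between two independent samples is talent variance divided by talent variance plus noise variance. The function name and the input numbers are made up for illustration.

```python
def expected_correlation(talent_sd, mean_rate, n_trials):
    """Expected correlation between two independent samples of n_trials each.

    Assumes observed rate = true talent + binomial noise, with the noise
    variance approximated by mean_rate * (1 - mean_rate) / n_trials.
    """
    talent_var = talent_sd ** 2
    noise_var = mean_rate * (1.0 - mean_rate) / n_trials
    return talent_var / (talent_var + noise_var)


# Illustrative, made-up inputs: a league rate of .300 and a talent SD of .010.
for n in (100, 1000, 10000, 100000):
    print(n, round(expected_correlation(0.010, 0.300, n), 3))
```

With the spread of talent held fixed, the same stat looks like mostly luck at small sample sizes and mostly skill at large ones, and a stat with no spread of talent correlates at zero at any sample size.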

Fair warning reader: if you’re just here for the baseball, skip this comment.

MGL, I’m not sure whether Excel can do ICC. I doubt it, but then again, I have very little experience with Excel. I use SPSS (Statistical Package for the Social Sciences… I am a psychologist after all!), which calculates it with a few clicks of the mouse. I believe SAS can do that as well, although SAS is a nightmare program. ICC is a small piece of a family of analyses known as mixed linear models, so you’d probably need something more specialized.

The reason for taking the log of the odds ratio is that many statistical tests (at least inferential hypothesis ones) assume a normal distribution of the variables. Rates and percentages are not normally distributed, but taking the log of the odds ratio makes the distribution normal enough to work with. The natural log is preferred because the equation of the normal curve has the natural base (e, approximately 2.718) in it. Best to keep things in the family.
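Since an example was requested, here is a minimal sketch of the transform. It reads “odds ratio” as the odds of the event, successes divided by failures (for instance HR against non-HR balls in play), which is an assumption about what is meant; the function name is hypothetical.

```python
import math


def log_odds(successes, trials):
    """Natural log of the odds: ln(successes / failures).

    Equivalent to ln(p / (1 - p)) with p = successes / trials.
    Undefined when successes is 0 or equals trials.
    """
    failures = trials - successes
    if successes <= 0 or failures <= 0:
        raise ValueError("log odds is undefined for rates of exactly 0 or 1")
    return math.log(successes / failures)


# Illustrative numbers only: 10 HR on 100 fly balls.
print(round(log_odds(10, 100), 4))
```

Switching to base 10 would only rescale every value by the same constant, so correlations computed on the transformed variable would be unchanged.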

The reason that ICC outstrips using a huge dataset with multiple data pairs (so Clemens from 00-01, 01-02, 02-03) is twofold. One is that using overlapping data introduces some ugly independence problems. Clemens is in there three times and his 2001 season is in there twice. The ideal bivariate correlation should have complete independence of its observations. Odd-even split-half techniques like BP uses solve for the overlapping years, but don’t solve for the same player being in there more than once. To solve that problem, you need more than the 2×2 covariance matrix that the simple Pearson bivariate model gives you. ICC allows you to do that. It’s not that yty will give you awful results. We generally traffic in questions for which we just need a “close enough” answer, and yty is easier to calculate.
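For anyone without SPSS, the simplest member of the ICC family can be computed by hand. The sketch below is a one-way random-effects ICC for balanced data (every player has a value in every season). It is a plainly simpler stand-in, not the AR(1) mixed-model version used for the numbers in the post, and the function name is hypothetical.

```python
def icc_oneway(rows):
    """One-way random-effects ICC for balanced data.

    rows: one list per player, each holding one value per season
        (all lists the same length, at least two seasons).
    Returns (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the
    between-player and within-player mean squares and k is seasons per player.
    """
    n = len(rows)      # number of players
    k = len(rows[0])   # seasons per player
    grand_mean = sum(sum(r) for r in rows) / (n * k)
    player_means = [sum(r) / k for r in rows]
    ss_between = k * sum((m - grand_mean) ** 2 for m in player_means)
    ss_within = sum((x - m) ** 2 for r, m in zip(rows, player_means) for x in r)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

Each player contributes one row with all of his seasons, so nobody is counted as several separate pairs the way Clemens is in the stacked year-to-year approach.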

The other reason is that the particular ICC technique that I use (there are many, I use AR1, or auto-regressive first order) is derived from a covariance matrix that incorporates auto-regressive terms. This means that it looks at the development of the stat in sequential order. We can expect that as young players get older, there will be some improvement from year to year in their stats due to growing up. AR1 adjusts for that and then takes a look at the variance once that’s accounted for.
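One way to picture the AR(1) covariance structure: the correlation between two seasons is a single parameter raised to the power of the number of seasons separating them, so adjacent years are more alike than years far apart. The sketch below just builds that correlation matrix; the value of `rho` is illustrative, and this is not the fitting procedure SPSS runs.

```python
def ar1_correlation_matrix(rho, n_seasons):
    """Correlation matrix implied by a first-order auto-regressive structure.

    Entry (i, j) is rho ** |i - j|: 1 on the diagonal, rho for adjacent
    seasons, rho squared for seasons two apart, and so on.
    """
    return [
        [rho ** abs(i - j) for j in range(n_seasons)]
        for i in range(n_seasons)
    ]


# Six seasons, as in the 1993-1998 data set, with an illustrative rho of 0.5.
for row in ar1_correlation_matrix(0.5, 6):
    print([round(v, 3) for v in row])
```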

As to the word “consistent”, let’s say that we had two variables where R = 0. No correlation at all. All variance is luck and there’s no skill involved. Sure, everyone would cluster around the mean and eventually settle into the mean (or if there’s a skill involved, to their true skill level) given a billion trials. The problem is that’s not how baseball seasons are played. I don’t often come out specifically and say it, but I usually work within the confines of the player-season since that’s the most common unit of analysis in baseball (despite there being variation in how many PA’s or BF’s each batter/pitcher gets). I assume that it’s clear from context, but perhaps I should be a touch more explicit.

Thanks!