I’ve received a lot of e-mail about my Sammy Sosa article on The Hardball Times telling me that I must be a Sosa apologist, and that I’m ignoring the obvious fact that he used steroids. Here’s what’s funny about this: My original intent with the article was to show that he had indeed used ’roids. Then I started doing some research, and, lo and behold, I found little evidence to back that claim up. So I published what I found. Now, I must be a Sosa apologist. Sabermetrics is about looking for the truth, not for evidence that supports your opinion. And the fact is, it doesn’t look like Sosa used steroids. But I guess that just makes me an apologist.


Slammin’ Sammy

I have an article up on The Hardball Times today about Sammy Sosa. My opinion is that Sosa was probably not a steroid user. Why? Read the article. Among the things I forgot to mention in it, but which are nonetheless important, are that Sosa did not test positive last year (or in 2003, according to Jay Jaffe’s sources) and that he has not been connected to any steroid factory, as Barry Bonds has been to BALCO. So what’s your opinion? Do you think Sosa was a user, or clean? If he was a user, is he still a Hall of Famer? How much do you discount his accomplishments?

A query on equality

Pardon the silence, folks. Now that we’ve all had enough time to digest David’s excellent work on defensive metrics, I wanted to shift gears a bit and pose a question. This is part of a longer project I’m currently working on, and I would appreciate any thoughts you all might have here.
The question is this: How would you define parity in Major League Baseball?
I’m not looking for any one answer. If you want to give me a general answer, fine. If you want to give me a statistical answer, also fine. I’m just looking right now at what the range of opinion is among baseball fans on a topic that is sure to come up as the Basic Agreement negotiations near. I do appreciate any responses you have for me.

A Detailed Comparison of Defensive Metrics

Today on The Hardball Times, I have an article up about various defensive metrics. This is a continuation of that article, a comparison of those metrics mentioned there to edited UZR.
First of all, let me explain why I’m running this test, and why I chose the parameters that I did. I think that UZR is the gold standard of all defensive metrics. It’s occasionally wrong, but it is the best-constructed system that I know of. So, in my opinion, all other defensive metrics should be evaluated against UZR, until one that is equal or better in design comes out. That’s why I’m comparing these metrics to UZR. The metrics I will refer to in this article, by the way, are: the Probabilistic Model of Range (PMR), David Pinto’s metric (I’m testing both of his models); Zone Rating (ZR), converted into runs above average by Chris Dial; Davenport Fielding Translations (DFT), Clay Davenport’s fielder ratings on Baseball Prospectus; and Range, which is my system.
I am, however, editing UZR before I do any tests. Just based on standard error, UZR will occasionally be off; this isn’t an error in the construction of the system so much as a mathematical certainty. So, in my opinion, there’s nothing wrong with editing UZR ratings if there is substantial reason to suggest that a UZR rating is incorrect. What’s tougher is deciding on a method of editing: It’s important to choose a method that is both stringent and objective. Here’s what I did.
First, I standardized all ratings using z-scores. A z-score measures how many standard deviations away from the mean a number is. Because the different systems have different spreads in their ratings (Range ratings can get a little too high and a little too low, while Zone Rating is bunched near the mean), it’s not a good idea to use their run ratings in any kind of editing system. Z-scores allow us to avoid that. Next, I averaged the two PMR models’ z-scores for the player in question, so that PMR does not have a disproportionately large impact on the editing algorithm. I then added that to the z-scores from the other three methods being compared, and divided by 4 for an average z-score. Finally, I subtracted the UZR z-score from that average z-score. If the result is more than 1 in either direction, I remove that player from the sample, as that is my cutoff point. In essence, what that means is that if a player’s UZR is more than 1 standard deviation away from his average rating among the other systems, he is removed from the sample. While 1 UZR standard deviation will be different at different positions, overall it is equal to 10 runs, so the difference between what UZR is saying and what the other metrics believe will be 10 runs or so, a substantial difference. At all positions excluding right field, only 10 out of 137 players, or 7.3% of the sample, are edited out. I’ll discuss right field later.
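As a rough sketch of the editing rule described above, here is how it might look in Python. The function and parameter names are illustrative, not taken from the original work, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize ratings: (value - mean) / standard deviation.

    Assumption: population standard deviation (pstdev); the article
    does not say whether the sample or population form was used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def should_remove(uzr_z, pmr_orig_z, pmr_alt_z, zr_z, dft_z, range_z, cutoff=1.0):
    """Return True if a player's UZR z-score is too far from the consensus.

    All inputs are z-scores for one player. The two PMR models are
    averaged first so that PMR counts as a single system.
    """
    pmr_z = (pmr_orig_z + pmr_alt_z) / 2
    consensus_z = (pmr_z + zr_z + dft_z + range_z) / 4
    gap = consensus_z - uzr_z
    # "More than 1 standard deviation away" in either direction.
    return abs(gap) > cutoff
```

With made-up z-scores, a player at 1.3 in UZR whose other systems average out to about -0.1 would be removed, while a player whose systems all sit within a tenth of a standard deviation of his UZR would stay in the sample.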
A little more about my test: I looked at edited samples at each position. My cutoff point for inclusion in this study was 725 innings played, or about half a season. That’s reasonable, in my opinion. It gives me a nice balance between decent playing time and decent sample size, which is going to equal about 20 players at each position. My method of comparison is correlation. Correlation tells us how well two metrics track each other. A correlation of 1 is perfect, and means that a positive change in one metric will come with a proportional positive change in the other. A correlation of -1 is perfect in the opposite direction: It means that a positive change in one metric will come with a proportional negative change in the other. For the purposes of this study, I’d like to now define what I think a good result will be. In my opinion, a correlation of .7 or better at any position and especially overall would be considered excellent, .6 or better would be good, and .5 or better would be okay.
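For readers who want to reproduce this kind of comparison, a minimal sketch of the Pearson correlation and the grading scale above might look like the following. The function names are hypothetical, and the label for anything under .5 is an addition for completeness, since the article only names the top three tiers.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def grade(r):
    """Map a correlation to the article's tiers (.7 / .6 / .5)."""
    if r >= 0.7:
        return "excellent"
    if r >= 0.6:
        return "good"
    if r >= 0.5:
        return "okay"
    return "below okay"  # label assumed; the article does not name this tier
```

Two metrics that move in lockstep give an r of 1, two that move in exactly opposite directions give -1, and everything in between is graded against the thresholds above.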
Now let’s look at the results. I’m not going to list all the players I looked at here, though I could list them on another page and link to it if people are interested. I will list the size of the edited sample, any players that may have been deleted, the correlations between each metric and edited UZR at each position, and discuss what those correlations mean and why a player was deleted, if any player was at that position. Let’s get started.
1B (n = 23, Deletions = Daryle Ward)
Range, correlation (r) = .58
PMR Original, r = .56
PMR Alternative, r = .63
DFT, r = .43
ZR, r = .83
Zone Rating is by far the best here. That seems about right; first basemen don’t cover a lot of ground, so ZR shouldn’t have many problems here in terms of plays made on balls out of zone. Beyond that, the two PMR models perform about equally to Range, which is a great accomplishment for Range, in my opinion, since PMR knows how many groundballs a player got to while Range can only estimate. I’d like to make two notes at this point: (1) For PMR in the infield, I used the ground ball-only numbers. That makes sense, since all the other metrics, except possibly DFTs, don’t include infield flies or infield line drives, which is correct in my opinion. (2) The alternate PMR model is one in which Park Factors are calculated using only visiting player data. Anyways, the correlations are pretty self-explanatory, but I do want to address Ward’s deletion.
UZR puts Ward at +10 runs, which means that he was better than about 90% of all players in baseball, according to UZR. PMR puts him in the 71st percentile, ZR puts him in the 57th, Range says 40th, and DFTs say that he’s better than only 19% of all MLB first basemen. So there’s disagreement across the board. In my opinion, it’s just much more likely that he’s somewhere near average (40th-71st percentile) than as good as UZR says or as bad as DFT does. What’s more, since Zone Rating correlates so well with UZR at first, the large difference here makes me suspect his UZR rating further.
2B (n = 20, Deletions = Robinson Cano)
Range, r = .67
PMR Original, r = .80
PMR Alternative, r = .83
DFT, r = .59
ZR, r = .79
While Range does okay here, it’s clear that PMR and Zone Rating just do better. The general agreement level at second base is very high, with only Robinson Cano being edited out, and even there, all systems but UZR agree that he sucked last year. His best non-UZR rating still puts him in the bottom third of MLB second basemen, while UZR puts him in the top quarter. There’s just no way that’s accurate. It reminds me of last year’s UZR ratings, where Miguel Cairo, who played second for New York in 2004, had something like a +22 UZR, when every other system hated him. Maybe it has something to do with the Yankees’ ball in play distribution, but the simple answer is that Cano is not nearly as good as UZR says he is.
3B (n = 20, Deletions = Mark Teahen, Troy Glaus)
Range, r = .80
PMR Original, r = .59
PMR Alternative, r = .50
DFT, r = .58
ZR, r = .59
Yes, I’m shocked. While all the other metrics are stuck in the same rut, Range, a non-PBP metric, absolutely nails it at third base. That’s strange because it’s generally tougher to estimate chances at a corner position, where ball in play distributions are more extreme. Estimates of chances at third and first base should generally be off a little more than in the middle infield, yet here, Range is the only metric that does very well.
The deleted players: Only Zone Rating saw Teahen as starkly negatively as UZR did, which put him in about the 1st percentile, in other words saying that 99% of all third basemen were better fielders last season. I doubt it; Range, PMR, and DFT are in agreement that he was somewhat below average, but certainly not bad or terrible. What’s more, according to Tangotiger’s Fan Scouting Report, fans see Teahen as a solidly above average player, so it’s hard to imagine him being the worst. With Glaus, PMR and UZR are in total agreement, which makes me think that maybe his UZR rating is in fact right, given that these are the two most sophisticated metrics in this test. On the other hand, Zone Rating and DFTs see him as mediocre, and not terrible, while Range thinks Glaus is actually slightly above average. Glaus’ Range rating differs starkly from his UZR, but the agreement between UZR and the other three metrics is almost low enough for him to be edited out without his Range rating, so there is no need to think that it’s Glaus’ Range rating that is pushing him out of the sample, and pushing Range’s correlation with edited UZR up.
SS (n = 23, Deletions = Jack Wilson, Miguel Tejada, Orlando Cabrera)
Range, r = .80
PMR Original, r = .80
PMR Alternative, r = .78
DFT, r = .61
ZR, r = .73
Not much to say at shortstop except that Range again does great, and that all the systems look pretty solid. Even Fielding % would have worked okay here, because the nature of the position is such that even the simplest metrics don’t do too badly. Which is why it’s surprising that three shortstops managed to get taken out of the sample. Wilson is deleted because every metric thinks he’s either very good or great, while UZR says he’s a big, fat 0. UZR is simply wrong here. With Tejada, PMR, ZR, and DFT say he’s just about average while Range says he’s slightly above, and UZR makes him out to be a Gold Glover. While Tejada’s great defensive performance last year makes his UZR rating this year seem more plausible, the fact that both PMR and ZR disagree completely makes me think that UZR is wrong again. O-Cab is a more confusing case. UZR says he was among the top 10% of all MLB shortstops defensively last year, which is consistent with his reputation. Zone Rating and DFTs also like Cabrera, though not nearly as much as UZR. Meanwhile, PMR and Range say he was somewhat below average. In the end, I think that the evidence points towards average being the correct rating for Cabrera, not top 10%.
LF (n = 20, Deletions = None)
Range, r = .86
PMR Original, r = .83
PMR Alternative, r = .79
DFT, r = .52
ZR, r = .86
Feels nice, doesn’t it? Except for DFTs, all systems agree fully with each other and with UZR. So yeah, Coco Crisp is awesome and Manny Ramirez sucks. Let’s move on.
CF (n = 22, Deletions = Dave Roberts, Kenny Lofton)
Range, r = .68
PMR Original, r = .73
PMR Alternative, r = .68
DFT, r = .75
ZR, r = .78
The high DFT correlation is an accident; the high correlation between UZR and ZR in Center Field is not. Center fielder ratings are going to be subject to a lot of discretionary flies that either the center fielder or someone else could catch. So if the center fielder is making all the plays on those discretionary balls in play, he’ll have an inflated rating everywhere but in ZR, where the ball in a “shared” zone won’t count. Maybe UZR is able to capture enough of that so as not to inflate a player’s value compared to what Range and PMR say. Maybe it’s something else; I don’t know. Certainly, the results for both Range and PMR are nothing to hang your head about.
Roberts does well in UZR and poorly in every other system. It’s possible that this is being caused by a Park Factor, but I doubt it; otherwise, it should show up in PMR. Lofton’s ratings are a bit scattered, but the fact is he was about average last season, not nearly as bad as UZR makes him look.
RF (n = 13, Deletions = Shawn Green, Jose Guillen, Trot Nixon, Jason Lane, Vladimir Guerrero, Juan Encarnacion, Nick Swisher, Emil Brown)
Range, r = .65
PMR Original, r = .47
PMR Alternative, r = .52
DFT, r = .27
ZR, r = .69
I told you Right Field was interesting. Prior to editing the data here, four of the five systems had a correlation between -.05 and .10. Great… So why do defensive metrics, even those with play-by-play data, match up so poorly with UZR? Frankly, I don’t know. Maybe it has to do with the data sources for these various metrics, maybe Stats does something in Right Field that makes UZR ratings “weird.” Maybe…I actually just don’t know. So you may ask, should I even be editing if the disagreements in Right Field are so strong? I think so. For one, I want to be consistent, and for two, I don’t think that it’s just the failures of other systems causing this disparity. They all correlate about the same with Tangotiger’s Fan Scouting Report, though DFTs are clearly the best in that sense. Anyways, I think the correlations above are roughly indicative of how each system matches up in Right Field with UZR, and you can see that ZR and Range are the best here; DFTs are useless.
Overall Analysis
First, here are the overall correlations between the various systems and edited UZR:
Range, r = .72
PMR Original, r = .71
PMR Alternative, r = .71
DFT, r = .54
ZR, r = .76
As you can see, there is surprisingly little difference between the metrics if you discount DFTs, which are clearly worse. You’ll also find that there is little difference between the two PMR models. But I wanted to dig deeper. What if we regress the various models on UZR? How close can we get to the gold standard? I got the following formula: .58*Zone Rating + .238*PMR Alternative + .236*Range, which correlates with UZR at .87. That is remarkably good. The split of “credit” between the systems using Beta values is 44% ZR, 31% Range, 25% PMR. The mean error is a minuscule 3.99. The original PMR model and DFTs were not significant and were removed from the regression.
Using just Zone Rating and Range, I can get a .84 correlation, which is damn good as well; the formula is .667*ZR + .332*Range.
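To make the two regression formulas concrete, here is a small sketch that applies them to a single player. The coefficients come straight from the text; the function names are made up for illustration, and two things are assumptions: that the inputs are on the same scale the regression was fit on (presumably runs above average), and that there is no intercept, since none is reported.

```python
def combined_three(zr, pmr_alt, range_rating):
    """Three-metric estimate: .58*ZR + .238*PMR Alternative + .236*Range.

    Assumes inputs are on the scale used in the regression and that
    the intercept is zero (the article does not report one).
    """
    return 0.58 * zr + 0.238 * pmr_alt + 0.236 * range_rating


def combined_two(zr, range_rating):
    """Two-metric estimate: .667*ZR + .332*Range (same assumptions)."""
    return 0.667 * zr + 0.332 * range_rating
```

For a hypothetical player rated +10 by every system, the three-metric formula gives about +10.5 and the two-metric formula about +10.0, so both versions land close to a weighted average of their inputs.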
So there you have it: A full comparison of the systems. For a single season evaluation, any of the more advanced metrics here is perfectly acceptable. A combination, however, is best.
Edits: I meant to put these up yesterday, but I want to provide a couple links:
Jon Weisman wrote a great article yesterday covering these same metrics.
Here’s the Baseball Primer discussion of this and the THT article.