# A Detailed Comparison of Defensive Metrics

February 3, 2006 10 Comments

**Introduction**

Today on The Hardball Times, I have an article up about various defensive metrics. This is a continuation of that article: a comparison of the metrics mentioned there to edited UZR.

First of all, let me explain why I’m running this test, and why I chose the parameters that I did. I think that UZR is the gold standard of all defensive metrics. It’s occasionally wrong, but it is the best-constructed system that I know of. So, in my opinion, all other defensive metrics should be evaluated against UZR, until one that is equal or better in design comes out. That’s why I’m comparing these metrics to UZR. The metrics I will refer to in this article, by the way, are: The Probabilistic Model of Range (PMR), David Pinto’s metric — I’m testing both his models; Zone Rating (ZR), converted into runs above average by Chris Dial; Davenport Fielding Translations (DFT), Clay Davenport’s fielder ratings on Baseball Prospectus; and Range, which is my system.

I am, however, editing UZR before I do any tests. Just based on standard error, UZR will occasionally be off — this isn’t an error in the construction of the system so much as a mathematical certainty. So, in my opinion, there’s nothing wrong with editing UZR ratings *if there is substantial reason to suggest that a UZR rating is incorrect*. What’s tougher is deciding on a method of editing: It’s important to choose a method that is both stringent and objective. Here’s what I did.

First, I standardized all ratings using z-scores. A z-score measures how many standard deviations away from the mean a number is. Because the different systems have different spreads in their ratings (Range ratings can get a little too high and a little too low, while Zone Rating is bunched near the mean), it’s not a good idea to use their run ratings in any kind of editing system. Z-scores allow us to avoid that. Next, I averaged the two PMR models’ z-scores for the player in question, so that PMR would not have a disproportionately large impact on the editing algorithm. I then added that to the z-scores from the other three methods being compared, and divided by 4 for an average z-score. Finally, I subtracted the UZR z-score from that average z-score. If the result was greater than 1 — my cutoff point — I removed that player from the sample. In essence, what that means is that if a player’s UZR is more than 1 standard deviation away from his average rating among the other systems, he is removed from the sample. While 1 UZR standard deviation will be different at different positions, overall it is equal to 10 runs, so the difference between what UZR is saying and what the other metrics believe will be 10 runs or so, a substantial difference. At all positions excluding right field, only 10 out of 137 players, or 7.3% of the sample, were edited out. I’ll discuss right field later.
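The editing rule just described can be sketched in a few lines of code. This is a minimal sketch; the function and variable names are mine, not part of any of these systems:

```python
def z_scores(ratings):
    """Convert a position's run ratings to z-scores (standard deviations from the mean)."""
    n = len(ratings)
    mean = sum(ratings) / n
    sd = (sum((r - mean) ** 2 for r in ratings) / n) ** 0.5
    return [(r - mean) / sd for r in ratings]

def keep_player(z_uzr, z_pmr1, z_pmr2, z_zr, z_dft, z_range, cutoff=1.0):
    """Keep a player unless his UZR z-score is more than `cutoff`
    standard deviations from his average z-score in the other systems."""
    z_pmr = (z_pmr1 + z_pmr2) / 2  # average the two PMR models first
    avg_other = (z_pmr + z_zr + z_dft + z_range) / 4
    return abs(avg_other - z_uzr) <= cutoff
```

A player whose UZR z-score sits right on top of the other systems' consensus is kept; one whose UZR z-score is more than a full standard deviation away is dropped.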

A little more about my test: I looked at edited samples at each position. My cutoff point for inclusion in this study was 725 innings played, or about half a season. That’s reasonable, in my opinion. It gives me a nice balance between decent playing time and decent sample size, which works out to about 20 players at each position. My method of comparison is correlation. Correlation tells us how well two metrics track each other. A correlation of 1 is perfect, and means that a positive change in one metric will lead to an equivalent positive change in the other. A correlation of -1 is perfect in the opposite direction: it means that a positive change in one metric will lead to an equivalent *negative* change in the other. For the purposes of this study, I’d like to now define what I think a good result will be. In my opinion, a correlation of .7 or better at any position, and especially overall, would be excellent, .6 or better would be good, and .5 or better would be okay.
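For readers who want to check the numbers themselves, the correlation used throughout is the standard Pearson coefficient. A minimal sketch (the function name is mine):

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Two metrics that move in lockstep give r = 1; two that move in exact opposition give r = -1.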

Now let’s look at the results. I’m not going to list all the players I looked at here, though I could list them on another page and link to it if people are interested. I will list the size of the edited sample, any players that may have been deleted, and the correlations between each metric and edited UZR at each position, and discuss what those correlations mean and, where applicable, why a player was deleted. Let’s get started.

**Results**

__1B (n =23, Deletions = Daryle Ward)__

**Range, correlation (r) =** .58

**PMR Original, r =** .56

**PMR Alternative, r =** .63

**DFT, r =** .43

**ZR, r =** .83

Zone Rating is by far the best here. That seems about right; first basemen don’t cover a lot of ground, so ZR shouldn’t have many problems here in terms of plays made on balls out of zone. Beyond that, the two PMR models perform about as well as Range, which is a great accomplishment for Range, in my opinion, since PMR knows how many ground balls a player got to while Range can only estimate. I’d like to make two notes at this point: (1) for PMR in the infield, I used the ground ball-only numbers, which makes sense since all the other metrics, except possibly DFTs, don’t include infield flies or infield line drives, which is correct in my opinion; and (2) the alternative PMR model is one in which park factors are calculated using only visiting-player data. Anyways, the correlations are pretty self-explanatory, but I do want to address Ward’s deletion.

UZR puts Ward at +10 runs, which means that he was better than about 90% of all players in baseball, according to UZR. PMR puts him in the 71st percentile, ZR puts him in the 57th, Range says 40th, and DFTs say that he’s better than only 19% of all MLB first basemen. So there’s disagreement across the board. In my opinion, it’s just much more likely that he’s somewhere near average (40th-71st percentile) than as good as UZR says or as bad as DFT does. More so, since Zone Rating correlates so well with UZR at first, the large difference here makes me suspect his UZR rating further.

__2B (n =20, Deletions = Robinson Cano)__

**Range, r =** .67

**PMR Original, r =** .80

**PMR Alternative, r =** .83

**DFT, r =** .59

**ZR, r =** .79

While Range does okay here, it’s clear that PMR and Zone Rating just do better. The general agreement level at second base is very high, with only Robinson Cano being edited out, and even there, all systems but UZR agree that he sucked last year. His best non-UZR rating still puts him in the bottom third of MLB second basemen, while UZR puts him in the *top* quarter. There’s just no way that’s accurate. It reminds me of last year’s UZR ratings, where Miguel Cairo, who played second for New York in 2004, had something like a +22 UZR, when every other system hated him. Maybe it has something to do with the Yankees’ ball in play distribution, but the simple answer is that Cano is not nearly as good as UZR says he is.

__3B (n =20, Deletions = Mark Teahan, Troy Glaus)__

**Range, r =** .80

**PMR Original, r =** .59

**PMR Alternative, r =** .50

**DFT, r =** .58

**ZR, r =** .59

Yes, I’m shocked. While all the other metrics are stuck in the same rut, Range, a non-PBP metric, absolutely nails it at third base. That’s strange because it’s generally tougher to estimate chances at a corner position, where there are more extreme ball in play distributions. Estimates of chances at third and first base should generally be off a little more than in the middle infield, yet here, Range is the only metric that does very well.

The deleted players: Only Zone Rating saw Teahan as starkly negatively as UZR did, which put him in about the 1st percentile, in other words saying that 99% of all third basemen were better fielders last season. I doubt it; Range, PMR, and DFT agree that he was somewhat below average, but certainly not *bad* or *terrible*. Moreover, according to Tangotiger’s Fan Scouting Report, fans see Teahan as a solidly *above* average player, so it’s hard to imagine him being the worst. With Glaus, PMR and UZR are in total agreement, which makes me think that maybe his UZR rating is in fact right, given that these are the two most sophisticated metrics in this test. On the other hand, Zone Rating and DFTs see him as mediocre, not terrible, while Range thinks Glaus is actually slightly *above* average. Glaus’ Range rating differs starkly from his UZR, but the disagreement between UZR and the other three metrics is almost large enough for him to be edited out even without his Range rating, so there is no reason to think that it’s Glaus’ Range rating that is pushing him out of the sample and thereby pushing Range’s correlation with edited UZR up.

__SS (n =23, Deletions = Jack Wilson, Miguel Tejada, Orlando Cabrera)__

**Range, r =** .80

**PMR Original, r =** .80

**PMR Alternative, r =** .78

**DFT, r =** .61

**ZR, r =** .73

Not much to say at shortstop except that Range again does great, and that all the systems look pretty solid. Even Fielding % would have worked okay here, because the nature of the position is such that even the simplest metrics don’t do too badly. Which is why it’s surprising that three shortstops managed to get taken out of the sample. Wilson is deleted because every metric thinks he’s either very good or great, while UZR says he’s a big, fat 0. UZR is simply wrong here. With Tejada, PMR, ZR, and DFT say he’s just about average while Range says he’s slightly above, and UZR makes him out to be a Gold Glover. While Tejada’s great defensive performance last year makes his UZR rating this year seem more plausible, the fact that both PMR and ZR disagree *completely* makes me think that UZR is wrong again. O-Cab is a more confusing case. UZR says he was among the top 10% of all MLB shortstops defensively last year, which is consistent with his reputation. Zone Rating and DFTs also like Cabrera, though not nearly as much as UZR. Meanwhile, PMR and Range say he was somewhat below average. In the end, I think that the evidence points towards average being the correct rating for Cabrera, not top 10%.

__LF (n =20, Deletions = None)__

**Range, r =** .86

**PMR Original, r =** .83

**PMR Alternative, r =** .79

**DFT, r =** .52

**ZR, r =** .86

Feels nice, doesn’t it? Except for DFTs, all systems agree fully with each other and with UZR. So yeah, Coco Crisp is awesome and Manny Ramirez sucks. Let’s move on.

__CF (n =22, Deletions = Dave Roberts, Kenny Lofton)__

**Range, r =** .68

**PMR Original, r =** .73

**PMR Alternative, r =** .68

**DFT, r =** .75

**ZR, r =** .78

The high DFT correlation is an accident; the high correlation between UZR and ZR in center field is not. Center fielder ratings are going to be subject to a lot of discretionary flies that either the center fielder or someone else could catch. So if the center fielder is making all the plays on those discretionary balls in play, he’ll have an inflated rating everywhere but in ZR, where a ball in a “shared” zone won’t count. Maybe UZR is able to capture enough of that so as not to inflate a player’s value compared to what Range and PMR say. Maybe it’s something else; I don’t know. Certainly, the results for both Range and PMR are nothing to hang your head about.

Roberts does well in UZR and poorly in every other system. It’s possible that this is being caused by a Park Factor, but I doubt it; otherwise, it should show up in PMR. Lofton’s ratings are a bit scattered, but the fact is he was about average last season, not nearly as bad as UZR makes him look.

__RF (n =13, Deletions = Shawn Green, Jose Guillen, Trot Nixon, Jason Lane, Vladimir Guerrero, Juan Encarnacion, Nick Swisher, Emil Brown)__

**Range, r =** .65

**PMR Original, r =** .47

**PMR Alternative, r =** .52

**DFT, r =** .27

**ZR, r =** .69

I told you Right Field was interesting. Prior to editing the data here, four of the five systems had a correlation between -.05 and .10. Great… So why do defensive metrics—even those with play-by-play data—match up so poorly with UZR? Frankly, I don’t know. Maybe it has to do with the data sources for these various metrics; maybe Stats does something in Right Field that makes UZR ratings “weird.” Maybe…I actually just don’t know. So you may ask, should I even be editing if the disagreements in Right Field are so strong? I think so. For one, I want to be consistent, and two, I don’t think that it’s just the failures of other systems causing this disparity. They all correlate about the same with Tangotiger’s Fan Scouting Report, though DFTs are clearly the best in that sense. Anyways, I think the correlations above are roughly indicative of how each system matches up in Right Field with UZR, and you can see that ZR and Range are the best here; DFTs are useless.

__Overall Analysis__

First, here are the overall correlations between the various systems and edited UZR:

**Range, r =** .72

**PMR Original, r =** .71

**PMR Alternative, r =** .71

**DFT, r =** .54

**ZR, r =** .76

First, you can see that there is surprisingly little difference between the metrics if you discount DFTs, which are clearly worse. You’ll also find that there is little difference between the two PMR models. But I wanted to dig deeper. What if we regress the various models on UZR? How close can we get to the gold standard? I got the following formula—.58*Zone Rating + .238*PMR Alternative + .236*Range—which correlates with UZR at .87. That is remarkably good. The split of “credit” between the systems using Beta values is 44% ZR, 31% Range, 25% PMR. The mean error is a minuscule 3.99. The original PMR model and DFTs were not significant and were removed from the regression.

Using just Zone Rating and Range, I can get a .84 correlation, which is damn good as well; the formula is .667*ZR + .332*Range.
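Both blends can be written as plain functions, using the coefficients reported above. The function names are mine; the inputs are each system’s run rating for a player:

```python
def blended_uzr_estimate(zr, pmr_alt, range_rating):
    """Three-system blend (correlates with edited UZR at .87, per the regression above)."""
    return 0.58 * zr + 0.238 * pmr_alt + 0.236 * range_rating

def two_system_estimate(zr, range_rating):
    """Two-system blend (correlates with edited UZR at .84)."""
    return 0.667 * zr + 0.332 * range_rating
```

So a player rated +10 runs by all three systems comes out at about +10.5 by the three-system blend, since the coefficients sum to slightly more than 1.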

So there you have it: A full comparison of the systems. For a single season evaluation, any of the more advanced metrics here is perfectly acceptable. A combination, however, is best.

**Edits**: I meant to put these up yesterday; here are a couple of links:

Jon Weisman wrote a great article on SI.com yesterday covering these same metrics.

Here’s the Baseball Primer discussion of this and the THT article.

I can’t believe no one comments on this stuff. Also, why not come out with something like UZRange to put a name to the most accurate system yet in determining a player’s worth defensively?

Why do people who create these systems hold on to their beliefs so strongly when they could do something as simple as what you just demonstrated?

No ONE PERSON ever has the right answer…even Gods like Bill James relied on randoms like Voros McCracken to introduce them to theories/ideas that hadn’t crossed their radar previously…

damn, ranting again.

Keep up the good work.

Personally, I find internal correlations between the various systems less persuasive than the correlation between each system and known team defence markers. Park-adjusted DER (using Tango’s figures for the adjustments) provides a good measure of team range. The sum of individual fielding data should compare well with park-adjusted DER. A comparison of this type is much easier with Davenport data than with the other systems.

[…] In the same vein, David Gassko of the Hardball Times compared the various metrics in articles located here and here. (via Baseball Musings) […]

Great article! I was hoping somebody would do this. I wanted to do something like this myself but don’t have all the data.

Lee

As Caleb says, no one person ever has the right answer. So rather than trying to come up with an ultimate method, perhaps we can combine all the methods and arrive at an aggregate ranking. What you would do is rank every player at a position on each metric separately. Then average the ranks to get a final ranking.

For example, a player ranks 1st on UZR, 2nd on PMR, 2nd on Gassko Range, 1st on DFT, 2nd on range factor and 7th on ZR. If you had just been looking at ZR, you would come to the conclusion that this player has just pretty good range while the other methods say he has elite range. His average rank is 2.5 which is still very high but not as high as a player who finished 1 or 2 on every single metric.

Yes, I’d throw range factor in there too . Although it has obvious flaws, it is the one method which does not depend on either theoretical probabilities or subjective evaluation. On the other end of the spectrum, you might even want to add Tangotiger’s fan fielding numbers to the mix.

Obviously, you could use some kind of weighted average instead of a straight average but I didn’t want to make my explanation more difficult than necessary.

Lee

[…] David Gassko’s math used in the Hardball Times article can be found at his own blog, Statistically Speaking. –> […]

I think Lee makes an interesting point, and I imagine that we’ll eventually get to that point – with a new fancy name of course that will include an M since I think that is the only letter not used in the defense acronym war so far.

hi david!!!

my GAWD i wish i had your brains. i been wanting to do something like this for a LONG time, but i had NO idea how to. like i don’t have ANY idea where to get a z-score from…

– i was REALLY glad you included tango’s fan scoring (and i am proud to be one of the scorers) to see how well us serious fans rated with our eyeballs compared to numbers.

one thing that is hard to include in defensive ratings is stuff like fielder #1 calling off fielder #2 in fielder #2’s zone. maybe yall brains can figger out a way to evaluate each chance or play and figger out how to link them together with numbers the way we do with our eyes

lisa

David,

Very interesting work. I particularly like your regression of the models to UZR.

I am a little uncertain of the value of correlating different imperfect models. How do we know that the assumption that UZR is the best is correct?

It’s the most robust model, and incorporates the most information. It’s not perfect, which is why I edit it, but it is the best.