# The measure of a man, Part IV

Some time ago, I introduced the concept of the proximity matrix as a possible way to generate similarity scores.  A proximity matrix is a nice side effect of a method called cluster analysis, which is a way to determine which two entries in a data set are most similar to one another.  It’s actually the way that a lot of dating sites work.  They measure several personality characteristics about you, and then look for the people who are closest to you in the proximity matrix.

At the time that I wrote the original proximity matrix piece, I used silly stats like AVG and RBI just to demonstrate how it worked.  The idea goes something like this.  Suppose that I had a data set in which there were only two factors that I cared about, power and speed.  There are some players who have a lot of power but no speed.  There are players who are the reverse.  There are some who are kinda middling on both.  You can imagine a two-axis graph on which power and speed (however you want to define them) are plotted.  We might be able to draw circles around a group of players who fit into a “type” (high speed/low power, high speed/medium power, etc.).  Of course, the bigger the circle, the less specific the “type”, but the smaller the circles, the more groups you have to deal with, until it becomes unwieldy.  The goal is to find a happy medium between a few big but heterogeneous groups and a bunch of small but more specific groups.

How to tell who goes together?  Well, on a two-dimensional graph, you only need the Pythagorean theorem (the actual one, not the Bill James one) to determine the distance between any two points.  But mathematically, you can add as many dimensions as you want, and you can use whatever variables you like.  It works on something called squared Euclidean distance.  But with what set of variables shall we measure a man?  Oh right… (part I, part II, part III)
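For the curious, here is a minimal sketch of what squared Euclidean distance and a proximity matrix look like in code.  This is not the SPSS procedure itself, just an illustration of the arithmetic; the function names and the toy two-factor (power, speed) players are made up for the example.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of dimensions")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proximity_matrix(profiles):
    """Return a nested dict of pairwise squared distances.

    `profiles` maps a player name to a tuple of scores.
    """
    names = list(profiles)
    # Every player is at distance zero from himself.
    matrix = {name: {name: 0.0} for name in names}
    for p, q in combinations(names, 2):
        d = squared_euclidean(profiles[p], profiles[q])
        matrix[p][q] = d
        matrix[q][p] = d  # distance is symmetric
    return matrix


# Hypothetical (power, speed) scores, invented for illustration only.
toy = {
    "slugger": (2.0, -1.0),
    "speedster": (-1.0, 2.0),
    "middling": (0.0, 0.0),
}

prox = proximity_matrix(toy)
for name, row in prox.items():
    print(name, row)
```

The smaller the number, the more alike the two players; adding a third or fourth factor just means longer tuples, and the function does not change.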

The four-variable structure that I’ve created (Ichiro-Howard, contact, risk, solid contact) is particularly suited to the rigors of the proximity matrix.  The numbers are engineered to be stable, which means these are actually skills (in theory anyway!).  I also designed them to be orthogonal (not correlated) to one another.  We are getting a read on four genuinely independent skills, as opposed to measuring HR and RBI, which are correlated with one another.  Plus, they’re already set up so that they have the same mean and standard deviation.

So let’s dive in, shall we?  Let’s start at the top.  Who, in 2008, had the profile closest to Albert Pujols?  I bet a list of names just came into your head, most likely starting with Alex Rodriguez*.  The connection is obvious.  Pujols’s performance is most similar to Rodriguez’s*, if only for the fact that they are currently the two best hitters in the game.  It’s also not what these similarity scores are about.

Consider:

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| Pujols | -.73324 | 1.47552 | -.71220 | 1.72005 |
| A-Roid* | .00000 (sic) | -1.00445 | .02606 | 1.42174 |

(All numbers 2008; mean = 0, SD = 1)

Pujols clearly favors the fly ball with little speed (the Howard end), while A-Rod* is actually perfectly balanced between the two extremes.  Pujols has excellent contact skills, while A-Rod is a standard deviation below the mean.  Pujols is a more reluctant swinger and doesn’t seem to take many risky swings.  A-Rod is about league average on that one.  Both are good at hitting the ball a long way.  We generally only look at that last sentence, but it’s clear that they are two different types of hitters.
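To put a number on “two different types of hitters”, here is a small sketch that breaks the squared Euclidean distance between the two profiles into its four per-factor pieces, using the 2008 values from the table above.  The variable names are mine, and this is only an illustration of the arithmetic.

```python
FACTORS = ("Ichiro-Howard", "Contact", "Risk", "Solid Contact")

# 2008 values copied from the table above.
pujols = (-0.73324, 1.47552, -0.71220, 1.72005)
arod = (0.00000, -1.00445, 0.02606, 1.42174)

# Squared gap on each factor; these add up to the squared Euclidean distance.
gaps = {f: (p - a) ** 2 for f, p, a in zip(FACTORS, pujols, arod)}
total = sum(gaps.values())

# Print the factors from biggest gap to smallest.
for factor, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{factor:<14} {gap:6.3f}  ({gap / total:.0%} of total)")
print(f"{'Total':<14} {total:6.3f}")
```

The total comes out to a bit over 7, and most of it is the contact gap; solid contact, the one thing we usually compare them on, contributes almost nothing.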

So who’s actually similar to Pujols?  The top 5 are listed below, in order of similarity.

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| A. Pujols | -.73324 | 1.47552 | -.71220 | 1.72005 |
| J. Hairston | -.76531 | 1.67798 | -.19744 | 1.84334 |
| B. Giles | -.37557 | 1.62536 | -.73731 | 1.26229 |
| S. Casey | -.58657 | .93159 | -.64673 | 2.52618 |
| A. Ethier | -.38172 | 1.25153 | -.15381 | .98470 |
| I. Kinsler | -.92335 | .74409 | -.21759 | 1.22831 |

Jerry Hairston?  For what it’s worth, Hairston hit .326/.384/.487 in 2008, although with some major platoon splits.  His solid contact number, which I’ve previously found is the least stable of the four factors, jumped from -1.60 to 1.84 from 2007 to 2008.  The year before that (2006), he’d been at -1.59.  Hairston is an outlier who had a very lucky year.  Casey is something of a slow plodder, but is clearly half a standard deviation behind Pujols in contact.  His big number on solid contact comes from a high LD% compiled over 100-some PAs last year.  Giles is less of a big fly hitter and doesn’t make solid contact as well.  However, he prefers to sit back and wait (like Pujols does) and prefers a big flyball style (although less so).  In other words, Brian Giles is a poor man’s Albert Pujols.  A very poor man.

Ethier is in 4th place, and you might notice that he’s half a standard deviation behind Pujols in three categories.  Kinsler has the same problem.  Ethier might have a leg up though.  His contact skills will likely get better with age, and he will take fewer risks as he ages, bringing him more in line with Pujols, so if there’s someone who’s got a chance to fit the Pujols mold, it’s Ethier, at least according to this.  But, may I offer an alternate conclusion?  There’s no one quite like Albert Pujols!
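If you want to check the ordering yourself, here is a rough sketch that ranks the five listed players by squared Euclidean distance from Pujols, using the numbers from the table above.  It only looks at those five (the real run compared everyone in the data set), and the names `rank_by_similarity` and `candidates` are mine, not anything from SPSS.

```python
def squared_euclidean(a, b):
    """Sum of squared differences across the four factors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_similarity(target, candidates):
    """Return (name, distance) pairs, closest profile first."""
    scored = [(name, squared_euclidean(target, profile))
              for name, profile in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# (Ichiro-Howard, Contact, Risk, Solid Contact), 2008, from the table above.
pujols = (-0.73324, 1.47552, -0.71220, 1.72005)
candidates = {
    "I. Kinsler": (-0.92335, 0.74409, -0.21759, 1.22831),
    "S. Casey": (-0.58657, 0.93159, -0.64673, 2.52618),
    "J. Hairston": (-0.76531, 1.67798, -0.19744, 1.84334),
    "A. Ethier": (-0.38172, 1.25153, -0.15381, 0.98470),
    "B. Giles": (-0.37557, 1.62536, -0.73731, 1.26229),
}

ranking = rank_by_similarity(pujols, candidates)
for name, distance in ranking:
    print(f"{name:<12} {distance:.3f}")
```

The candidates are entered in scrambled order on purpose; sorting by distance puts them back in the order shown in the table, with Hairston and Giles well ahead of the other three.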

What about a guy like Evan Longoria?  His top 5 comparables from last year: Nelson Cruz, A-Rod*, John Bowker, Brad Hawpe, and Fernando Tatis.  Tatis had something of a lightning in a bottle year last year, but that’s not a bad list of comparables.  Good power, lots of K’s… makes sense.

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| E. Longoria | -.44246 | -1.07672 | .20784 | 1.80886 |
| N. Cruz | -.26621 | -.85020 | .60302 | 1.47288 |
| A. Rodriguez* | .00000 | -1.00445 | .02606 | 1.42174 |
| J. Bowker | -.19865 | -.52904 | .02202 | 1.46338 |
| B. Hawpe | -.44018 | -.84058 | .89830 | 1.68557 |
| F. Tatis | .17184 | -.66749 | .23918 | 1.65625 |

Let’s do one more.  Raul Ibanez.  His Top 5.

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| R. Ibanez | -.80166 | .62130 | .24022 | .67452 |
| C. Coste | -.74362 | .22284 | .12103 | .95755 |
| J. Morneau | -.85034 | .91887 | .55823 | .17808 |
| S. Rolen | -1.39143 | .86767 | .40410 | .54186 |
| J. Kent | -.46074 | .88277 | .81364 | .59200 |
| I. Kinsler | -.92335 | .74409 | -.21759 | 1.22831 |

Again, no perfect matches, but a bunch of guys who like fly balls, but also make a lot of contact, which suppresses power.  These guys hit a lot of doubles.

Voila!  A good solid way to compare hitters to one another that doesn’t involve hazy qualitative judgments of player abilities.

### 11 Responses to The measure of a man, Part IV

1. Millsy says:

This is well-done. I imagine this was a lot of fun to put together, too. Any idea on the application of this stuff? It’s really fun to find out who’s similar and everything based on more innate abilities, but as one who called for “Clinical Sabermetricians”, what do you see as the possible uses for this? Just a curiosity.

2. Matt Swartz says:

This really is incredible and extremely interesting. I have enjoyed the whole series, though I haven’t been posting that each time. I know you mentioned it was to be a four-part series, but I really hope you keep posting about it. I’d love to hear more about which players are comparable to which.
I guess the goal is to lead into doing player projections, and more properly evaluate aging curves by figuring out what skills actually age at what rates? I can’t remember if you mentioned this already, but are aging curves very similar for players in similar groups? I really wouldn’t be surprised if they were, and it seems like a lot less clumsy than the B-Pro B-Ref numbers. I’d also imagine that as these statistics stabilize more quickly, you can analyze players with fewer PA.
Thanks for posting these. It must have taken a lot of hard work, but it really is incredibly well done and interesting.

3. Pizza Cutter says:

Millsy, not everything has to have a clinical bent. (Besides, I started this series before I wrote that.) Actually, Matt anticipates my thinking a little bit. This is the beginning of research on why different types of players age differently and eventually, might be incorporated into a projection system.
But since you mention the clinical bent, those risk and contact numbers could probably be coached a bit.

4. Millsy says:

I agree not everything must be clinical. I hope you didn’t take the comment as some sort of criticism, I was just curious what you had in mind. I haven’t seen a lot of this type of analysis used in baseball, and I’ve thought to myself about ways in which things like Cluster Analysis, PCA, etc. can be applied. I’m glad to see someone delve into this area. I’m guessing that with your psychology/social science training, you have a lot more experience with these methods than many of those trained on the side of Econometrics. I’ve enjoyed this and I think it could open some eyes to other techniques that have been overlooked in the past.

5. Pizza Cutter says:

The problem is there’s a bold button and an italics button, but not a “self-aware sarcasm” button ;-). Thanks for the kind words.

6. Shane says:

Great article, do you plan on posting the data so we can all see who our favorite players compare to? :)

7. jw says:

Colour me curious. I would love to see the code for that cluster analysis you’re doing. My Prof’s major work right now is on multi-model image registration by clustering and it sounds remarkably similar to what you’re doing now. Do you observe clusters of similar players with a lot of space in between? Or is it pretty well distributed?

8. Pizza Cutter says:

I can probably figure out a way to get that file up. I don’t know if Google docs will hold it though. As to the code, I use SPSS, and in this particular case, I didn’t really need code. (From a strictly CPU time point of view, this took almost no time to run.) I just used the graphical interface and asked SPSS to run the cluster analysis while saving the prox matrix.

9. Samg says:

Did you only do 2008? Would it be possible to run career numbers? I would love to see them. By the way, I love the metric.

10. Pizza Cutter says:

I only ran 2003-2008 because those are the years that I have all the retrosheet data that I need to get the ten initial factors that I need. If I wanted to do more years or career stats, I would need to re-conceptualize what should go into the four factors. It’s do-able, although I’d need a slightly different conceptual framework.
