March 26, 2009
Some time ago, I introduced the concept of the proximity matrix as a possible way to generate similarity scores. A proximity matrix is a nice side effect of a method called cluster analysis, which is a way to determine which two entries in a data set are most similar to one another. It’s actually the way that a lot of dating sites work. They measure several personality characteristics about you, and then look for the people who are closest to you in the proximity matrix.
At the time that I wrote the original proximity matrix piece, I used silly stats like AVG and RBI just to demonstrate how it worked. The idea goes something like this. Suppose that I had a data set in which there were only two factors that I cared about, power and speed. There are some players who have a lot of power but no speed. There are players who are the reverse. There are some who are kinda middling on both. You can imagine a two-axis graph on which power and speed (however you want to define them) are plotted. We might be able to draw circles around a group of players who fit into a “type” (high speed/low power, high speed/medium power, etc.). Of course, the bigger the circle, the less specific the “type”, but the smaller the circles, the more groups you have to deal with, until it becomes unwieldy. The goal is to find a happy medium between a few big but heterogeneous groups and a bunch of small but more specific groups.
How to tell who goes together? Well, on a two-dimensional graph, you only need the Pythagorean theorem (the actual one, not the Bill James one) to determine the distance between any two points. But mathematically, you can add as many dimensions as you want, and you can use whatever variables you like. The method works on something called squared Euclidean distance: the sum of the squared differences between two players on each variable. But with what set of variables shall we measure a man? Oh right… (part I, part II, part III)
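As a rough sketch of that distance calculation: squared Euclidean distance is the sum of the squared differences on each dimension, so the same function handles two variables or twenty. The function name and the sample numbers below are invented for illustration; they are not real player values.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of dimensions")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two dimensions (power, speed): the Pythagorean theorem without the square root.
print(squared_euclidean([3.0, 0.0], [0.0, 4.0]))  # 25.0

# Four dimensions work exactly the same way.
print(squared_euclidean([1.0, 0.0, -1.0, 2.0], [0.0, 0.0, 1.0, 0.0]))  # 9.0
```

Skipping the square root doesn't change who is closest to whom, since a smaller squared distance always means a smaller distance.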
The four-variable structure that I’ve created (Ichiro-Howard, contact, risk, solid contact) is particularly suited to the rigors of the proximity matrix. The numbers are engineered to be stable, which means these are actually skills (in theory anyway!). I also designed them to be orthogonal (not correlated) to one another. We are getting a read on four genuinely independent skills, as opposed to measuring HR and RBI, which are correlated with one another. Plus, they’re already set up so that they have the same mean and standard deviation.
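Why does the same mean and standard deviation matter? In a distance calculation, a variable with a bigger spread would otherwise swamp the others. Here is a minimal sketch of putting a raw stat on a mean-0, SD-1 scale. It is not the procedure used to build the four factors (those came out of the earlier parts of this series); the `standardize` helper and the raw values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1 (z-scores)."""
    m = mean(values)
    s = pstdev(values)  # population SD; a sample SD would also be a defensible choice
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]


# Hypothetical raw values for one variable across four hitters.
raw = [10, 20, 30, 40]
z = standardize(raw)

print([round(v, 2) for v in z])
print(round(mean(z), 6), round(pstdev(z), 6))
```

Once every variable is on this scale, a one-unit gap means "one standard deviation" no matter which variable it comes from.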
So let’s dive in, shall we? Let’s start at the top. Who, in 2008, had the profile closest to Albert Pujols? I bet a list of names just came into your head, most likely starting with Alex Rodriguez*. The connection is obvious. Pujols’s performance is most similar to that of Rodriguez*, if only for the fact that they are currently the two best hitters in the game. It’s also not what these similarity scores are about.
(All numbers 2008; mean = 0, SD = 1)
Pujols clearly favors the fly ball with little speed (the Howard end), while A-Rod* is actually perfectly balanced between the two extremes. Pujols has excellent contact skills, while A-Rod is a standard deviation below the mean. Pujols is a more reluctant swinger and doesn’t seem to take many risky swings. A-Rod is about league average on that one. Both are good at hitting the ball a long way. We generally only look at that last sentence, but it’s clear that they are two different types of hitters.
So who’s actually similar to Pujols? The top 5 are listed below, in order of similarity.
Jerry Hairston? For what it’s worth, Hairston hit .326/.384/.487 in 2008, although with some major platoon splits. His solid contact number, which I’ve previously found is the least stable number of the four factors, jumped from -1.60 to 1.84 from 2007 to 2008. The year before that (2006), he’d been at -1.59. Hairston is an outlier who had a very lucky year. Casey is something of a slow plodder, but is clearly half a standard deviation behind Pujols in contact. His big number on solid contact comes from a high LD% that came over 100-some PAs last year. Giles is less of a big fly hitter and doesn’t make solid contact as well. However, he prefers to sit back and wait (like Pujols does) and prefers a big flyball style (although less so). In other words, Brian Giles is a poor man’s Albert Pujols. A very poor man.
Ethier is in 4th place, and you might notice that he’s half a standard deviation behind Pujols in three categories. Kinsler has the same problem. Ethier might have a leg up, though. His contact skills will likely get better with age, and he will take fewer risks as he ages, bringing him more in line with Pujols. So if there’s someone who’s got a chance to fit the Pujols mold, it’s Ethier, at least according to this. But may I offer an alternate conclusion? There’s no one quite like Albert Pujols!
What about a guy like Evan Longoria? His top 5 comparables from last year: Nelson Cruz, A-Rod*, John Bowker, Brad Hawpe, and Fernando Tatis. Tatis had something of a lightning-in-a-bottle year last year, but that’s not a bad list of comparables. Good power, lots of K’s… makes sense.
Let’s do one more. Raul Ibanez. His Top 5.
Again, no perfect matches, but a bunch of guys who like fly balls, but also make a lot of contact, which suppresses power. These guys hit a lot of doubles.
Voila! A good solid way to compare hitters to one another that doesn’t involve hazy qualitative judgments of player abilities.
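For anyone who wants to try this at home, the "top 5" lookup can be sketched roughly as follows: compute the squared Euclidean distance from one hitter to every other hitter, sort, and keep the nearest few. The hitter names and z-scores here are invented placeholders, not the 2008 numbers, and `most_similar` is a hypothetical helper rather than the exact procedure behind the lists above.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def most_similar(target, profiles, k=5):
    """Return the k names closest to `target`, nearest first."""
    distances = [
        (squared_euclidean(profiles[target], vec), name)
        for name, vec in profiles.items()
        if name != target  # a hitter is trivially identical to himself
    ]
    return [name for _, name in sorted(distances)[:k]]


# Invented z-scores, in the order: Ichiro-Howard, contact, risk, solid contact.
profiles = {
    "Hitter A": [1.0, 1.0, -1.0, 2.0],
    "Hitter B": [0.8, 0.5, -0.9, 1.5],
    "Hitter C": [0.0, -1.0, 0.0, 2.0],
    "Hitter D": [-1.5, 0.5, 1.0, -0.5],
}

print(most_similar("Hitter A", profiles, k=2))  # ['Hitter B', 'Hitter C']
```

Running the same lookup for every hitter in turn gives you the full proximity matrix, one row at a time.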