# Stats 204: The proximity matrix OR Re-visioning similarity scores

I suppose that when Bill James invented the similarity score, it was an attempt to answer “Who exactly is this guy like?”  Is he the second coming of Joe DiMaggio (the power hitter who never strikes out), or is he the second coming of Dave Kingman (the power hitter who strikes out a little more often)?  Maybe he’s the second coming of Tommy Hinzo.  How can we tell?  Mr. James put together a formula that attempted to answer exactly that question.  The formula itself is based on a fairly simple system of “start with 1000” and subtract points for differences in various statistical categories.  It’s not an awful system and generally produces some decent comparisons, but mathematically, we can do better than that!
Let’s pretend that there are only two stats in baseball that matter: walks and strikeouts.  We might use raw numbers of BB and K, but it makes more sense to put them into rate form.  We might classify players, in a very rough way, as being players who neither walk nor strike out much, players who walk and strike out a lot, players who strike out a lot but don’t walk much, etc.  If we want to get more fine-grained, we can start saying medium or medium-low, etc.  Or if we want to find the player whose BB and K rates match most closely, we can start digging through the data.  If Player A strikes out 15% of the time and walks 7%, then Player B who strikes out 14.8% of the time and walks 7.1% is a good match.  Player C who strikes out 23% of the time and walks 5% isn’t a good match.  But how good a match… or a non-match is he?  And what do we do when we get beyond two stats of interest?  How do we account for walks, strikeouts, and home runs, singles, or anything else for that matter?
Enter the proximity matrix.  Let’s go back to our “walks and strikeouts only” example.  We could plot walk rate and strikeout rate on a standard two-dimensional axis (graph paper), and label all the players.  Then we could measure (with a ruler!) which player is the closest to any other player.  That works great when there are only two variables.  Three-dimensional graph paper (for three variables) is harder to come by, and by the time we get to four variables, well now we’re into hyperspace.  (Yes, I love Star Trek too.)  Fortunately, mathematics isn’t bound by such constraints, and it’s possible to calculate the distance between any two points in four (or more; there’s no limit) dimensions.  The measure I used is called the squared Euclidean distance: the sum, across all the stats, of the squared differences between two players.  In fact, we can get a matrix of how far every player in our sample is from every other player.  That’s the lovely thing about computers: they do all the heavy lifting, and do it in rather short order.
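For anyone who wants to see the arithmetic, here is a minimal sketch of a proximity matrix built from squared Euclidean distances, using the Player A, B, and C rates from the walks-and-strikeouts example. The function names are my own for illustration, and this is plain Python rather than whatever a stats package does under the hood.

```python
def squared_euclidean(a, b):
    """Sum of squared differences across every dimension (stat)."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proximity_matrix(points):
    """Return an n-by-n list of lists holding every pairwise distance."""
    return [[squared_euclidean(p, q) for q in points] for p in points]


# (K rate, BB rate) for the three players in the example above.
players = {
    "A": (0.150, 0.070),
    "B": (0.148, 0.071),
    "C": (0.230, 0.050),
}
names = list(players)
matrix = proximity_matrix([players[n] for n in names])

for name, row in zip(names, matrix):
    print(name, ["%.6f" % d for d in row])
```

The same two functions work unchanged for four stats or forty, since nothing in them depends on the number of dimensions.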
And we can use whatever criteria or stats are of interest.  Want to look at player height and weight?  Want to look at career OBP and SLG and do it up to age 29?  Want to include every major leaguer ever?  Want to look at projected stats?  That’s fine.  Your CPU will groan a little more, but it can be done.  It’s just an engineering problem.
So, let’s run a little example.  Let me take the 2007 seasonal stats and calculate K rate, BB rate, and HR rate (all per PA), and BABIP.  I kept it to those hitters who had 200 PA or more (even though I spent way too much time arguing that more than 200 PA were needed for BABIP to be reliable enough to use… I’m just illustrating here), leaving me with 341 players.  I asked my computer to give me a proximity matrix.  (Technical note: I re-scaled everything to a range of -1 to +1, which puts the four stats on a common scale so that none of them dominates the distance just because of its units.)
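The post doesn’t say exactly how the re-scaling was done, so the sketch below assumes simple min-max scaling, where the lowest value in the sample maps to -1 and the highest to +1. The `rescale` helper is a hypothetical name, and the inputs are just the three K rates from the Player A, B, and C example.

```python
def rescale(values, lo=-1.0, hi=1.0):
    """Min-max rescale a list of numbers onto the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Every player has the same value: put them all at the midpoint.
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


# K rates for Players A, B, and C from the earlier example.
k_rates = [0.150, 0.148, 0.230]
scaled_k = rescale(k_rates)

for raw, scaled in zip(k_rates, scaled_k):
    print("%.3f -> %+.3f" % (raw, scaled))
```

Each stat would be rescaled separately in this way before the distances are computed.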
Then I tried to post this matrix so that everyone could see it.  The problem is that only 256 variables can be put into an Excel file (there are 341 players here), and when I tried to post it as pure text, the file reached 578 KB in size.  Google docs has a limit of 500 KB for text files.  If anyone wants the document, just e-mail me.  I prefer to keep everything I do open-source.
To give you an idea of how it might work, though, and again only using the four stats above (more on that in a minute), let’s look at recent free agent debate-starter, Torii Hunter.  Whom, in terms of 2007 performance, did Torii most resemble?  Hunter hit a HR 4.3% of the time, struck out 15.5% of the time, walked 6.2% of the time, and had a BABIP of .306.
Top 5 matches (HR rate/K rate/BB rate/BABIP):

2. Brandon Phillips (4.3%/15.5%/4.7%/.307)
3. Alex Gonzalez (3.7%/17.4%/5.6%/.301)
4. Damion Easley (4.6%/16.1%/8.7%/.297)
5. Ryan Garko (3.9%/17.4%/6.3%/.322)
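Looking up matches like these comes down to sorting one player’s distances from smallest to largest. The sketch below does that with the rate lines quoted in this post (HR rate, K rate, BB rate, BABIP). One loud caveat: it uses the raw, un-rescaled rates for only these five players, so the ordering it produces is not expected to reproduce the ranking from the full 341-player matrix. The `top_matches` helper is a hypothetical name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two stat lines."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_matches(target, players, k=5):
    """Return the k names closest to `target`, nearest first."""
    target_stats = players[target]
    ranked = sorted(
        (squared_distance(target_stats, stats), name)
        for name, stats in players.items()
        if name != target
    )
    return [name for _, name in ranked[:k]]


# (HR rate, K rate, BB rate, BABIP) as quoted in the post.
players = {
    "Hunter": (0.043, 0.155, 0.062, 0.306),
    "Phillips": (0.043, 0.155, 0.047, 0.307),
    "Gonzalez": (0.037, 0.174, 0.056, 0.301),
    "Easley": (0.046, 0.161, 0.087, 0.297),
    "Garko": (0.039, 0.174, 0.063, 0.322),
}

matches = top_matches("Hunter", players, k=4)
print(matches)
```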

You’ll notice that none of those gentlemen are center fielders by trade, which is something that James’s system does take into account, however imprecisely.  It’s my understanding that a categorical variable (primary position) can be entered into the matrix and that can be controlled for.  (I used hierarchical clustering here… I believe handling a categorical variable would call for two-step clustering.)
Now, I picked these four stats because they were easy to calculate and they do a decent enough job of encapsulating a player’s performance over a year, and that was all I needed for a quick example.  I’m fully expecting that the careful reader out there is already thinking “But those aren’t the best 4 stats.  You need to include/take out/replace….”  And that’s fine.  In fact, I’m counting on it.  It’s an interesting question.  What suite of stats would work best in here?  What stats would fully encapsulate a player’s abilities?  In other words, when you compare a player to some other player, what type of criteria do you use to make the comparison?  Does it depend on the question you’re trying to answer?  Pitchers?  Defense?  Hmmm…

### 23 Responses to Stats 204: The proximity matrix OR Re-visioning similarity scores

1. Pizza Cutter says:

It’s a similar process, although with a different application. The trick is the weighting (in my example, by default, all the stats got equal weights, but that can be changed). As to the independence of the metrics, I completely agree there. It makes no sense to use AVG and OBP, since for a good part, they duplicate each other.

2. tangotiger says:

If you send me the file to my tangotiger.net email address (put tom in front of it), I can post it for you. I get a sizable bandwidth allocation.
I use these categories:
http://www.tangotiger.net/agepatterns.txt
(Go to the bottom of the page for the explanations.)
Basically, I make every metric independent of the other, which works great for binomial applications. Whether this is actually how to evaluate players, I don’t know. But it seems logical. This is Voros’ idea.
The way I do it is to figure out the number of SD the player is from the league mean for each metric.
I weight each metric as I see fit (usually to the requirements of the study).
And I simply add up the weighted variances.
Would this process be similar to yours?
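A minimal sketch of the process described in this comment, under some assumptions: each metric is converted to standard deviations from the mean, and “adding up the weighted variances” is read here as summing weighted squared differences in those z-scores. The three-player “league” is just Players A, B, and C from the post, and the weights are hypothetical.

```python
import statistics


def to_z_scores(column):
    """Express each value as a number of SDs from the column mean."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        raise ValueError("metric has no spread; cannot standardize")
    return [(v - mean) / sd for v in column]


def weighted_distance(z_a, z_b, weights):
    """Sum of weighted squared z-score differences between two players."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(z_a, z_b, weights))


# K rate and BB rate for Players A, B, and C from the post.
k_rates = [0.150, 0.148, 0.230]
bb_rates = [0.070, 0.071, 0.050]

z_k = to_z_scores(k_rates)
z_bb = to_z_scores(bb_rates)
profiles = list(zip(z_k, z_bb))  # one (z_K, z_BB) pair per player

weights = (2.0, 1.0)  # hypothetical: strikeouts count double
d_ab = weighted_distance(profiles[0], profiles[1], weights)
d_ac = weighted_distance(profiles[0], profiles[2], weights)
print("A-B: %.3f  A-C: %.3f" % (d_ab, d_ac))
```

With equal weights this differs from the post’s approach mainly in how each stat is put on a common scale (SDs from the mean rather than a -1 to +1 range).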

3. tangotiger says:

Agreed on the weighting. If you are interested in looking for similar players to Vince Coleman, you may insist that the speed components (3b per 2b+3b and sb per sbOpp) be weighted much more than you otherwise would, because you are really interested in the speed players mostly.
You can also “flip” the sign if you are looking for guys who are fast like Coleman, but are power hitters. So, the closer the player is to Coleman’s HR rate, the farther you will actually make him.
Controlling the weight and the sign is very powerful.
The last one, and the one that I struggle with the most, is the number of PA. You could end up with two guys being very similar, but one has a 20-yr career, and another has a 10-yr. In that case, I’m trying to figure out the best way to use this. I could simply scale the distance in performance by the distance-in-years-plus-constant.
So, if I have three guys, compared to Player1 as:
Player, diff in performance, years played
1, zero, 10
2, .500, 20
3, .600, 8
I would think that Player 3 is closer to Player 1 than Player 2 is.
Something like
2, .500*(20-10+15)
3, .600*(10-8+15)
That would put player 3 above player 2. I’m just not sure how to handle that part.
Any thoughts?
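Working through the arithmetic in that example: the performance difference is multiplied by the gap in years played plus a constant of 15, and a smaller result means a closer match. The function name below is hypothetical, and the constant is simply the one used in the example.

```python
def adjusted_distance(perf_diff, years, ref_years, constant=15):
    """Scale a performance difference by the career-length gap plus a constant."""
    return perf_diff * (abs(years - ref_years) + constant)


ref_years = 10  # Player 1, the reference player

player2 = adjusted_distance(0.500, 20, ref_years)  # .500 * (20 - 10 + 15)
player3 = adjusted_distance(0.600, 8, ref_years)   # .600 * (10 - 8 + 15)

print("Player 2: %.1f" % player2)
print("Player 3: %.1f" % player3)
```

Player 2 comes out at 12.5 and Player 3 at 10.2, so Player 3 ranks as the closer comp even though his raw performance difference is larger.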

4. dan says:

How about using ISO or PrSLG and HR% for power, speed scores for… um, speed…, K% and BB%, GB% and FB%. That’s quite a bunch, but I think it covers most of what you’d want to know about a batter.

5. Matt Souders says:

Couple things…you definitely need to make things league relative if you want to do any kind of historical comparison, and you definitely need to regress to the mean so that Damion Easley doesn’t look similar to Adrian freakin’ Beltre. But otherwise…yes, this is a better similarity method than James uses.
I use something called goodness of fit testing to rate player similarity over the course of their careers and not just over the most recent season. Goodness of fit testing allows me to see how similar a player’s progression of production is to other progressions in history…when you’ve got a player who is not very good and who suddenly has a good year, the goodness of fit test will find other players who were not very good and then suddenly had a good year as similar candidates, for example.

6. Sean Smith says:

Age is a big one, at least if you are thinking about projections at all. Use that and you won’t get Easley and Beltre on the same list.

7. Pizza Cutter says:

Tango, so long as the number of PA’s represents a reliable sample for the stats in question (see my post from last week on the subject), unless you’re specifically concerned about the length of career as a measure of performance, I wouldn’t be so much concerned with length of career/number of PA. For example, let’s say you’re using OBP or somesuch, and you’ve got two players who both have a few thousand PA, although one has 10,000, while the other has 5,000. Despite the one having half the experience, OBP is going to be pretty stable for both. (I suppose there might be the issue of aging patterns… the guy with 10,000 is likely to be a fair bit older than the 5,000… hmmmm…)
There’s also the issue of what comparisons you’re trying to make. Is this “career to equivalent age”, seasonal numbers, etc.?

8. Pizza Cutter says:

Matt, the Beltre/Easley comparison made me giggle when the computer spit that one out. For what it’s worth, you can see that their rate stats for this year alone are similar, although Easley had 218 PA this past year while Beltre had 639. Beltre’s performance will be much more stable. RTM would actually make a lot of sense in this case.

9. tangotiger says:

With respect to age, what I do is not look at a “career-to-date”, but rather a Marcel-type weighted career-to-date.
For example, I’ll take 100% of year T, 80% of year T-1, 64% of year T-2, and on and on. This way, what a guy did 10 years ago won’t have anywhere near the same weight as what he did last year.
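A small sketch of that weighting scheme: each earlier season counts 80% as much as the one after it, so the weights run 1.00, 0.80, 0.64, and so on. The season totals below are made up purely for illustration, and weighting both the events and the PA before dividing is an assumption about how the weighted rate would be formed.

```python
def decay_weights(n_seasons, factor=0.8):
    """Weights for year T, T-1, T-2, ...: 1, factor, factor**2, ..."""
    return [factor ** i for i in range(n_seasons)]


def weighted_rate(seasons, factor=0.8):
    """Weighted events per PA; `seasons` is [(events, pa), ...], newest first."""
    weights = decay_weights(len(seasons), factor)
    events = sum(w * ev for w, (ev, _) in zip(weights, seasons))
    pa = sum(w * p for w, (_, p) in zip(weights, seasons))
    return events / pa


# Hypothetical (HR, PA) totals for years T, T-1, and T-2.
seasons = [(20, 600), (10, 500), (30, 650)]

weights = decay_weights(3)
rate = weighted_rate(seasons)
print(["%.2f" % w for w in weights], "%.4f" % rate)
```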

10. Pizza Cutter says:

Makes sense for the purposes of forecasting. I could see people who want to make HOF arguments wanting everything weighted equally.

11. Zach says:

I did some stuff similar to this back when there was still a controversy about whether Bonds took PEDs. I would suggest weighting each offensive event by its linear weights run value, so that your distance metric has well defined units of runs/plate appearance in all dimensions.

12. Pizza Cutter says:

Zach, at that point, am I not basically just comparing players based on linear weights/baseruns?

13. tangotiger says:

No, because you can have two guys with the same Linear Weights (say Tim Raines and Mike Schmidt), but get there in different ways.
What Zach is suggesting is that you weight each component relative to the impact they have to generating runs. A guy can be 5’7″ and another be 6’10″, but if that has almost no bearing on the final outcome, then you can make Billy Wagner and Randy Johnson each other’s comps.

14. dan says:

Zach,
In 2006, Nick Swisher and Jose Reyes were almost identical in terms of WPA, RC/27, BRAA, SLG, and even BABIP. Clearly they didn’t achieve similar numbers in those categories playing the same style of ball.

15. Zach says:

Dan, that’s a good example problem to use.
In 2006,
Reyes: 128 1B, 30 2B, 17 3B, 19 HR, 53 BB, 54 SB and 17 CS in 700 PA
Swisher: 80 1B, 24 2B, 2 3B, 35 HR, 97 BB, 3 SB and 2 CS in 653 PA
I apply linear weights of .47 per 1B, .78 per 2B, 1.09 per 3B, 1.4 per HR, .33 per BB, .6 per CS and .5 per (AB-H)
Working this out for singles,
Reyes produces .0859 R/PA from singles
Swisher produces .0576 R/PA from singles
So the difference from that category is .0283 R/PA. I calculate the square root of the sum of the squares of the distances which result from each of the linear weights categories, and get that Reyes and Swisher are distant from one another by .06 R/PA. Over 700 PA, that works out to 42 runs. So despite the similarity in their overall production, I feel comfortable in saying that the way they achieved that production was extremely different, and they are not actually similar players at all.
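Here is a rough sketch of that calculation. It uses only the six categories for which both a weight and a count appear in this comment (1B, 2B, 3B, HR, BB, CS); stolen bases and outs are left out because no SB weight or AB/H totals are given, so the result is an approximation rather than a reproduction of the exact figure. The first stat line is taken to be Reyes’s, since it matches the .0859 R/PA singles figure. Signs on the weights don’t matter here because every difference is squared.

```python
import math

# Linear weights quoted in the comment (runs per event).
WEIGHTS = {"1B": 0.47, "2B": 0.78, "3B": 1.09, "HR": 1.40, "BB": 0.33, "CS": 0.60}

# 2006 lines as quoted in the comment.
reyes = {"PA": 700, "1B": 128, "2B": 30, "3B": 17, "HR": 19, "BB": 53, "CS": 17}
swisher = {"PA": 653, "1B": 80, "2B": 24, "3B": 2, "HR": 35, "BB": 97, "CS": 2}


def runs_per_pa(player):
    """Run value per plate appearance contributed by each event type."""
    return {ev: w * player[ev] / player["PA"] for ev, w in WEIGHTS.items()}


def style_distance(a, b):
    """Square root of the summed squared per-category R/PA differences."""
    ra, rb = runs_per_pa(a), runs_per_pa(b)
    return math.sqrt(sum((ra[ev] - rb[ev]) ** 2 for ev in WEIGHTS))


distance = style_distance(reyes, swisher)
print("singles: %.4f vs %.4f" % (runs_per_pa(reyes)["1B"], runs_per_pa(swisher)["1B"]))
print("distance: %.3f R/PA, about %.0f runs per 700 PA" % (distance, distance * 700))
```

Even with the partial category list, the distance lands close to the .06 R/PA quoted above.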

16. Zach says:

You could achieve similar results by assigning points to different categories, a la Bill James, but I think approaches like that depend on having James’s freakish feel for baseball statistics. The main selling point of the linear weights is that, by converting all of the possible offensive events to runs, you have ready-made weights that give a sensible answer, and which don’t rely on having an intuitive grasp for whether Jose Reyes is more comparable to Roberto Alomar or Edgar Renteria.

17. Zach says:

And of course, if you want to compare seasons or careers, you shouldn’t divide each category by plate appearances. Then your distance between two players will have units of runs.

18. Pizza Cutter says:

Ah ha! I now get what you were saying earlier. You’re comparing how they generated their runs, as opposed to how many runs they generated.
Tango, I wonder what it would look like if you put the Fans’ Fielding Scouting Report into this.

19. tangotiger says:

Right, you could do the same thing. I in fact do do that. You can have two fielders be an overall “50”, but one might be a fast runner, weak arm, and another might be the reverse. I take the sum of squares of each one, but weighted based on how much run impact each trait has (exactly what Zach is talking about). I do away with position, since it doesn’t matter where ARod or Erstad plays, but rather how he plays.
I did some simple math to convert the distances into a similarity score, bounded at 100.

20. Pizza Cutter says:

The suspense is killing me. What did you find?

21. Pizza Cutter says:

Tom, I believe that’s pronounced DH.

22. tangotiger says:

?? Perhaps you are not aware, but I’ve been running comps on fielders for two or three years now:
http://www.tangotiger.net/scouting/sim2007_5406.html
Click any player’s name, and you get his set of comps.
If you go to the Brewers page, and click Ryan Braun, you’ll find that he has NO similar players, fielding-wise. That is, his combination of skillset at the MLB level is unique. We have no idea where he’s better off on the field, but RF is probably the best one.

23. tangotiger says: