Stats 204: The proximity matrix OR Re-visioning similarity scores
November 27, 2007
I suppose that when Bill James invented the similarity score, it was an attempt to say “Who exactly is this guy like?” Is he the second coming of Joe DiMaggio (the power hitter who never strikes out), or is he the second coming of Dave Kingman (the power hitter who strikes out a little more often)? Maybe he’s the second coming of Tommy Hinzo. How can we tell? Mr. James put together a formula that attempted to answer exactly that question. The formula itself is based on a fairly simple system of “start with 1000” and subtract points for differences in various statistical categories. It’s not an awful system and generally produces some decent comparisons, but mathematically, we can do better than that!
Let’s pretend that there are only two stats in baseball that matter: walks and strikeouts. We might use raw numbers of BB and K, but it makes more sense to put them into rate form. We might classify players, in a very rough way, as being players who neither walk nor strike out much, players who walk and strike out a lot, players who strike out a lot but don’t walk much, etc. If we want to get more fine-grained, we can start saying medium or medium-low, etc. Or if we want to find the player whose BB and K rates match most closely, we can start digging through the data. If Player A strikes out 15% of the time and walks 7%, then Player B who strikes out 14.8% of the time and walks 7.1% is a good match. Player C who strikes out 23% of the time and walks 5% isn’t a good match. But how good a match (or how bad a non-match) is he? And what do we do when we get beyond two stats of interest? How do we account for walks, strikeouts, home runs, singles, or anything else for that matter?
Enter the proximity matrix. Let’s go back to our “walks and strikeouts only” example. We could plot walk rate and strikeout rate on a standard two-dimensional axis (graph paper), and label all the players. Then we could measure (with a ruler!) which player is the closest to any other player. That works great when there are only two variables. Three-dimensional graph paper (for three variables) is harder to come by, and by the time we get to four variables, well, now we’re into hyperspace. (Yes, I love Star Trek too.) Fortunately, mathematics isn’t bound by such constraints, and it’s possible to calculate the distance between two points in four (or more, there’s no limit) dimensions. It’s called the squared Euclidean distance: for each stat, take the difference between the two players, square it, and add up the results. In fact, we can get a matrix of how far every player in our sample is from every other player. That’s the lovely thing about computers: they do all the heavy lifting, and do it in rather short order.
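To make the distance idea concrete, here is a minimal Python sketch that applies it to Players A, B, and C from the walks-and-strikeouts example. The function name `squared_euclidean` and the variable names are my own choices for illustration, and the rates are left in raw percentage points with no re-scaling, so treat it as a sketch of the arithmetic rather than the exact calculation behind the results in this post.

```python
def squared_euclidean(p, q):
    """Sum of squared differences between two equal-length stat lines."""
    if len(p) != len(q):
        raise ValueError("both players need the same number of stats")
    return sum((a - b) ** 2 for a, b in zip(p, q))


# (K rate, BB rate) in percentage points, taken from the example above.
player_a = (15.0, 7.0)
player_b = (14.8, 7.1)
player_c = (23.0, 5.0)

d_ab = squared_euclidean(player_a, player_b)  # about 0.05: a close match
d_ac = squared_euclidean(player_a, player_c)  # 68.0: a poor match

print(round(d_ab, 4), round(d_ac, 4))
```

The same function works unchanged for three, four, or forty stats, which is the whole point: the ruler stops working past two dimensions, but the sum of squared differences does not.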
And we can use whatever criteria or stats are of interest. Want to look at player height and weight? Want to look at career OBP and SLG and do it up to age 29? Want to include every major leaguer ever? Want to look at projected stats? That’s fine. Your CPU will groan a little more, but it can be done. It’s just an engineering problem.
So, let’s run a little example. Let me take the 2007 seasonal stats and calculate K rate, BB rate, and HR rate (all per PA), and BABIP. I kept it to those hitters who had 200 PA or more (even though I spent way too much time arguing that more than 200 PA were needed for BABIP to be reliable enough to use… I’m just illustrating here), leaving me with 341 players. I asked my computer to give me a proximity matrix. (Technical note: I re-scaled everything to a range of -1 to +1, which keeps any one stat from dominating the distance simply because it is measured on a bigger scale.)
Then I tried to post this matrix so that everyone could see it. The problem is that an Excel sheet can hold only 256 columns (there are 341 players here), and when I tried to post it as pure text, the file reached 578 KB in size. Google Docs has a limit of 500 KB for text files. If anyone wants the document, just e-mail me. I prefer to keep everything I do open-source.
To give you an idea of how it might work, though, and again only using the four stats above (more on that in a minute), let’s look at recent free agent debate-starter, Torii Hunter. Whom, in terms of 2007 performance, did Torii most resemble? Hunter hit a HR 4.3% of the time, struck out 15.5% of the time, walked 6.2% of the time, and had a BABIP of .306.
Top 5 matches:
- Adrian Beltre (4.1%/16.3%/5.9%/.297)
- Brandon Phillips (4.3%/15.5%/4.7%/.307)
- Alex Gonzalez (3.7%/17.4%/5.6%/.301)
- Damion Easley (4.6%/16.1%/8.7%/.297)
- Ryan Garko (3.9%/17.4%/6.3%/.322)
You’ll notice that none of those gentlemen are center fielders by trade, which is something that James’s system does take into account, however imprecisely. It’s my understanding that a categorical variable (primary position) can be entered into the matrix and that can be controlled for. (I used hierarchical clustering here… I believe adding a categorical variable would call for two-step clustering.)
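For anyone who wants to tinker, the Hunter lookup can be sketched in a few lines of Python. This is a rough sketch under some loud assumptions: it uses only the six stat lines quoted above rather than the full 341-player sample, the function names (`rescale`, `proximity_matrix`, `closest`) are hypothetical, and the re-scaling to -1 to +1 is a simple min-max over those six players. Because the scaling depends on who is in the sample, the distances and ordering it prints should not be expected to match the list above.

```python
def rescale(rows):
    """Min-max rescale each stat (column) to the range -1 to +1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            # A stat with no spread carries no information; map it to 0.
            2.0 * (value - lo) / (hi - lo) - 1.0 if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def squared_euclidean(p, q):
    """Sum of squared differences between two equal-length stat lines."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def proximity_matrix(rows):
    """Distance from every player to every other player, after rescaling."""
    scaled = rescale(rows)
    return [[squared_euclidean(p, q) for q in scaled] for p in scaled]


def closest(names, matrix, target, k=5):
    """Return the k players nearest to target as (name, distance) pairs."""
    row = matrix[names.index(target)]
    ranked = sorted((dist, name) for name, dist in zip(names, row) if name != target)
    return [(name, dist) for dist, name in ranked[:k]]


# (HR%, K%, BB%, BABIP) as quoted in this post. Six players only (assumption).
PLAYERS = {
    "Torii Hunter": (4.3, 15.5, 6.2, 0.306),
    "Adrian Beltre": (4.1, 16.3, 5.9, 0.297),
    "Brandon Phillips": (4.3, 15.5, 4.7, 0.307),
    "Alex Gonzalez": (3.7, 17.4, 5.6, 0.301),
    "Damion Easley": (4.6, 16.1, 8.7, 0.297),
    "Ryan Garko": (3.9, 17.4, 6.3, 0.322),
}

names = list(PLAYERS)
matrix = proximity_matrix([PLAYERS[name] for name in names])
matches = closest(names, matrix, "Torii Hunter")

for name, dist in matches:
    print(f"{name}: {dist:.3f}")
```

Swapping in a different set of stats is just a matter of changing the tuples, which makes it easy to experiment with which suite of stats gives the most sensible comparisons.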
Now, I picked these four stats because they were easy to calculate and they do a decent enough job of encapsulating a player’s performance over a year, and that was all I needed for a quick example. I’m fully expecting that the careful reader out there is already thinking “But those aren’t the best 4 stats. You need to include/take out/replace….” And that’s fine. In fact, I’m counting on it. It’s an interesting question. What suite of stats would work best in here? What stats would fully encapsulate a player’s abilities? In other words, when you compare a player to some other player, what type of criteria do you use to make the comparison? Does it depend on the question you’re trying to answer? Pitchers? Defense? Hmmm…