## The measure of a man, Part IV

Some time ago, I introduced the concept of the proximity matrix as a possible way to generate similarity scores.  A proximity matrix is a nice side effect of a method called cluster analysis, which is a way to determine which two entries in a data set are most similar to one another.  It’s actually the way that a lot of dating sites work.  They measure several personality characteristics about you, and then look for the people who are closest to you in the proximity matrix.

At the time that I wrote the original proximity matrix piece, I used silly stats like AVG and RBI just to demonstrate how it worked.  The idea goes something like this.  Suppose that I had a data set in which there were only two factors that I cared about, power and speed.  There are some players who have a lot of power but no speed.  There are players who are the reverse.  There are some who are kinda middling on both.  You can imagine a two-axis graph in which power and speed (however you want to define them) are graphed.  We might be able to draw circles around a group of players who fit into a “type” (high speed/low power; high speed/medium power, etc.)  Of course, the bigger the circle, the less specific the “type”, but the smaller the circles, the more groups you have to deal with, until it becomes unwieldy.  The goal is to find a happy medium between a few big, but heterogeneous groups and a bunch of small, but more specific groups.

How to tell who goes together?  Well, on a two-dimensional graph, you only need the Pythagorean theorem (the actual one, not the Bill James one) to determine the distance between any two points.  But mathematically, you can add as many dimensions as you want, and you can use whatever variables you like.  It works on something called squared Euclidean distance.  But with what set of variables shall we measure a man?  Oh right… (part I, part II, part III)

The four-variable structure that I’ve created (Ichiro-Howard, contact, risk, solid contact) is particularly suited to the rigors of the proximity matrix.  The numbers are engineered to be stable, which means these are actually skills (in theory, anyway!).  I also designed them to be orthogonal (not correlated) to one another.  We are getting a read on four genuinely independent skills, as opposed to measuring HR and RBI, which are correlated with one another.  Plus, they’re already set up so that they have the same mean and standard deviation.
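As a quick sketch of how the proximity matrix gets built, the toy example below computes squared Euclidean distances over standardized factor scores. The three rows use rounded 2008 values that appear in the tables below; this is an illustration of the distance idea, not the full matrix over every qualified hitter.

```python
import numpy as np

# Rounded 2008 factor scores (mean 0, SD 1): Ichiro-Howard, contact, risk, solid contact.
players = ["Pujols", "A-Rod", "Hairston"]
scores = np.array([
    [-0.733,  1.476, -0.712, 1.720],
    [ 0.000, -1.004,  0.026, 1.422],
    [-0.765,  1.678, -0.197, 1.843],
])

# Squared Euclidean distance between every pair of rows -- the proximity matrix.
diff = scores[:, None, :] - scores[None, :, :]
dist = (diff ** 2).sum(axis=2)

# Pujols's nearest neighbor, ignoring himself.
dist[0, 0] = np.inf
print(players[int(dist[0].argmin())])
```

With every qualified hitter included, sorting a player’s row of the matrix ascending yields his similarity list.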

So let’s dive in, shall we?  Let’s start at the top.  Who, in 2008, had the profile closest to Albert Pujols?  I bet a list of names just came into your head, most likely starting with Alex Rodriguez*.  The connection is obvious.  Pujols’s performance is most similar to Rodriguez’s*, if only for the fact that they are currently the two best hitters in the game.  But that’s not what these similarity scores are about.

Consider:

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| Pujols | -.73324 | 1.47552 | -.71220 | 1.72005 |
| A-Roid* | .00000 (sic) | -1.00445 | .02606 | 1.42174 |

(All numbers 2008; mean = 0, SD = 1)

Pujols clearly favors the fly ball with little speed (the Howard end), while A-Rod* is actually perfectly balanced between the two extremes.  Pujols has excellent contact skills, while A-Rod is a standard deviation below the mean.  Pujols is a more reluctant swinger and doesn’t seem to take many risky swings.  A-Rod is about league average on that one.  Both are good at hitting the ball a long way.  We generally only look at that last sentence, but it’s clear that they are two different types of hitters.

So who’s actually similar to Pujols?  The top 5 are listed below, in order of similarity.

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| A. Pujols | -.73324 | 1.47552 | -.71220 | 1.72005 |
| J. Hairston | -.76531 | 1.67798 | -.19744 | 1.84334 |
| B. Giles | -.37557 | 1.62536 | -.73731 | 1.26229 |
| S. Casey | -.58657 | .93159 | -.64673 | 2.52618 |
| A. Ethier | -.38172 | 1.25153 | -.15381 | .98470 |
| I. Kinsler | -.92335 | .74409 | -.21759 | 1.22831 |

Jerry Hairston?  For what it’s worth, Hairston hit .326/.384/.487 in 2008, although with some major platoon splits.  His solid contact number, which I’ve previously found is the least stable number of the four factors, jumped from -1.60 to 1.84 from 2007 to 2008.  The year before that (2006), he’d been at -1.59.  Hairston is an outlier who had a very lucky year.  Casey is something of a slow plodder, and is clearly half a standard deviation behind Pujols in contact.  His big number on solid contact comes from a high LD% over only 100-some PA’s last year.  Giles is less of a big fly hitter and doesn’t make as good solid contact.  However, he prefers to sit back and wait (like Pujols does) and prefers a big flyball style (although less so).  In other words, Brian Giles is a poor man’s Albert Pujols.  A very poor man.

Ethier is in 4th place, and you might notice that he’s half a standard deviation behind Pujols in three categories.  Kinsler has the same problem.  Ethier might have a leg up, though.  His contact skills will likely get better with age, and he will take fewer risks as he ages, bringing him more in line with Pujols.  So if there’s someone who’s got a chance to fit the Pujols mold, it’s Ethier, at least according to this.  But, may I offer an alternate conclusion?  There’s no one quite like Albert Pujols!

What about a guy like Evan Longoria?  His top 5 comparables from last year: Nelson Cruz, A-Rod*, John Bowker, Brad Hawpe, and Fernando Tatis.  Tatis had something of a lightning-in-a-bottle year, but that’s not a bad list of comparables.  Good power, lots of K’s… makes sense.

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| E. Longoria | -.44246 | -1.07672 | .20784 | 1.80886 |
| N. Cruz | -.26621 | -.85020 | .60302 | 1.47288 |
| A. Rodriguez* | .00000 | -1.00445 | .02606 | 1.42174 |
| J. Bowker | -.19865 | -.52904 | .02202 | 1.46338 |
| B. Hawpe | -.44018 | -.84058 | .89830 | 1.68557 |
| F. Tatis | .17184 | -.66749 | .23918 | 1.65625 |

Let’s do one more.  Raul Ibanez.  His Top 5.

| Player | Ichiro-Howard | Contact | Risk | Solid Contact |
|---|---|---|---|---|
| R. Ibanez | -.80166 | .62130 | .24022 | .67452 |
| C. Coste | -.74362 | .22284 | .12103 | .95755 |
| J. Morneau | -.85034 | .91887 | .55823 | .17808 |
| S. Rolen | -1.39143 | .86767 | .40410 | .54186 |
| J. Kent | -.46074 | .88277 | .81364 | .59200 |
| I. Kinsler | -.92335 | .74409 | -.21759 | 1.22831 |

Again, no perfect matches, but a bunch of guys who like fly balls, but also make a lot of contact, which suppresses power.  These guys hit a lot of doubles.

Voila!  A good solid way to compare hitters to one another that doesn’t involve hazy qualitative judgments of player abilities.

## A call for clinical Sabermetrics

OK, we get it.  We now know (or at least have half a dozen different contenders for) the proper definition of a replacement player and what the value above that replacement player is in wins, dollars, yen, and quatloos.  We know that multiple-Gold Glove award winner Derek Jeter is actually not a good shortstop.  People who have never been to this blog are using OBP properly in a sentence.  We have uber-stats (plural).  We have un-masked Torii Hunter as a fraud, figured out that Mark Ellis is pretty good, and even discovered that Albert Pujols is a halfway decent baseball player.  We’ve gotten to the point where we can describe a player’s abilities on an array of factors, and we’re pretty good at it.

Now what?

This isn’t a post to say that “there’s nothing more to be discovered in terms of figuring out things about baseball.”  Far from it.  In fact, there is plenty more to look into, and I’ll bet that there are some really interesting findings lurking around the corner.  A year (two? three?) from now, we’ll have new toys that we hadn’t even imagined before.  And two or three of them will be really super cool.  And I’ll spend my free time thinking about them.  And then someone will come up with something else.  Maybe it’ll even be me.

The point is that we’ve really only fought half the battle.  There’s another frontier in Sabermetrics that has only been lightly explored.  I’m a clinical psychologist by training, and there are two parts to my job (three if you count the endless paperwork): diagnosing a problem and treating it.  I would argue that we, as Sabermetricians, are fairly good at the diagnosis part.  We can pick out flaws or strengths in a player’s game that the general public, and maybe even the baseball insiders, may not pick up on.  But what difference does it make to know that if the conclusions won’t be turned into results?

Here, I’m not so much talking about recommendations like “The Indians should sign this guy!” or “Johnson is a steal at that price!”  Those are good recommendations to be sure, but not what I had in mind.  Here I’m thinking about finding out ways that we can change the players themselves.  Most of the recommendations in Sabermetrics up to this time have been around which players to avoid and which are under-valued.  But that requires signing new players and finding someone gullible enough to take the over-valued guy off your hands.  They’re personnel moves.  It’s just diagnosing the sick and quarantining them.  What about working with the guys you already have?  Can we use Sabermetrics to actually change individual players?

For some things, maybe not.  We probably aren’t going to make Frank Thomas run like Willy Taveras (nor will Willy Taveras ever hit like Frank Thomas).  And no, Jamie Moyer will never throw a 95 mph fastball.  (Something about a silk purse and a sow’s ear…)  We… the royal “we”… can’t change the physical characteristics of a player.  But we can change a player’s mind.

Consider the now-famous post on U.S.S. Mariner concerning Felix Hernandez’s pitch selection.  To simply say that King Felix likes to throw a lot of fastballs early in the game is descriptive (and true).  To point out that hitters were eventually going to pick up on it is to change the pitcher himself.  Now, his past behavior doesn’t predict his future behavior, because of the awareness of the past behavior itself.  I have to wonder how many other pitchers fall into patterns (fastball-then-slider) without thinking about it, patterns that could be uncovered with just a little sleuthing through the data.  Make a pitcher aware of his pattern, and you break the pattern.  Suddenly, he’s a different pitcher.  If you know the answer to the question, it changes the question.
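The sleuthing itself doesn’t require anything fancy. As a toy sketch (the pitch log below is invented, not real pitch data), simply tallying pitch-to-pitch transitions is enough to surface a pattern:

```python
from collections import Counter

# Invented pitch log for one pitcher, in the order thrown
# (FB = fastball, SL = slider, CH = changeup).
pitches = ["FB", "SL", "FB", "FB", "SL", "FB", "SL", "CH", "FB", "SL"]

# Tally how often each pitch type follows each other pitch type.
transitions = Counter(zip(pitches, pitches[1:]))

fb_total = sum(n for (prev, _), n in transitions.items() if prev == "FB")
fb_to_sl = transitions[("FB", "SL")]
print(f"after a fastball: slider {fb_to_sl} of {fb_total} times")
```

The same tally run over a season of real pitch data, split by count or inning, is exactly the kind of thing a hitter (or a pitching coach) could exploit.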

If there’s one mistake that the Sabermetric movement has made over the past few years (perhaps not intentionally, but certainly, it’s been made), it’s that we’ve reduced players to glorified, if quite advanced, Strat-o-matic cards.  Perhaps the craze around “clutch hitting” a few years back stunted our growth.  The evidence said that clutch hitting didn’t exist, and we over-extended the findings and either started denying that baseball players could be affected by psychological variables or simply stopped researching things like that.  Clutch hitting may not really exist (or more properly, the evidence says that it exists, but is a very minor part of the overall equation), but does that mean that no psychological factors might be in play?  Human beings learn from experience, but our models don’t do a good job of taking that into account.  Human beings are prone to being affected by their emotions, perhaps not in the ways that we immediately think of (that’s why we need empirical data… hence the clutch hitting debate), but to suggest that players are immune to emotional and psychological concerns would be to suggest that players are robots.  Sabermetrics hasn’t “gone there” very much, at least not yet.

My call is for a clinical Sabermetrics, one that goes beyond simply seeing probabilities as fixed and sees players as being affected by context.  I want to get inside the mind of a major league player.  It may seem impossible, but as a trained clinician, it’s often fairly easy to figure out what someone’s thinking from their actions.  And if we know that players, either individually or as a whole, are given to certain psychological patterns, we can either encourage them or intervene to stop them (as the case calls for.)  Maybe the effects are small, but then again, maybe there’s something big lurking out there.

## Does PECOTA overestimate the batting averages for fast players?


There are a number of projection systems out there for predicting player performance.  All of them are pretty good.  They all make claims of superiority from time to time, but the clear consensus is that there is no consensus.  In some ways, PECOTA could be considered the best, but CHONE, ZiPS, Marcel, and many others have their strengths.  As I was looking through the projections for this year, I also wondered what the systems’ weaknesses were.  One thing that I noticed was how high some of the batting averages were for speedy players in the PECOTA system.  This year, PECOTA projects batting averages for Jose Reyes, Jimmy Rollins, and Hanley Ramirez that are more than ten points higher than ZiPS and CHONE.

I decided to look at this in a more scientific way.  I went through the PECOTA projections for 2006-2008 for the 832 players who managed 300 PA during those years.  I calculated how far each player’s batting average exceeded his PECOTA projection.  I wanted to compare this to his Speed Score as listed by Baseball Prospectus in each projection.  I figured that if I simply ran this regression without a control for PECOTA overestimating a player’s skill, there would be a bias (players whose speed PECOTA overestimated would have averages below their PECOTA projection).  So I developed a control: the difference between a player’s actual stolen base total and PECOTA’s stolen base estimate.  This should allow me to isolate whether PECOTA overestimates batting averages for speedsters, controlling for whether it accurately estimates the players’ speeds.  Here are the results:

Obs = 832; F(4,827) = 9.53; Prob > F = 0; R-sq = 0.0441; Adj R-sq = 0.0394; RMSE = 0.02549

| Source | SS | df | MS |
|---|---|---|---|
| Model | 0.02476 | 4 | 0.00619 |
| Residual | 0.537231 | 827 | 0.00065 |
| Total | 0.56199 | 831 | 0.000676 |

| avg-PECavg | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
|---|---|---|---|---|---|---|
| sb-PECsb | 0.000597 | 0.000131 | 4.57 | 0 | 0.00034 | 0.000854 |
| pspdtop4th | -0.00364 | 0.002009 | -1.81 | 0.07 | -0.00758 | 0.000302 |
| yr06 | 0.005326 | 0.002177 | 2.45 | 0.015 | 0.001053 | 0.0096 |
| yr07 | -0.0028 | 0.002154 | -1.3 | 0.195 | -0.00702 | 0.001433 |
| _cons | 0.002151 | 0.001604 | 1.34 | 0.18 | -0.001 | 0.005299 |

(avg-PECavg): batting average minus PECOTA’s projected batting average

(sb-PECsb): stolen bases minus PECOTA’s projected stolen bases

(pspdtop4th): indicator equal to 1 if the player’s speed score was in the top quarter of speed scores that year (speed scores are measured on a different scale each year)

(yr06, yr07): indicators equal to 1 if the year was 2006 or 2007, to control for measurement bias by year.
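To make the specification concrete, here is a sketch of the same regression run on synthetic data. The real PECOTA projections are proprietary, so every number and variable name below is invented; only the structure of the model matches the one described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Invented player-seasons standing in for the proprietary PECOTA sample.
sb_diff = rng.normal(0, 8, n)                      # SB minus projected SB
spd_top4th = rng.integers(0, 2, n).astype(float)   # 1 if speed score in top quarter
years = rng.choice([2006, 2007, 2008], n)
yr06 = (years == 2006).astype(float)
yr07 = (years == 2007).astype(float)

# AVG minus projected AVG, built here with a small speed penalty plus noise.
avg_diff = 0.0006 * sb_diff - 0.004 * spd_top4th + rng.normal(0, 0.025, n)

# OLS: avg_diff ~ const + sb_diff + spd_top4th + yr06 + yr07
X = np.column_stack([np.ones(n), sb_diff, spd_top4th, yr06, yr07])
beta, *_ = np.linalg.lstsq(X, avg_diff, rcond=None)
print(dict(zip(["_cons", "sb-PECsb", "pspdtop4th", "yr06", "yr07"], beta.round(5))))
```

Anyone with access to the projections could drop the real columns into the same design matrix.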

This is weakly statistically significant, and indicates that PECOTA does in fact overrate speedsters.

I did specifically pick the regression that looked best to show, but for the sake of completeness, here is the regression with the number of standard deviations above the mean for speed score, denoted “pspdz”, as a regressor.  This is less significant, since it seems that PECOTA does not do a better job of projecting slow players than players with average speed.

Obs = 832; F(4,827) = 9.36; Prob > F = 0; R-sq = 0.0433; Adj R-sq = 0.0387; RMSE = 0.0255

| Source | SS | df | MS |
|---|---|---|---|
| Model | 0.024347 | 4 | 0.006087 |
| Residual | 0.537643 | 827 | 0.00065 |
| Total | 0.56199 | 831 | 0.000676 |

| avg-PECavg | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
|---|---|---|---|---|---|---|
| sb-PECsb | 0.000599 | 0.000131 | 4.57 | 0 | 0.000342 | 0.000856 |
| pspdz | -0.00145 | 0.000891 | -1.63 | 0.104 | -0.0032 | 0.000299 |
| yr06 | 0.005156 | 0.002177 | 2.37 | 0.018 | 0.000883 | 0.009428 |
| yr07 | -0.00279 | 0.002155 | -1.29 | 0.196 | -0.00702 | 0.001443 |
| _cons | 0.001231 | 0.001521 | 0.81 | 0.418 | -0.00175 | 0.004217 |

Here, “pspdz” is not quite significant, but it is not far off.  Since “pspdz” (the number of standard deviations the speed score is above the mean for that year) is not distributed the same way each year, this is likely not a perfect measurement, and perhaps this is why.

Clearly, model specification is an issue, but I am afraid to distribute my data since PECOTA projections are proprietary (and I assume historical ones are as well).  For the sake of transparency, however, I will run alternative models using the PECOTA data for anyone who requests them by comment or email.

Moving on to 2009, I decided to compare the top 26 base-stealers as projected by PECOTA (speed score is not listed for 2009 PECOTA projections) against the CHONE and ZiPS projections.  I dropped the players who did not have any significant amount of major league experience.  Then I did the same thing for the top 26 home run hitters as projected by PECOTA, again comparing those to the CHONE, ZiPS, and Marcel projections.  Sure enough, PECOTA projected the batting averages for the speedy players higher than CHONE, ZiPS, and Marcel, but not for the home run hitters.

I would paste in the table here, but again, since PECOTA’s projections are proprietary, I will only summarize the results.

For the 26 speedsters, PECOTA was the highest of the four systems for 14 of them.  It was the second highest for 2 of them, third highest for 2 of them, and the lowest for 8 of them.  For the 26 sluggers, PECOTA was the highest for 7, tied for the highest for 4 of them, second highest for 1 of them, third highest for 5 of them, and the lowest for 9 of them.  It estimated a batting average ten points higher than the average of CHONE, ZiPS, and Marcel for 8 speedsters, but for only 5 sluggers (2 of whom were Beltran and Hanley Ramirez, also speedsters).

The 8 speedsters that it was the highest for were: Jose Reyes, Jimmy Rollins, Hanley Ramirez, Michael Bourn, Carlos Gomez, Brandon Phillips, Rickie Weeks, and Nate McLouth.  It was also pretty high on Willy Taveras, Shane Victorino, Juan Pierre, and Corey Hart.

I would be cautious about trusting PECOTA on these guys.  It does seem that PECOTA does indeed overestimate these hitters by a bit.  By the regression estimate, it looks like fast players may get an exaggerated batting average boost of about 4 points.  I would guess that each of the projection systems has its weaknesses on certain players.  If it were possible to determine which types of hitters were better projected by different systems, I think that would be extremely useful to know.

## The measure of a man, part III

For those who have been following this series, my goal has been to develop a small number of orthogonal (non-correlated) measures that will adequately encapsulate a batter’s offensive talents.  In part one, I explained how I found four such factors, and how they are derived through a logical flowchart and a factor analytic approach.  In part two, I showed that the four factors are actually useful in predicting player typologies.  In part three, let’s look at whether they are stable and how they change over time.

My hope was that the factors that I created would be stable over time.  I took the original ten factors (strike zone sensitivity, response bias, contact rate, LD/FB/GB percentages, 0 and 1 strike fouls per PA, 2 strike fouls per 2 strike PA, speed score, and power score) from 2003-2008 and put them in a factor analysis.  The same four factors shook out with basically the same loadings.

Getting consistency measures should be easy enough.  I did my usual AR(1) intraclass correlation over four years’ worth of data (2005-2008).  Things were going great (Ichiro-Howard: .77, contact: .79, risk: .79) until I got to solid contact (.40?).  I specifically built these out of things that showed good reliability.  How did .40 happen?  The two variables that load heavily on “solid contact” are LD rate (ICC = .31) and power score (ICC = .55).  I had previously found them to be more reliable than that.  There are some well-documented problems with how Retrosheet classifies line drives, which may be playing a part here, but I’m not sure what happened.
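For readers who haven’t computed one before, the basic ICC is easy to sketch. This toy example uses made-up factor scores and the plain one-way ICC(1), rather than the AR(1) flavor mentioned above, but it shows the core idea: comparing between-player variance to within-player variance.

```python
import numpy as np

# Toy data: rows are players, columns are seasons (2005-2008) of one factor score.
data = np.array([
    [ 1.2,  1.0,  1.3,  1.1],
    [-0.5, -0.8, -0.4, -0.6],
    [ 0.1,  0.3, -0.1,  0.2],
    [-1.0, -0.9, -1.2, -1.1],
])
k = data.shape[1]

# One-way random-effects ICC: share of variance between players vs. within players.
grand = data.mean()
ms_between = k * ((data.mean(axis=1) - grand) ** 2).sum() / (data.shape[0] - 1)
ms_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum() / (data.shape[0] * (k - 1))
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc, 2))
```

Values near 1 mean a player’s score is mostly signal from year to year; values near 0 mean it is mostly noise.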

Now, .40 isn’t a horrible ICC, but it isn’t a great one either.  Let’s assume that it’s a true finding.  It means that making good contact over the course of a year is more noise than signal.  When you think of it, we are dealing with trying to hit a small ball traveling at 85-90 mph (sometimes more) with a stick that’s only a couple of inches in diameter, and at that, the stick itself is traveling at a high rate of speed.  I suppose that the angle at which the ball comes off the bat and where it goes is bound to have some element of chance in it.

Onward to the aging patterns.  Actually, they proceed in much the way that you might imagine.  For example, younger players score higher on risk (they swing more, have more foul balls for strikes one and two, and make less contact).  Older players are more likely to be on the slow flyball hitter end of the Ichiro-Howard spectrum and to be better contact hitters.  Solid contact vibrated all over the place with no pattern.  I looked at them using a simple mean graph by age at first (like the one below for contact), but that’s a flawed method.

Any time you do aging studies, there are a bunch of confounding variables to consider.  A player who is still playing at 36 (and surely collecting free-agent-level dollars) is a different sort of player than the player who is not playing at 36.  He’s probably a pretty good player to begin with.  How to model this growth curve while at the same time controlling for the fact that our survivors at 36 probably had some pretty good skills to begin with?  Through a process called mixed linear modeling (MLM).  Actually, intra-class correlation is one part of MLM.  When I do ICC, I’m finding out how much of the variance is accounted for by the player’s own growth curve.  That variance component is the control mechanism for within-subject effects in MLM.

If you control for the within-subjects effects (i.e., the batter himself), you can set up a regression where what you get is the average effect of being a certain age (gory details: enter age as a factor, set to fixed effect, set intercept to random effect).  In theory, I could do this with any stat.  What it does is give the average effect of being X years old (I used April 1st… roughly Opening Day… age for this one) on that particular stat.
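A stripped-down version of that setup can be sketched with within-player demeaning standing in for the random intercept. The data below is invented, and this crude fixed-effects shortcut is not the full MLM, but it shows the logic of "control for the batter himself, then average by age":

```python
import pandas as pd

# Toy player-seasons: three players, each observed at ages 26-28.
df = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "age":    [ 26,  27,  28,  26,  27,  28,  26,  27,  28],
    "score":  [0.2, 0.1, 0.4, -0.6, -0.8, -0.4, 1.0, 0.9, 1.2],
})

# Subtract each player's own mean (the control for the batter himself),
# then average the leftover by age: the average effect of being that age.
df["resid"] = df["score"] - df.groupby("player")["score"].transform("mean")
age_effect = df.groupby("age")["resid"].mean()
print(age_effect)
```

A proper implementation would fit the random intercept directly (e.g., something like statsmodels’ MixedLM), but the interpretation of the age coefficients is the same.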

Let’s take a look at the coefficients for each age (I only went from 23 to 39).  Remember that all four factors have a mean of zero and a standard deviation of 1.


You have to read those numbers relative to one another.  The average effect of being 27 years old is to be .1688 points (remember mean = 0, SD = 1) below whatever your underlying skill is, which we’ve already controlled for.  The effect of going from 27 to 28 (Happy Birthday!) is to go from -.1688 to -.0742, or roughly .09 points.  That’s an average effect after controlling for talent.  Say that again to yourselves.  It’s an average effect of age after controlling for talent.  So we can more accurately describe the aging process on these (or any) skills using this method.  It’s a nice little way around the selection bias problem, although the catch is that it’s a regression-based method and just shoots through the middle.  It’s possible that different types of players age in different ways and we would have no way to know that using this method.

The thing about three of the four skills (Ichiro-Howard, contact, and risk) is that the aging curves proceed in a fairly linear fashion.  As players age, they become more fly-ball based hitters, who make better contact, and swing less.  Solid contact bobbles all over the place, and seems to be a little bit more random than the others.  In part II, we saw that there were some benefits to being a flyball hitter (more HR) and some to being a groundball hitter (more singles).  Contact was a double-edged sword.  If you make contact, you probably won’t strike out as much, but you purchase that at the cost of some power.  Risk-taking is always a game best played in moderation.  Maybe the reason that players “peak” around their mid-to-late 20s is that this is when the skills reach the point where they are best balanced.

The aging profile for contact (see above) shows that the “age” effects are at their smallest absolute value (so it’s really the player’s true talent shining through) between 28-31.  For Ichiro-Howard, it’s 27-29.  The age effects on risk are pretty low to begin with (except for really young and really “old” players), but they are at their lowest from 27-29.  Presumably, major leaguers are scouted, groomed, and brought up because some scout believed that they had good skills.  The late 20s are when players have those skills the least tainted by age.

So we have four orthogonal factors which are well constructed and show pretty good evidence of reliability and construct validity.  Next time, we’ll discuss how we can take these factors and produce similarity scores that actually make sense.

## Creating A Retrosheet Database, Part I

Okay, I’m presuming that you’ve got an SQL database set up along the lines of the sabermetrician’s workbench I wrote a tutorial on. If you don’t, go do that and come back. I’m also assuming that you run Windows of some sort. I know almost nothing about getting DOS programs to work on other platforms and I know of no tools to parse the event files for any other platform. This is adapted from work by Tom Tango and Mat Kovach.

You are also going to be required to unzip files. Windows XP and later comes with the ability to unzip files, but I use 7-Zip. It’s not necessary, but it will save you a lot of time later.

Warning up front: this may be very time consuming. If your computer is slow, you may be best off running the loaders overnight and then checking on them the next morning.

Start off setting up a directory structure to house the necessary files. Make sure you have a lot of hard drive space free – I’d suggest at least 15 GB, maybe more. Here’s the directory structure I use:

• C:\Retrosheet
• C:\Retrosheet\Data\
• C:\Retrosheet\Data\zipped\
• C:\Retrosheet\Data\unzipped\
• C:\Retrosheet\Data\parsed\
• C:\Retrosheet\common\
• C:\Retrosheet\common\programs\

There’s more, but that will do for now. If you would prefer to do it another way, that’s fine, but then you’re responsible for changing, oh, 200+ lines of batch processing and loader code.


Now for the grunt work. Retrosheet maintains a page of all of the event files. Each year has its own page (before 1997, in fact, it was broken down by league.) On each page you will find a link that says “Entire _______ League.” Download that file into your C:\Retrosheet\Data\zipped\ folder. Yes, this is time consuming. Yes, it probably could be automated, although the process for doing so has eluded me. I only used the files from 1953 onward.

Now we want to unzip these files. If you downloaded 7-Zip, you can simply open the zipped folder, highlight all the zip files, and right click on them all. There should be a menu that says 7-Zip, which will have an option on it called “Extract Files…” You want to unzip them into the C:\Retrosheet\Data\unzipped\ folder.

Download RetroSQL_Loaders.zip and unpack the files inside. There will be three batch files - $cwevent.bat, $cwgame.bat, and $cwsub.bat. Place those in the same folder as the extracted files and run them. This will be time consuming, but you should be able to use your computer normally while they run.

Now, take the other files and put them into C:\Retrosheet\loaders\. Use SQLyog to load the files – the keyboard shortcut Ctrl-Shift-Q will let you execute SQL commands from a file, or right-click in the left-hand pane and select “Restore From SQL Dump.” Run the one called retrosheet_tables.sql first – it doesn’t matter what database you select when running it, as it will create a database called “retrosheet” itself. (Warning – if you already have a database called that, be very careful, as you might lose tables.)

The other three files are loaders for the parsed files you just made. These can take a long time, and may tie up your computer. I suggest running the events query overnight.

These loader queries will populate backup tables for the games and events tables, not the actual tables themselves. Why? Because in order to make these tables work more efficiently, we want to partition them into smaller parts first. (These tables can be huge, and queries on them can be very slow.) To transfer the data from the backups into the real tables, use:

```sql
UPDATE games_bck
SET YEAR_ID = SUBSTR(GAME_ID,4,4);

UPDATE events_bck
SET YEAR_ID = SUBSTR(GAME_ID,4,4);

INSERT INTO games
SELECT * FROM games_bck;

INSERT INTO events
SELECT * FROM events_bck;
```

(Again, warning: these queries can take a loooooong time.)

And that should give you a fully armed and operational play-by-play database spanning over 50 years of baseball. Now, what to do with it? I’ve pulled some code I wrote a while back for you to play with and get an idea of what you can do with the SQL database. It creates yearly run expectancy tables:

```sql
CREATE TABLE re_zero
AS
SELECT    YEAR_ID
    , OUTS_CT
    , SUM(IF(BAT_FATE_ID>3,1,0))/SUM(IF(BAT_EVENT_FL = "T",1,0)) AS BAT_RE
    , SUM(IF(RUN1_FATE_ID>3,1,0))/SUM(IF(RUN1_ORIGIN_EVENT_ID > 0,1,0)) AS RUN1_RE
    , SUM(IF(RUN2_FATE_ID>3,1,0))/SUM(IF(RUN2_ORIGIN_EVENT_ID > 0,1,0)) AS RUN2_RE
    , SUM(IF(RUN3_FATE_ID>3,1,0))/SUM(IF(RUN3_ORIGIN_EVENT_ID > 0,1,0)) AS RUN3_RE
    , SUM(FATE_RUNS_CT + EVENT_RUNS_CT - IF(RUN3_FATE_ID>3,1,0) - IF(RUN2_FATE_ID>3,1,0) - IF(RUN1_FATE_ID>3,1,0) - IF(BAT_FATE_ID>3,1,0))/COUNT(1) AS FATE_RE
FROM retrosheet.events_copy e
GROUP BY YEAR_ID, OUTS_CT;

CREATE TABLE BASES_CD AS
SELECT DISTINCT START_BASES_CD
    , IF(RUN1_ORIGIN_EVENT_ID > 0,1,0) AS RUN1
    , IF(RUN2_ORIGIN_EVENT_ID > 0,1,0) AS RUN2
    , IF(RUN3_ORIGIN_EVENT_ID > 0,1,0) AS RUN3
FROM retrosheet.events_copy
WHERE YEAR_ID = 2008
ORDER BY START_BASES_CD;

CREATE TABLE RE_TEMP AS
SELECT YEAR_ID
    , OUTS_CT
    , START_BASES_CD AS BASES_CD
    , BAT_RE+(RUN1_RE*RUN1)+(RUN2_RE*RUN2)+(RUN3_RE*RUN3)+FATE_RE AS RE
FROM bases_cd, re_zero;

CREATE TABLE RE
AS
SELECT * FROM (SELECT * FROM re_temp
UNION ALL
SELECT DISTINCT
YEAR_ID
    , 3 AS OUTS_CT
    , BASES_CD
    , 0 AS RE
FROM re_temp) a
ORDER BY YEAR_ID, OUTS_CT, BASES_CD;

CREATE INDEX re_idx
ON re(YEAR_ID,OUTS_CT,BASES_CD);
```
It’s a bit more complex than what you might see elsewhere – this is to address sample size issues with single-season run expectancy tables. Later on I can show you what you can do with RE tables – make your own linear weights? No problem! Baserunning evaluation? Easy cheesy!

In the meantime, please, think of anything you want to see code for. Any of my articles where I’ve used Retrosheet data is fair game, either here or at THT – if you want to see how I did it, just ask and I’ll see about cleaning up the code.

If you’re looking for additional places to go for info, try the RetroSQL list or the BaSQL wiki/forum.

## IT'S NOT THE ECONOMY, STUPID!: MATCHING THEORY AND THE VOLATILE MARKET FOR HITTERS


This past offseason surprised a lot of people.  On one hand, Mark Teixeira, Derek Lowe, CC Sabathia, and Rafael Furcal had no trouble finding the contracts we would have expected before the recession hit.  On the other hand, Adam Dunn, Bobby Abreu, Pat Burrell, and Orlando Hudson all received contracts that surprised many.  The typical reason most people give for these weak contracts is the economy.  In this article, I will explain why microeconomics, and not the macroeconomic outlook, is best suited to explain this outcome.  I do not believe that any pitchers received lower contracts than expected this offseason, and the hitters who did receive weak contracts played similar positions; specifically, many played corner outfield positions.  Each hitter on the free agent market can fill only one or two positions, and many teams do not have openings at those positions.  Each pitcher on the free agent market, however, can fill a role that is valuable to all 30 teams.  Therefore, the ratio of suppliers to consumers for pitching services is always more stable.  Later in this article, I will discuss some statistical evidence that points even more strongly toward this conclusion.


## On Compensation

Well hey there!  Long time no see!  I have to apologize profusely for my lack of posts recently.  I hate to use school as an excuse, but I’m doing my last term in undergrad and I certainly know where my priorities lie.  It won’t be as long until the next one, promise :).

During this off-season the issue of free agent compensation has been a focal point.  It slowed the beginning of the free agency period, and it has delayed and decreased the value of players such as Orlando Cabrera, Adam Dunn, and Rafael Furcal.  Few teams seem willing to give up a first-round pick to sign a free agent, which has deflated the value of these Type A players.  In recent years we have seen teams a) let Type A free agents go, making less effort to re-sign them, in order to collect their compensation value, and b) sign as many Type A free agents in one off-season as they can to minimize the relative opportunity cost of each signing.  See: Oakland and Toronto in the 2006-07 off-season for part a, New York this past off-season for part b.

So first things first: why is there compensation to begin with?  When the players union fought to win free agency in opposition to the reserve clause, baseball was worried that it would become a pure money grab and wanted to defend the drafting team’s advantage in retaining its players.  So now we have things such as an exclusive negotiation period and Type A/B free agents.  The Type A designation was intended to benefit the bottom fifteen teams and make players more inclined to stick with their previous team, and in a way this off-season fulfilled that original intent.  Or did it?

Few players or agents predicted the bottom falling out of free agency the way it did, or else many players would have accepted the arbitration offer.  So the teams holding those players did not, in fact, tend to retain their services.  In addition, because each successive Type A signing costs a progressively later draft pick, the Yankees were able to grab a slew of Type A free agents at a relatively lower cost per signing.  Since the opportunity cost was lower for the Yankees to sign AJ Burnett than for, say, the Red Sox, the Yankees could afford to pay him more in real contract dollars, raising the standard for the rest of the teams.

So not only are players not being retained by their teams; it could also be argued that the compensation system is increasing the (seemingly) exponential salary growth of baseball’s top players.  I don’t think that’s what the owners wanted when they put the condition in.  The compensation system is deeply flawed, without even getting into how archaic the Elias Sports Bureau’s rating system is.

How does baseball fix this?  Baseball’s Collective Bargaining Agreement expires at the end of 2011, and there have been rumblings that this will be one of the issues the Major League Baseball Players Association brings up during negotiations.  They could take cues from the other big three professional sports leagues in North America, which have quite different systems (just don’t ask me to explain the NFL’s).

Trade deadline day in hockey had me thinking a bit.  NHL players are traded to contending teams in exchange for draft picks, as opposed to baseball, where teams accumulate more draft picks by picking up those rent-a-players; NHL teams don’t gain compensation for losing free agents.  Their method leads to bigger drafts for non-contending teams and worse drafts for competitive teams, presumably leading to faster team turnarounds.  It is a lot harder to value baseball draft picks, however, and the NHL does have that new salary cap to clamp down on rampant salary inflation.

It’s a difficult question and I certainly don’t claim to have any answers.  I might post more later as I think about and discuss the concept.  Since we’re drawing business analogues, does any other industry have anything close to this?

## Skills, Repeatability, and Peripherals


With Pitch FX a few years old and Hit FX around the corner for next year, I thought it was important to figure out what exactly the sabermetric community should use this new information for when it comes out.  The argument frequently made for looking at peripheral statistics is that they are more repeatable.  For instance, strikeout rate, walk rate, and home run rate are more repeatable for both pitchers and hitters than batting average on balls in play.  As a result, researchers have started studying pitcher and hitter performance by relying on more repeatable skills.  In fact, newer statistics such as contact rate, swing rate, and many others are being used to help determine the reliability of those statistics.

I do not mean to critique this form of research, but rather to specify its focus.  I believe that peripheral statistics are most useful when dealing with less data.  Many statistics, even those with relatively high autocorrelation, suffer from small sample size when you only have one year of data to draw from.  They are imperfect approximations of the player’s true skill level.  Hence, sabermetric researchers look for more reliable statistics to determine what the player’s true skill level is.

Run this test on a few different statistics and a clear pattern will emerge: peripheral statistics are only useful when you have insufficient data on the statistic you seek to predict.

First consider strikeouts per at-bat for hitters with more than 300 PA in any consecutive pair of years within 2005-2008.

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Model | 1.754011 | 4 | 0.438503 |
| Residual | 0.530821 | 620 | 0.000856 |
| Total | 2.284831 | 624 | 0.003662 |

#Obs = 625; F(4, 620) = 512.17; Prob > F = 0; R^2 = 0.7677; Adj. R^2 = 0.7662; RMSE = 0.02926

| K%2 | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
| --- | --- | --- | --- | --- | --- | --- |
| K%1 | 0.710873 | 0.047914 | 14.84 | 0 | 0.61678 | 0.804965 |
| O-Contact%1 | -0.07572 | 0.01615 | -4.69 | 0 | -0.10744 | -0.04401 |
| Z-Contact%1 | -0.10515 | 0.061988 | -1.7 | 0.09 | -0.22689 | 0.016578 |
| Swing%1 | -0.04951 | 0.025225 | -1.96 | 0.05 | -0.09905 | 2.36E-05 |
| _cons | 0.212677 | 0.067814 | 3.14 | 0.002 | 0.079503 | 0.34585 |

Here, K%2 is K/AB in year 2 and K%1 is K/AB in year 1; O-Contact%1 is contact rate per swing on pitches out of the strike zone in year 1; Z-Contact%1 is contact rate per swing on pitches in the strike zone; and Swing%1 is the percentage of pitches swung at in year 1.  Note that these peripheral statistics are significant with only one year of data available.
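As a rough illustration of the setup (not a reproduction of the table above: the data here is synthetic and the variable names are my own), a regression of this form can be sketched with NumPy:

```python
import numpy as np

# Sketch of the year-over-year regression: K%2 on K%1 plus peripherals.
# All data below is randomly generated; only the structure mirrors the text.
rng = np.random.default_rng(0)
n = 625
k1 = rng.uniform(0.05, 0.35, n)         # year-1 strikeout rate
o_contact = rng.uniform(0.40, 0.80, n)  # out-of-zone contact rate
z_contact = rng.uniform(0.75, 0.95, n)  # in-zone contact rate
swing = rng.uniform(0.35, 0.55, n)      # overall swing rate
# Assume year-2 K% loads mostly on year-1 K%, plus noise (my assumption).
k2 = 0.7 * k1 - 0.05 * o_contact + 0.02 + rng.normal(0, 0.03, n)

# OLS via least squares; last column is the intercept (_cons).
X = np.column_stack([k1, o_contact, z_contact, swing, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, k2, rcond=None)

resid = k2 - X @ coef
r_squared = 1 - (resid @ resid) / ((k2 - k2.mean()) @ (k2 - k2.mean()))
```

A real replication would substitute each hitter's observed rates for the random draws and add the standard errors and t-statistics shown in the table.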

However, increase the sample size a little by adding in another year, and these statistics are no longer useful.  Consider the following regression output:

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Model | 0.379447 | 5 | 0.075889 |
| Residual | 0.09156 | 131 | 0.000699 |
| Total | 0.471007 | 136 | 0.003463 |

#Obs = 137; F(5, 131) = 108.58; Prob > F = 0; R^2 = 0.8056; Adj. R^2 = 0.7982; RMSE = 0.02644

| K%08 | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
| --- | --- | --- | --- | --- | --- | --- |
| K%06 | 0.269959 | 0.083118 | 3.25 | 0.001 | 0.105532 | 0.434386 |
| K%07 | 0.548566 | 0.115553 | 4.75 | 0 | 0.319975 | 0.777158 |
| O-Contact%07 | -0.01525 | 0.041816 | -0.36 | 0.716 | -0.09797 | 0.067474 |
| Z-Contact%07 | -0.09129 | 0.11999 | -0.76 | 0.448 | -0.32865 | 0.146084 |
| Swing%07 | -0.02695 | 0.05308 | -0.51 | 0.613 | -0.13195 | 0.078059 |
| _cons | 0.133399 | 0.142735 | 0.93 | 0.352 | -0.14896 | 0.415762 |

While the statistics maintain their original sign, they are no longer remotely statistically significant.  Adding in a third year only further strengthens this case.  In fact, the strikeout rate itself from the previous year is more relevant than the peripheral statistics from the previous year, if you had to choose between one or the other.

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Model | 1.565549 | 3 | 0.52185 |
| Residual | 0.719282 | 621 | 0.001158 |
| Total | 2.284831 | 624 | 0.003662 |

#Obs = 625; F(3, 621) = 450.24; Prob > F = 0; R^2 = 0.6852; Adj. R^2 = 0.6837; RMSE = 0.03403

| K%2 | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
| --- | --- | --- | --- | --- | --- | --- |
| Ocontact%1 | -0.1576 | 0.017654 | -8.93 | 0 | -0.19226 | -0.12293 |
| Zcontact%1 | -0.82171 | 0.045198 | -18.18 | 0 | -0.91046 | -0.73295 |
| Swing%1 | -0.20744 | 0.0266 | -7.8 | 0 | -0.25968 | -0.15521 |
| _cons | 1.096603 | 0.037677 | 29.11 | 0 | 1.022614 | 1.170592 |

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Model | 1.729715 | 1 | 1.729715 |
| Residual | 0.555116 | 623 | 0.000891 |
| Total | 2.284831 | 624 | 0.003662 |

#Obs = 625; F(1, 623) = 1941.24; Prob > F = 0; R^2 = 0.757; Adj. R^2 = 0.7567; RMSE = 0.02985

| K%2 | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
| --- | --- | --- | --- | --- | --- | --- |
| K%1 | 0.888271 | 0.020161 | 44.06 | 0 | 0.84868 | 0.927862 |
| _cons | 0.020907 | 0.003741 | 5.59 | 0 | 0.013561 | 0.028252 |

The R^2 statistic is far larger for regressing K% in the second year on K% in the first year than for a model that predicts K% in the second year as a function of contact rate on pitches in and out of the strike zone and swing rate.
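That R^2 comparison can be sketched as follows; the data is again synthetic and of my own construction, so only the shape of the result (the lagged statistic beating an unrelated peripheral) is meaningful:

```python
import numpy as np

# Sketch: comparing R^2 across two specifications for the same response.

def r_squared(X, y):
    """R^2 of an OLS fit of y on X, with an intercept appended."""
    X1 = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(1)
n = 625
k1 = rng.uniform(0.05, 0.35, n)     # year-1 K% (synthetic)
swing = rng.uniform(0.35, 0.55, n)  # a peripheral, unrelated by construction
# Year-2 K% is driven by year-1 K% in this toy setup (my assumption).
k2 = 0.85 * k1 + rng.normal(0, 0.03, n)

r2_lagged = r_squared(k1[:, None], k2)        # model on K%1 alone
r2_peripheral = r_squared(swing[:, None], k2) # model on the peripheral alone
```

In this construction the lagged statistic dominates, echoing the pattern in the real regressions above.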

Walk rate is similar.  Initially, adding in peripheral statistics helps predict walk rate.  Note the statistical significance:

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Model | 0.455716 | 2 | 0.227858 |
| Residual | 0.29592 | 622 | 0.000476 |
| Total | 0.751636 | 624 | 0.001205 |

#Obs = 625; F(2, 622) = 478.94; Prob > F = 0; R^2 = 0.6063; Adj. R^2 = 0.605; RMSE = 0.02181

| BB%2 | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
| --- | --- | --- | --- | --- | --- | --- |
| BB%1 | 0.719233 | 0.032692 | 22 | 0 | 0.655034 | 0.783432 |
| O-Swing%1 | -0.05972 | 0.018512 | -3.23 | 0.001 | -0.09607 | -0.02337 |
| _cons | 0.040446 | 0.006507 | 6.22 | 0 | 0.027668 | 0.053224 |

Of course, add in a second year of data, and it is no longer useful to include O-Swing% from the previous year.

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Model | 0.110236 | 3 | 0.036745 |
| Residual | 0.053671 | 133 | 0.000404 |
| Total | 0.163907 | 136 | 0.001205 |

#Obs = 137; F(3, 133) = 91.06; Prob > F = 0; R^2 = 0.6726; Adj. R^2 = 0.6652; RMSE = 0.02009

| BB%08 | Coef. | Std. Err. | t | P>\|t\| | 95% CI min | 95% CI max |
| --- | --- | --- | --- | --- | --- | --- |
| BB%06 | 0.256429 | 0.079107 | 3.24 | 0.002 | 0.099959 | 0.412899 |
| BB%07 | 0.535592 | 0.088404 | 6.06 | 0 | 0.360732 | 0.710451 |
| O-Swing%07 | -0.05531 | 0.037621 | -1.47 | 0.144 | -0.12972 | 0.019107 |
| _cons | 0.034832 | 0.01453 | 2.4 | 0.018 | 0.006093 | 0.063572 |

In the interest of space, I will leave out some other regressions I ran, but the same phenomenon occurred for log home run rate for hitters, strikeout rate for pitchers, and walk rate for pitchers; several other statistics exhibit similar patterns as well.

The general point I am making is that as Hit FX and more statistics become available, the statistics that better represent specific skills (contact rate, swing rate, groundball rate, etc.) are used differently by different hitters to yield different results.  As we try to predict those results, the most useful predictors are often historical records of the very statistics themselves.  In other words, these new statistics are going to be most useful when projecting players with short track records, such as second-year players, and are not going to add much insight into predicting the performance of veterans.