What run estimator would Batman use? (Part I)

[A note from the author: This study ended up becoming more involved than initially suspected, mostly because the author is bad at estimating such things. As such, this is the first part of the piece, which will eventually be published in two or three parts, depending. This part isn’t very technical, and largely concerns itself with the theory behind run estimation. I state this up front so that you don’t get 2,000 words into the document only to be disappointed that not a single run estimator has been evaluated at all.]

This isn’t the first study on run estimator accuracy, and I don’t promise it will be the most thorough. But I’ve been skirting around the issue in my previous work here, and so I figured it was time to finally get around to doing it proper, so that I can just have something to conveniently reference every time it comes up in the future.

Most previous studies of accuracy have concerned themselves with accuracy at the team level, using seasonal totals. This makes sense for a lot of reasons – run scoring is a team process, and team level run scoring data is readily available for entire seasons. Here’s the rub, though – estimating runs at the seasonal team level isn’t that hard. Here’s a look at the distribution of team runs per game, 1954-2007:



Notice how everything bunches up in the center? That’s because there isn’t a vast difference in run scoring totals between teams over the course of an entire season. That’s how you can explain the sterling accuracy of my latest run predictor, using runs per game:

Avg. Error

Okay, so it’s not even as good as, say, batting average at predicting team run scoring. But it’s pretty decent, considering I just assumed every team was league average.



Power scores (or at least my attempt)

A while ago, I took on the task of building a better speed score.  Bill James had come up with his formula some time ago, and for what it’s worth, I found that his formula (warning: PDF) was pretty good.  My formula was much more stable over time… but it was also a lot harder to calculate.  So, while we can get a single number to describe speed, something that has a bearing on several different events in a game (stolen bases, triples, staying out of a double play), I’ve never seen an attempt made at a “power score.”  I figured I might as well give it a shot.
First, we identify events in a game that might involve “power.”  For example, power would clearly be involved in home run hitting, but what else might be involved?  I made a list of stats and rates that might just be involved.

  • Home runs per fly ball.  It’s hard to hit a home run on a ground ball.  But, a player with power would likely have the power to put a fly ball over the fence.
  • Fly balls fielded by outfielders rather than infielders (whether caught on the fly or not).  It seems sensible that even if a fly ball doesn’t leave the park, it is still the mark of a more powerful hitter that he would put more balls further away (the outfield) from him than close to him (the infield).  (Formula: OF fly balls / total fly balls and popups)
  • Doubles and triples per ball in play.  A grounder or line drive could end up going for a double or triple, but certainly, they are more likely on a fly ball that hits the wall.  In either case, the ball was probably hit pretty hard.
  • Line drive rate.  Part of hitting for power is making good solid contact, and line drives are the sign of good solid contact.
  • Ground balls fielded by outfielders rather than infielders.  Again, a ground ball that goes through the infield is more likely to have been hit harder (or perhaps just placed better?) than if it was fielded by an infielder.  Again, it doesn’t matter if it went for a hit (the shortstop got to it, but couldn’t make the throw), just where it landed.
  • ISO.  This is supposed to be a measure of “isolated power.”  The formula is SLG-AVG.  Let’s see what happens.
  • BABIP.  Not often that we get to talk about BABIP from the batter’s perspective.  But again, balls in play that go for hits mean that fielders had a hard time getting to them.  What’s one way to give a fielder a hard time getting to the ball?  Hit it really fast or really hard.

Like my methodology for calculating speed scores, I was dealing with a lot of probability numbers (with the exception of ISO).  Probability distributions are notoriously not normal, so I applied a normality transformation by taking the natural log of the odds ratio.  I restricted myself to players who had at least 100 PA in the season in question, and I had a database stretching from 2000-2007.  I converted all natural logs of the odds ratio to Z-scores based on the distribution present in the year in question (to get everything into the same basic range of scale).  I then subjected these Z-scores to an exploratory factor analysis, with a Varimax rotation, to see which of these variables hung together.  I saved factors with an Eigenvalue over 1.00.  If you have no idea what I just said, just trust me on this one. 
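As a rough illustration of the transformation described above, here is a minimal Python sketch. It rests on assumptions: "natural log of the odds ratio" is read as the log of the odds p / (1 - p), the function names are invented for this example, and the sample rates are made up rather than drawn from the 2000-2007 database.

```python
import math
from statistics import mean, pstdev

def log_odds(p):
    """Natural log of the odds p / (1 - p); p must be strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def season_z_scores(rates):
    """Turn one season's rates into Z-scores of their log odds."""
    logits = [log_odds(p) for p in rates]
    mu = mean(logits)
    # Population SD is an assumption; the sample SD would be an equally plausible choice.
    sigma = pstdev(logits)
    return [(x - mu) / sigma for x in logits]

# Invented HR/FB rates for five hypothetical hitters in a single season.
hr_per_fb = [0.05, 0.08, 0.11, 0.15, 0.22]
z = season_z_scores(hr_per_fb)
print([round(v, 2) for v in z])
```

The Z-scores for each season would then be the inputs to the factor analysis, which is not reproduced here.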
The results were a little bit surprising.  I got two factors (gory detail: picked up 59.7% of the variance present.)  So far, so good.  HR/FB, XBH, and ISO hung together, as might be expected.  Outfield flies were also part of this factor, although they loaded negatively.  So, we would expect someone who hits a lot of home runs and doubles to hit fewer fly balls to the outfield (or more to the point, more infield flies.)  The other factor that emerged was a combination of BABIP, ground balls that go through to the OF, and line drive rate.  (gory detail: There was very little in the way of cross-loading factors.)
So, home runs generally are accompanied by other extra base hits, and that generally pushes the ISO up (not a huge surprise that those would all hang together).  But, something that speaks of a power hitter is actually (comparatively) a lot of infield pop ups.  We already know that power hitters are given to striking out and that they hit a lot of foul balls, but it also looks like they have a propensity to hit infield pop ups.  Seems that trying to hit big fly balls has plenty of risks.  Swinging really hard is bought at a price of lowered plate coverage, but also it looks like the ability to control the bat angle goes.  Get the horizontal angle wrong, hit it foul.  Get the vertical (bat impact angle) wrong, hit a harmless popup.  Get it all right, fireworks.
To check to see whether my two new scales were consistent over time, I looked at their intraclass correlation over four years (2004-2007).  The first factor (call it “big fly power”) had an ICC of .740, indicating excellent consistency over time.  The second factor (call it “solid contact power”) had an ICC of only .380, which really isn’t all that good (not horrible, but not great).  Since two of the components are getting the ball through the infield on a ground ball and getting a hit when the ball’s in play (we might, thus, call this one “hitting for average”), there’s something to be said for the fact that this skill is only moderately consistent from year to year.  What might stand in the way of those two skills?  The defense.  Trying to live off of hanging back and making solid contact has its own risks.  If you hit it where they ain’t, bully for you.  If the defense can cover a lot of ground, you’re going to have some issues.  There’s no defending a fly ball that either hits the wall or goes over it.
Now, why go to all this trouble (and what the heck is exploratory factor analysis?) to figure out power numbers?  Can’t one use ISO or HR/FB or something like that on their own?  Sure, ISO and HR/FB correlate well with the “big fly” factor (.931 and .872, respectively, meaning that they parallel it very closely).  But they are not as consistent over the years for batters (ICC’s of .648 and .675, still quite good, but not as good as the total factor).  Here’s the beauty of exploratory factor analysis and scale construction.  Put a few things together that are correlated anyway, and the possible random variations to the extreme in one can be balanced out by the others and make for a more stable whole.  If you want a good number that will stay more consistent over the years, use my power number.  If you want a quick and dirty number that’s really easy to get, go with ISO.
While I was in the neighborhood though, I ran a correlation matrix and found that ISO and BABIP are actually un-correlated with one another.  Looks like the old scouting adage about “hit for power” and “hit for average” being two separate tools is accurate.  One can be both or neither (well, if you’re in MLB, you’re at least one), but one doesn’t tell you much about the other.
For those interested, I’ve posted the 2007 list here, sorted as always by Retrosheet ID.

Adjusted W-L: A Study of the Unlucky

If you have read any of my work on Starting Pitchers and SP Effectiveness it will come as no surprise that I strongly dislike Win-Loss records. 
In the 2005 season, Johan Santana posted the following numbers:

  • 16-7 actual W-L
  • 2.87 ERA
  • 7.02 IP/gm
  • 231.2 IP
  • 0.97 WHIP
  • 5.29 K:BB
  • 3 CG/2 SHO
  • 33 Games Started

In 2005, Bartolo Colon won the AL Cy Young Award.  Any idea of how many of the above categories, which we all intuitively equate to pitching effectiveness, Colon outranked Santana in? 
One.  One category.  Colon beat Santana in only one category in 2005.  Care to venture a guess as to which it was?  Combine my sarcastic tone with the title/first line of this article if you need help.  That’s right.  The one category he outperformed Santana in was WINS, 21-16.  Santana outperformed Colon in every other statistical category in 2005 and somehow lost the Cy Young.  Not to take anything away from Colon’s season, but he clearly did not perform better than Santana in any category other than wins, and they had the same number of starts.  And to say that the Angels made the playoffs strictly because of Colon is just slightly over borderline ridiculous. 
For reasons unbeknownst to me, W-L has become an extremely significant barometer when measuring the quality of a season and of a career.  We invest a ton of stock into a statistic that paints us half of a whole portrait.  Ask yourself this – what does a W-L record tell us?
Does it provide a ratio of how often someone pitched well to how often he didn’t?  No, because a Win does not always equate to a well-pitched game and a loss does not always equate to a poorly-pitched game.
Does it take into account the fact that some teams score more than others?  No, because you get credited with a win if you last at least five innings and your team never relinquishes the lead once you leave.  It does not matter if you give up six runs in seven innings as long as you meet those criteria.
A few weeks back I introduced my statistic, AQS – Adjusted Quality Start, which refers to when a pitcher either goes 6+ IP while surrendering 3 or fewer earned runs or 7.2+ IP while surrendering no more than 4 earned runs.  Using the AQS allows us to find the ratio, mentioned in the question above, of how often a pitcher performed well in comparison to not performing well.  Regardless of whether or not you received the deserved decision, or whether or not you even received a decision, if you meet the criteria of an AQS it means you pitched well and, in theory, deserve to win.
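The AQS rule can be written as a short check. This is a sketch in Python, not anything from the original research: the function names are hypothetical, and innings are assumed to arrive in box-score notation, where "7.2" means seven innings plus two outs.

```python
def ip_to_outs(ip):
    """Convert box-score innings notation ('7.2' = 7 innings plus 2 outs) to outs."""
    whole, _, frac = str(ip).partition(".")
    return int(whole) * 3 + int(frac or 0)

def is_aqs(ip, earned_runs):
    """True if a start meets either AQS condition."""
    outs = ip_to_outs(ip)
    standard = outs >= 18 and earned_runs <= 3  # 6+ IP with 3 ER or fewer
    extended = outs >= 23 and earned_runs <= 4  # 7.2+ IP with 4 ER or fewer
    return standard or extended

print(is_aqs("6.0", 3))  # True: an ordinary quality start
print(is_aqs("7.1", 4))  # False: one out short of the 7.2 IP threshold
print(is_aqs("7.2", 4))  # True: qualifies under the second condition
```

Counting outs rather than decimal innings avoids treating "7.2" as the number 7.2.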
Springboarding off of the AQS, I began to separate W-L records into what they really were – a combination of Cheap Wins, Tough Losses, Legitimate Wins, and Legitimate Losses.  The legitimate decisions refer to games in which a pitcher either recorded an AQS and won, or did not record an AQS and lost.  The reverse can be said for the Cheap Wins/Tough Losses.  Failing to record an AQS and getting a win really should not happen, and the same can be said for garnering a loss while recording an AQS.
I will use the 2007 season of John Smoltz to put this to use.  By all accounts he had a great year but he often gets lost in the Peavy/Webb shuffle when discussing the best in the NL this past season.  Peavy won 19 games, Webb won 18, and Smoltz only won 14.  Something deep down tells us that Smoltz had a better season than his 14-8 record would indicate, but how much better?
Looking more closely at his 14-8, we see that he had 0 Cheap Wins, 5 Tough Losses, 14 Legit Wins, and 3 Legit Losses.
If we take the Cheapies and Toughies out, Smoltz is left with a 14-3 record of legitimate decisions.  I want to go a bit further, though, because he recorded 22 decisions no matter how we look at it.  He legitimately deserved to go 14-3, but there were five games he lost that he pitched well enough to win.
With that in mind, I began to adjust the W-L records of pitchers and see what would happen if they were credited with a Win for every Tough Loss and a Loss for every Cheap Win, on top of the Legit Wins and Legit Losses.
When we apply that to Smoltz, his 2007 Adjusted W-L would be 19-3.  When we do the same to Peavy and Webb we get a 21-4 record for Peavy and a 20-8 record for Webb.
Essentially, Smoltz should have won 19 of his 22 decisions, Peavy should have won 21 of his 25 decisions, and Webb should have won 20 of his 28 decisions. 
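The adjustment is simple arithmetic, sketched here in Python with a hypothetical function name. The only data used is Smoltz's 2007 breakdown quoted above.

```python
def adjusted_record(cheap_wins, tough_losses, legit_wins, legit_losses):
    """Credit a win for every Tough Loss and a loss for every Cheap Win."""
    adj_wins = legit_wins + tough_losses
    adj_losses = legit_losses + cheap_wins
    return adj_wins, adj_losses

# John Smoltz, 2007: 0 Cheap Wins, 5 Tough Losses, 14 Legit Wins, 3 Legit Losses.
smoltz = adjusted_record(cheap_wins=0, tough_losses=5, legit_wins=14, legit_losses=3)
print("%d-%d" % smoltz)  # 19-3
```

The total number of decisions stays the same (22 for Smoltz); only their split between wins and losses changes.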
If we are going to use W-L record as a barometer of quality, then we should use this Adjusted W-L instead since it actually does give us the ratio of how many times a pitcher performed well relative to the decisions he received.
Below is a table featuring the Actual W-L records and the Adjusted W-L records of some NL pitchers from 2007.





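Pitcher | Actual W-L | Adjusted W-L
Jake Peavy | 19-6 | 21-4
John Smoltz | 14-8 | 19-3
Cole Hamels | |
Brad Penny | | 19-1
Tim Hudson | |
Ted Lilly | |
Matt Cain | 7-16 | 16-7
Ian Snell | |
Dontrelle Willis | 10-15 | 15-10
Adam Eaton | 10-10 | 6-14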




As we can see, Brad Penny had the best Adjusted W-L of any NL pitcher as he truly deserved to lose only one of his decisions.  If he had received proper run support and been a bit luckier in the games in which he recorded decisions, he would have posted a 19-1 record.  I wonder if it would have been a different Cy Young picture if he had. 
Look at the cases of Matt Cain, Dontrelle Willis, and Adam Eaton.  Cain finished the season with an actual W-L of 7-16, even though he deserved to go 16-7.  That means he was unlucky nine times.  Dontrelle Willis should have been 15-10 even though he ended up 10-15, meaning he was unlucky five times.  Yes, by all accounts Dontrelle had a down season, but he did really deserve to win 15 of his decisions. It was just how bad his 10 deserved losses and no-decisions were that turned his season upside down.
On the flip-side, Adam Eaton finished the season 10-10, even though he deserved to be 6-14.  While Cain and Willis were very unlucky, Eaton turned out to be lucky four times.
When we look at the number of Cheap Wins and Tough Losses, we can take the difference between them (Cheap Wins minus Tough Losses), express it as a + or – number, and detail which pitchers were the luckiest and unluckiest.  This is a bit different than the Pythagorean Formulas used to determine what a team’s record should be.  The team formulas look at the season as a whole and provide estimates as to what an overall record should be based on how many overall runs are scored and given up.
It does not make sense to use that here, because if a pitcher gives up 10 runs in Game 1, and 1 Run in Game 2, the average would come out to two bad starts, even though the starts are completely separate and the damage was done in one game.  The team formulas evaluate the entire forest without looking at each individual tree.
Looking at each individual tree needs to be done to really show which pitchers were luckiest and unluckiest.
In the case of Cain, he had 0 Cheap Wins and 9 Tough Losses.  Net Luck = 0 – 9, meaning that Cain had a Net Luck Rating of -9, or in other words was very unlucky.  There were no recorded Wins that he should have lost but there were nine recorded losses he should have won, or at least not recorded a loss.
Adam Eaton had 5 Cheap Wins and 1 Tough Loss.  5 – 1 = 4.  Eaton’s Net Luck was +4, meaning he was lucky four times.  Positive numbers correspond to being lucky, negative numbers correspond to being unlucky, and 0 corresponds to receiving exactly what you should have received.
Aaron Harang was 16-6 with 0 Cheap Wins and 0 Tough Losses.  He had a great season and deserved to go 16-6 in his decisions.  He would have a Net Luck Rating of 0, since he was not lucky or unlucky.
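The Net Luck arithmetic for these three pitchers can be sketched in Python as follows. The function names are hypothetical; the inputs are the figures quoted in this article.

```python
def net_luck(cheap_wins, tough_losses):
    """Positive = lucky, negative = unlucky, zero = got what was deserved."""
    return cheap_wins - tough_losses

def adjusted_from_actual(wins, losses, luck):
    """Undo the luck: each net lucky decision moves one win into the loss column."""
    return wins - luck, losses + luck

cain = net_luck(cheap_wins=0, tough_losses=9)    # -9
eaton = net_luck(cheap_wins=5, tough_losses=1)   # +4
harang = net_luck(cheap_wins=0, tough_losses=0)  # 0

print(adjusted_from_actual(7, 16, cain))    # (16, 7)
print(adjusted_from_actual(10, 10, eaton))  # (6, 14)
print(adjusted_from_actual(16, 6, harang))  # (16, 6)
```

The second function shows that Net Luck and the Adjusted W-L are two views of the same bookkeeping: the adjusted record is the actual record with the net luck removed.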
When pitchers tie in either luck or lack of luck, the statistic we should look to is AQS %, which refers to the percentage of starts in which a pitcher recorded an AQS.  With lucky pitchers, a lower AQS % tells us they pitched well less often, so the luckiest is the one who recorded the most Net Luck while pitching well the least often.  For unlucky pitchers we look at the highest percentage, because it tells us that the pitcher was not only unlucky enough to lose games he should have won but also pitched well a higher percentage of the time.
For instance, Scott Olsen, Adam Eaton, and Byung-Hyun Kim all tied with a +4 Net Luck Rating, meaning they were the luckiest NL pitchers.  Olsen had an AQS % of 33.3, Kim at 27.3, and Eaton at 26.7.  Therefore, Adam Eaton was the luckiest NL pitcher because he received four positive decisions that were unmerited and pitched well the least amount of time.
Though Cain, Bronson Arroyo, and Derek Lowe all ranked higher than Dontrelle and Smoltz, the latter two finished at -5.  Dontrelle had an AQS % of 57.1 and Smoltz at 84.4 %.  Therefore, Smoltz was unluckier than Willis because he received five negative decisions that were unmerited and pitched well way more often.
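The tie-break rule amounts to a two-key sort. Here is a small Python sketch under that reading; the function names are hypothetical, and the tuples hold only the Net Luck and AQS % figures quoted above.

```python
# (name, Net Luck, AQS %) as quoted in the text
lucky = [("Scott Olsen", 4, 33.3), ("Byung-Hyun Kim", 4, 27.3), ("Adam Eaton", 4, 26.7)]
unlucky = [("Dontrelle Willis", -5, 57.1), ("John Smoltz", -5, 84.4)]

def rank_luckiest(pitchers):
    """Highest Net Luck first; ties go to the LOWER AQS %."""
    return sorted(pitchers, key=lambda p: (-p[1], p[2]))

def rank_unluckiest(pitchers):
    """Lowest Net Luck first; ties go to the HIGHER AQS %."""
    return sorted(pitchers, key=lambda p: (p[1], -p[2]))

print([name for name, _, _ in rank_luckiest(lucky)])
print([name for name, _, _ in rank_unluckiest(unlucky)])
```

With these inputs Eaton sorts ahead of Kim and Olsen, and Smoltz sorts ahead of Willis, which matches the conclusions in the text.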
When we apply Net Luck to every pitcher in 2007, in both the NL and AL, we get the following results –

  • Luckiest NL SP = Adam Eaton (PHI), +4
  • Luckiest AL SP = Odalis Perez (KC), +4
  • Unluckiest NL SP = Matt Cain (SF), -9
  • Unluckiest AL SP = Dan Haren (OAK), -6

Though Haren pitched well and still finished 15-9, he should have been 21-3.  Odalis Perez actually tied Felix Hernandez of the Mariners at +4, but Hernandez’ AQS % was 57.1 whereas Perez came in at 30.8.
Honorable Mentions for Luck in 2007 go to:

  • Scott Olsen, +4
  • Byung-Hyun Kim, +4
  • Paul Byrd, +3
  • Boof Bonser, +3
  • Jeremy Bonderman, +3

Honorable Unlucky Mentions in 2007 go to:

  • Bronson Arroyo, -7
  • Derek Lowe, -6
  • John Smoltz, -5
  • Mark Buehrle, -5
  • Gil Meche, -5
  • Dontrelle Willis, -5

Though I do not have all of the data compiled right now, something I am going to investigate over the next few weeks is which pitchers, from 2000-2007, have been the luckiest and unluckiest.
Another usage of Net Luck that fascinates me, and that I am currently researching for my book, involves an application to 300 game-winners, as well as those who are close.  Something tells me that I will find some guys with 300 wins who maybe should not have 300 wins, as well as some guys who are short of 300 that really should have it.  After all, if we are going to use 300 wins as a Hall of Fame barometer, we should at least make sure the wins are deserved.
I am currently involved in conducting this research and if anyone would like to help, please get in touch with me.

2007 NL Starting Pitching Analysis

When it comes to analyzing and comparing pitchers, those conducting the comparisons will often find themselves in a tricky situation.  Sure, certain pitchers are better than others, but what are they specifically better at? 

How can we conduct an honest analysis when there are so many variables to consider?  And how can we truly determine which pitchers were better than others when some are on terrible teams with no run support and others are on tremendous teams with tons of run support?
The first step is to determine what we are measuring.  If we want to know who the best strikeout pitcher is, we should look at the raw total for strikeouts and also an average of K/IP, since some guys will make fewer starts than others.  To figure out who walks the least, we measure the number of walks each pitcher gives up and a walk-IP ratio.
These measurements are contingent on one category, though, and cannot tell us who is better or more effective than the rest.  All of the research and ideas presented in this article are designed to measure the “effectiveness” of a pitcher. 
In order to determine this effectiveness, a whole heck of a lot of numbers need to be measured and properly weighted/scaled so that everybody has a fair shot – whether or not they are on a great team.
I took the 1-3 best pitchers from each National League team and entered their statistics into a database, measuring everything from their raw Innings Pitched totals to their Adjusted Quality Start % (you’ll read more on that below).  After entering all of the statistics, and crunching numbers until my brain turned to mush, I came up with my weighted points system.  I assigned the corresponding point totals and added everything up to determine what I feel is a very accurate measurement of pitching effectiveness amongst the NL’s best. 
This was not applied to every single NL Pitcher in 2007 (I will do that another time) but rather amongst these 30 selected #1, #2, or #3 starters.  For instance, a guy like Jeff Suppan may have been more effective than Jason Bergmann but I wanted to have at least one person from each team.
The system is not 100% perfect and does not take into account every single statistic (do you know how many statistics there are??), but it definitely levels the playing field between those on good or bad teams, those injured/called up or just plain bad, and those who got lucky or unlucky with run support.  The points are assigned based on the areas I, as an intense student of the game, feel are most important to determine true effectiveness. 
The basic idea of this system is to measure the true quality of a pitcher over his season – i.e., what would happen if a pitcher were rewarded every time he pitched well and discredited every time he pitched poorly – something that happens perfectly just about 0% of the time. 
We will begin by going over the statistics involved, what their points scale was, and why they are used.  The idea behind these corresponding point totals is to properly weight the areas that most people intuitively attribute to success and quality.
The points given to each statistical subset are designed to separate the aces from the workhorses and the workhorses from the seemingly replacement level pitchers.  They may seem arbitrary and could be replaced with different numbers, or fractions/decimals; however, the difference between the points in subsets was based on the number of pitchers who fall into certain categories.
In order to be as effective as possible, a pitcher needs to make as many starts as he can.  How can we say that a pitcher with 14 starts is more effective than one with 34-35, even if his numbers in those 14 starts are tremendous and the numbers of the one with 34-35 are a bit worse?  His numbers may be better than the pitcher with 35 starts, however the latter pitcher was involved in 21 more games and proved to be durable enough to pitch an entire season, and solid enough to maintain his SP status for 162 games. 
This does not mean that a pitcher with 35 starts is necessarily “better” than one with 14-16, but rather he is more effective because he is involved in more of his team’s season. 
If the pitcher with 14-16 starts posted the same numbers in 32 starts, it would not be a contest.  But, he didn’t – it was only 14-16.  You cannot have as much of an effect on your team (actual play, not motivational or anything) unless you are out there as often as possible.
***What the end result of this effectiveness points system showed is that those with average numbers, over 30+ starts, were equally as effective, or slightly better/worse, than those with good numbers over 16-20 starts.***
If somebody makes only 14 starts in a season, it could be because he was injured for half of the season or was called up from the minors during the season, so he should not be penalized with negative points for that – he just should not be rewarded as highly as someone with 30+ starts.

  • if 30 or more starts, +5
  • if 25-29 starts, +3
  • if 20-24 starts, +2
  • if under 20 starts, 0

Just like Games Started, IP can only get you positive numbers, because the low raw number of IP can be attributed to injury or a midseason call-up.  Those with more IP get higher point totals, though.  The reason for 0 points for under 100 innings is because you were not necessarily a bad pitcher, but the lack of innings (whether due to injury or a call-up) limits the effectiveness.

  • if 230+, +8
  • if 220-229, +7
  • if 200-219, +5
  • if 150-199, +3
  • if 100-149, +2
  • if under 100, +1

This is where negative numbers can begin.  If you were hurt, or called up from the minors, you are not penalized with negatives for the raw number of innings pitched or games started, but if you posted a high number of starts and low number of innings, this statistic will bite you in the rear.  IP/Game separates the hurt or called up from the downright below average or bad.  It also helps reward those with a couple fewer starts than others but with more raw innings pitched.  These types of pitchers were in the same GS range but some went deeper into games than others.  Nobody averaged over 7 IP/gm, so we start lower.

  • if 6.5-7 IP/gm, +7
  • if 6.0-6.49 IP/gm, +5
  • if 5.5-5.99 IP/gm, +3
  • if 5.0-5.49 IP/gm, 0
  • if below 5.0 IP/gm, -5

If you cannot average over 5 innings per game, or exactly 5 innings per game, you should not be a starting pitcher.  Even Adam Eaton averaged over 5 IP/gm in 2007.
Quality Starts can be an inaccurate statistic because it takes into account games in which a pitcher goes 6+ innings and gives up no more than 3 earned runs… and nothing else.
If a pitcher goes 8.1 innings and gives up 4 runs, it is arguably the same ratio and an equal game in terms of quality, but does not get counted as a quality start.
With that in mind, I came up with the stat of Adjusted Quality Starts, which takes into account all regular quality starts as well as games in which someone goes 7.2-9 innings and gives up no more than 4 runs.  This measures the true number of games in which a pitcher had a good-great performance.
***If you wonder why it is 7.2 IP, instead of 8, the number was derived from the amount of times a pitcher was lifted after 7.2 IP for a specialist, or other sort of reliever, and from the sheer low average of innings pitched/game by a starter this year.  Reaching the 7th inning is now a great feat, let alone coming within one out of finishing the 8th.  Though the previous ratio for a QS was 2:1, due to the data mentioned above, going an extra 1.2 IP to get to 7.2 IP merits being able to give up one more run.***
I used the percentage of AQS to the total number of Games Started to measure effectiveness in this area.  Someone over 75% almost always pitches a good-great game, whereas someone under 50% pitches a good game less than half of the time – not very effective.

  • if AQS % is 75% or above, +5
  • if AQS % is 67-74%, +3
  • if AQS % is 50-66%, 0
  • if AQS % is below 50%, -3

If you’re keeping score at home, AQS = 6+ IP with ER =< 3, OR 7.2+ IP with ER =< 4, where =< is the blog version of less than or equal to. 
In addition to AQS, something that needs to be taken into account is how often a pitcher went for a complete game, since they are so rare.  We also need to take into account a shutout, since they occur even less. 

  • For every CG, +2
  • For every SHO, additional +1

***NOTE: Aaron Harang had two games in 2007, one where he went 9 IP, and one where he went 10 IP, when he did not get a decision.  Even so, I am counting these 2 as a combined 1 CG, since he went 9+ innings.***
W-L Records are the most deceiving statistics because they do not take into account the true quality of the games pitched.  Just because a pitcher goes 14-7 does not mean he was necessarily a great pitcher.  He could have pitched terribly and had great run support in 10 of 14 wins, but brilliantly with terrible run support in the 7 losses.
The whole point of the adjusted W-L records is to get an AQS, since that means you pitched well and should be rewarded, even if your team (offense or bullpen) does not help you. 
After all, Ian Snell cannot control the Pirates’ offense.  It is not his fault that 4 of his 12 losses were “Tough Losses” and all 11 of his No-Decisions were games in which he pitched brilliantly and had an AQS, yet he received little to no offense to help garner him a ‘W’.
With that in mind, I changed W-L to the following 5 stats:

  • Cheap Wins: wins in which one does not get an AQS (-1)
  • Tough Losses: losses in which one does get an AQS (+2)
  • Legit Wins: wins in which one does get an AQS (+2)
  • Legit Losses: losses in which one does not get an AQS (-2)
  • ND-AQS: no-decisions in which one gets an AQS (+1)

I received some questions for how these numbers came to be, and to keep it simple, the statistics that actually have an effect on the W-L record are valued higher (negatively and positively) than the statistics like ND-AQS, which prevent a pitcher from winning but do not hurt him with a loss.
ND-non AQS is not used here for the same reason that Cheap Wins is only negative one, which is that not every Cheap Win or ND-non AQS was a terrible start.  A large bulk of them were games in which a pitcher had a good outing but only went 5 or 5.1 innings.  A Cheap Win loses you a point (not two, only one) because you do not get an AQS but it does affect your win-loss record.  ND-non AQS means you do not get an AQS but it does not affect your win-loss record, which is why I decided to just leave it out.
Though I am not too fond of this statistic and originally tinkered around with separately evaluating H/IP and BB/IP, using WHIP just seemed to make things easier.  Though it does not tell us which pitchers walk less and give up more hits, or vice versa, or tell us how many “empty innings” a pitcher had (innings where no baserunners got on), it does provide a valid average of baserunners to expect in a given game since it does not equate to a per-9 inning scale.

  • if WHIP 1.00-1.15, +3
  • if WHIP 1.16-1.25, +2
  • if WHIP 1.26-1.30, +1
  • if WHIP 1.31-1.40, 0
  • if WHIP above 1.40, -2

Instead of using K’s, I wanted to use the ratio of strikeouts to walks, since not every pitcher is a strikeout pitcher.  Even so, you do not have to be a strikeout pitcher to be an accurate one, and because of this I rewarded those with high K:BB ratios.  Greg Maddux only struck out 104 in 34 starts, but only walked 25 – a K:BB of 4.16.  This meant that Maddux kept more runners off-base by striking them out and not walking them.

  • if K:BB above 4, +7
  • if K:BB above 3, +5
  • if K:BB above 2, +3
  • if K:BB above 1, 0
  • if K:BB 1 or below, -3

Now that we have the points, let’s test it out and put it to use.  We will use Ian Snell and Carlos Zambrano.
The table below shows Ian Snell’s 2007 numbers and points he receives for each in my points system.

Statistic | Value | Points
Starts | 32 | +5
Innings | 208.0 | +5
Cheap W | 0 | 0
Tough L | 4 | +8
Legit W | 9 | +18
Legit L | 8 | -16
ND-AQS | 11 | +11
AQS % | 75% | +5
IP/Game | 6.52 | +7
WHIP | 1.33 | 0
K:BB | 2.60 | +3
CG | 1 | +2
SHO | 0 | 0

When we add up all thirteen of these numbers, we get Snell’s Effectiveness #, which comes to: +48.
Now, let’s look at Carlos Zambrano’s season numbers in the table below and add his point totals up.

Statistic | Value | Points
Starts | 34 | +5
Innings | 216.1 | +5
Cheap W | 0 | 0
Tough L | 2 | +4
Legit W | 18 | +36
Legit L | 11 | -22
ND-AQS | 0 | 0
AQS % | 53% | 0
IP/Game | 6.36 | +5
WHIP | 1.34 | 0
K:BB | 1.75 | 0
CG | 1 | +2
SHO | 0 | 0

We look at his numbers and add up the totals to get his Effectiveness #: +35.
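For anyone who wants to check the arithmetic, the whole points system fits in a short Python sketch. All names here are hypothetical, and several boundary cases are assumptions where the scales leave gaps: exactly 30 starts is scored +5, an AQS % of exactly 75 is scored +5 (as in Snell's table), a WHIP below 1.00 is scored +3, and under 100 innings follows the list (+1). Inputs are taken straight from the two tables rather than recomputed.

```python
def tiered(value, tiers, floor):
    """Points for the first (minimum, points) tier the value reaches, else floor."""
    for minimum, points in tiers:
        if value >= minimum:
            return points
    return floor

STARTS = [(30, 5), (25, 3), (20, 2)]                          # under 20: 0
INNINGS = [(230, 8), (220, 7), (200, 5), (150, 3), (100, 2)]  # under 100: +1
IP_PER_GAME = [(6.5, 7), (6.0, 5), (5.5, 3), (5.0, 0)]        # below 5.0: -5
AQS_PCT = [(75, 5), (67, 3), (50, 0)]                         # below 50: -3

def whip_points(whip):
    """Lower WHIP is better, so the tiers run upward."""
    for maximum, points in [(1.15, 3), (1.25, 2), (1.30, 1), (1.40, 0)]:
        if whip <= maximum:
            return points
    return -2

def kbb_points(kbb):
    """The K:BB tiers are strict ('above 4', 'above 3', ...)."""
    for minimum, points in [(4, 7), (3, 5), (2, 3), (1, 0)]:
        if kbb > minimum:
            return points
    return -3

def effectiveness(gs, ip, cheap_w, tough_l, legit_w, legit_l, nd_aqs,
                  aqs_pct, ip_per_game, whip, kbb, cg, sho):
    """Sum every component of the points system for one pitcher-season."""
    return (tiered(gs, STARTS, 0)
            + tiered(ip, INNINGS, 1)
            - cheap_w + 2 * tough_l + 2 * legit_w - 2 * legit_l + nd_aqs
            + tiered(aqs_pct, AQS_PCT, -3)
            + tiered(ip_per_game, IP_PER_GAME, -5)
            + whip_points(whip)
            + kbb_points(kbb)
            + 2 * cg + sho)  # each SHO is an extra +1 on top of its CG

snell = effectiveness(32, 208.0, 0, 4, 9, 8, 11, 75, 6.52, 1.33, 2.60, 1, 0)
zambrano = effectiveness(34, 216.1, 0, 2, 18, 11, 0, 53, 6.36, 1.34, 1.75, 1, 0)
print(snell, zambrano)  # 48 35
```

Both totals match the Effectiveness numbers given above.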
Zambrano had more legit wins but also more legit losses, and of Zambrano’s 3 no-decisions, none were ND-AQS, whereas of Snell’s 11 no-decisions, all were ND-AQS.
In other words, give each pitcher a win for every start in which he pitched well (every AQS), leave him a loss only where he lost without an AQS, and leave him a no-decision only where he pitched poorly or did not go a full 6 IP.  Their records would then look like this –

  • Carlos Zambrano (18-13) would actually be 20-11
  • Ian Snell (9-12) would actually be 24-8
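The adjusted records above follow from the table counts.  Here is a small sketch of that bookkeeping; the function name is hypothetical, and treating a cheap win as a win earned without an AQS (so it flips to a loss) is my assumption from the name.  Neither pitcher has any cheap wins, so that assumption does not change these two results.

```python
def adjusted_record(legit_w, cheap_w, tough_l, legit_l, nd_aqs):
    """Record if every AQS start were a win (hypothetical helper).

    Assumption: a cheap win is a win without an AQS, so it becomes a loss.
    No-decisions without an AQS stay no-decisions and are not counted.
    """
    wins = legit_w + tough_l + nd_aqs
    losses = legit_l + cheap_w
    return wins, losses


zambrano = adjusted_record(legit_w=18, cheap_w=0, tough_l=2, legit_l=11, nd_aqs=0)
snell = adjusted_record(legit_w=9, cheap_w=0, tough_l=4, legit_l=8, nd_aqs=11)
print(zambrano, snell)
```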

Snell went further into his games, had a better K:BB ratio, and had that higher AQS %.  It also tells us that of Snell’s 32 starts, 24 of them were of great quality, whereas Zambrano had 18 good-great starts and 16 average-bad starts.
This essentially tells us that while Zambrano’s good-great starts may have been better than Snell’s good-great starts, when Zambrano had his bad starts, Snell was still having good-great ones.
As mentioned before, I used this points system to evaluate 30 National League pitchers.  I compiled a group of spreadsheets, ranking the pitchers in order in different categories to show that certain stats we rely on do a bad job of proving effectiveness.
To view all of my results, click on the links below.  You can use this data in other areas, but please credit my work.

  • To see the list of pitchers and their statistics used to assign points, click here.
  • To see the list of pitchers in order of effectiveness points, click here.

I do not want to post a ridiculously long table on this article, so you will need to look at the linked files to see the results, but I will list the top 15 pitchers and their effectiveness points (16 names, since three pitchers tie at +45).

  1. Jake Peavy, +74
  2. Aaron Harang, +69
  3. John Smoltz, +69
  4. Brandon Webb, +67
  5. Cole Hamels, +65
  6. Brad Penny, +64
  7. Tim Hudson, +63
  8. Ted Lilly, +60
  9. Matt Cain, +52
  10. Roy Oswalt, +50
  11. Ian Snell, +48
  12. Bronson Arroyo, +47
  13. Derek Lowe, +47
  14. Greg Maddux, +45
  15. Adam Wainwright, +45
  16. Jeff Francis, +45

And, again, these points were assigned to statistics based on how strongly they correlate with effectiveness.  The points system essentially covers the statistics and averages from all angles.
The most shocking part of this was how low Chris Young of the Padres came out.  Young went 9-8, with a 3.12 ERA, in 30 starts.  He should have been more effective, I thought, based on those numbers.  After looking at his game logs, though, I changed my mind and realized it made sense.
Across his 30 starts, he was essentially two different pitchers.  In the 19 starts in which he went 6+ innings, he was 9-1 with a 1.64 ERA, averaging 6.6 IP/gm, with a 0.85 WHIP and 129 K’s in 126.1 innings.
In the other 11 starts, he was 0-7, with a 7.14 ERA, only going 4.2 IP/gm, with a 1.76 WHIP, and 38 K to his 36 BB, in 46.2 innings.
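The two halves do recombine into his season line, which is a useful sanity check.  The sketch below backs out earned runs from each split's ERA and innings, then recomputes the season ERA.  Rounding earned runs to the nearest whole number is my assumption (the splits only give ERA), and the helper names are made up.

```python
def innings(ip_text):
    """Convert baseball notation ('126.1' means 126 and 1/3 innings) to a number."""
    whole, outs = ip_text.split(".")
    return int(whole) + int(outs) / 3


def earned_runs(era, ip):
    """Back out earned runs from ERA and innings; assumes rounding to a whole run."""
    return round(era * ip / 9)


good_ip, bad_ip = innings("126.1"), innings("46.2")
good_er = earned_runs(1.64, good_ip)
bad_er = earned_runs(7.14, bad_ip)

total_ip = good_ip + bad_ip
season_era = 9 * (good_er + bad_er) / total_ip
print(good_er, bad_er, round(season_era, 2))
```

Under those assumptions the two splits combine to the 3.12 ERA quoted above, even though neither half looks anything like a 3.12 pitcher.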
After analyzing his situation and the points system I realized that my effectiveness model favors consistency and lower standard deviations (roughly, how far someone typically strays from his average).  To me, that truly defines effectiveness.
I would much rather have a guy who I knew would amass an AQS 67% or more of the time than a guy who might strike out 20 batters and pitch a two-hitter in one game, but give up 5 runs in 6 innings for the next three, before again pitching a brilliant game.
As long as the consistency is of a good nature, consistency in this model proves effectiveness.
I know, we’re finally at the end of the article, right?  I apologize for the length but it took this long to get everything across. 
Looking at Jake Peavy, the most effective NL pitcher at +74, we see that the only counted statistic in which he led was AQS.  Peavy had the most good-great starts of any NL pitcher.  While he may not have led in IP, IP/gm, K:BB ratio, or fewest losses (Brad Penny only had 1 legit loss), he led in consistency and being consistently good-great.
These results also show that Cole Hamels, had he made the 6 starts he missed due to injury, would likely have challenged Peavy for #1 in effectiveness – however, as my model dictates, the fact that he missed those 6 starts and Peavy did not shows that Peavy was more effective.
Yes, there were more stats we could add to this, and more variables to account for, but I feel this accurately levels the field of play between pitchers in distinctly different playing situations, and levels the difference between 2007 reputation and 2007 actual performance.
I must remind you before I come to a close, though, that this is only a measure of effectiveness, not the end-all solution to determining who the “best” pitchers are.
However, for this Sabermetrician, effectiveness directly correlates with quality and value.

Stats 204: The proximity matrix OR Re-visioning similarity scores

I suppose that when Bill James invented the similarity score, it was an attempt to say “Who exactly is this guy like?”  Is he the second coming of Joe DiMaggio (the power hitter who never strikes out), or is he the second coming of Dave Kingman (the power hitter who strikes out a little more often)?  Maybe he’s the second coming of Tommy Hinzo.  How can we tell?  Mr. James put together a formula that attempted to answer exactly that question.  The formula itself is based on a fairly simple system of “start with 1000” and subtract points for differences in various statistical categories.  It’s not an awful system and generally produces some decent comparisons, but mathematically, we can do better than that!
Let’s pretend that there are only two stats in baseball that matter: walks and strikeouts.  We might use raw numbers of BB and K, but it makes more sense to put them into rate form.  We might classify players, in a very rough way, as being players who neither walk nor strike out much, players who walk and strike out a lot, players who strike out a lot, but don’t walk much, etc.  If we want to get more fine-grained, we can start saying medium or medium-low, etc.  Or if we want to find the player whose BB and K rates match most closely, we can start digging through the data.  If Player A strikes out 15% of the time and walks 7%, then Player B who strikes out 14.8% of the time and walks 7.1% is a good match.  Player C who strikes out 23% of the time and walks 5% isn’t a good match.  But, how good a match… or a non-match is he?  And what do we do when we get beyond two stats of interest?  How do we account for walks, strikeouts, and home runs, singles, or anything else for that matter?
Enter the proximity matrix.  Let’s go back to our “walks and strikeouts only” example.  We could plot walk rate and strikeout rate on a standard two-dimensional axis (graph paper), and label all the players.  Then we could measure (with a ruler!) which player is the closest to any other player.  That works great when there are only two variables.  Three-dimensional graph paper (for three variables) is harder to come by, and by the time we get to four variables, well now we’re into hyperspace.  (Yes, I love Star Trek too.)  Fortunately, mathematics isn’t bound by such constraints, and it’s possible to calculate the distance between two points in four (or more, there’s no limit) dimensions.  It’s called the squared Euclidean distance.  In fact, we can get a matrix of how far every player in our sample is from every other player.  That’s the lovely thing about computers, they do all the heavy lifting, and do it in rather short order.
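To make the distance concrete, here is a minimal sketch using the Player A, B, and C rates from the walks-and-strikeouts example.  The numbers are raw percentage points (no rescaling), and the function name is just a label for this illustration.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length stat lines."""
    if len(a) != len(b):
        raise ValueError("stat lines must have the same number of entries")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# (K %, BB %) for the three hypothetical players described earlier.
player_a = (15.0, 7.0)
player_b = (14.8, 7.1)
player_c = (23.0, 5.0)

d_ab = squared_euclidean(player_a, player_b)  # 0.2**2 + 0.1**2, about 0.05
d_ac = squared_euclidean(player_a, player_c)  # 8**2 + 2**2 = 68
print(d_ab, d_ac)
```

The same function works unchanged for three, four, or forty stats per player, which is the whole appeal.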
And we can use whatever criteria or stats are of interest.  Want to look at player height and weight?  Want to look at career OBP and SLG and do it up to age 29?  Want to include every major leaguer ever?  Want to look at projected stats?  That’s fine.  Your CPU will groan a little more, but it can be done.  It’s just an engineering problem.
So, let’s run a little example.  Let me take the 2007 seasonal stats and calculate K rate, BB rate, and HR rate (all per PA), and BABIP.  I kept it to those hitters who had 200 PA or more (even though I spent way too much time arguing that more than 200 PA were needed for BABIP to be reliable enough to use… I’m just illustrating here), leaving me with 341 players.  I asked my computer to give me a proximity matrix.  (Technical note: I re-scaled everything to a range of -1 to +1, which mathematically makes things better.)
Then I tried to post this matrix so that everyone could see it.  The problem is that only 256 columns (variables) can be put into an Excel file (there are 341 players here), and when I tried to post it as pure text, the file reached 578 KB in size.  Google Docs has a limit of 500 KB for text files.  If anyone wants the document, just e-mail me.  I prefer to keep everything I do open-source.
To give you an idea though of how it might work, and again only using the four stats above (more on that in a minute), let’s look at recent free agent debate-starter, Torii Hunter.  Whom, in terms of 2007 performance, did Torii most resemble?  Hunter hit a HR 4.3% of the time, struck out 15.5% of the time, walked 6.2% of the time, and had a BABIP of .306.
Top 5 matches:

  1. Adrian Beltre (4.1%/16.3%/5.9%/.297)
  2. Brandon Phillips (4.3%/15.5%/4.7%/.307)
  3. Alex Gonzalez (3.7%/17.4%/5.6%/.301)
  4. Damion Easley (4.6%/16.1%/8.7%/.297)
  5. Ryan Garko (3.9%/17.4%/6.3%/.322)
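For readers who want to try this at home, here is a toy sketch of the whole pipeline on just the six players named above: rescale each stat to a range of -1 to +1, build the proximity matrix, and rank everyone by distance from Hunter.  Two caveats.  The min-max rescaling is my assumption about how the re-scaling might be done, and because the scaling here uses six players and not the full 341, the ordering it prints need not match the list above.

```python
# 2007 rates as listed above: (HR %, K %, BB %, BABIP)
players = {
    "Torii Hunter": (4.3, 15.5, 6.2, 0.306),
    "Adrian Beltre": (4.1, 16.3, 5.9, 0.297),
    "Brandon Phillips": (4.3, 15.5, 4.7, 0.307),
    "Alex Gonzalez": (3.7, 17.4, 5.6, 0.301),
    "Damion Easley": (4.6, 16.1, 8.7, 0.297),
    "Ryan Garko": (3.9, 17.4, 6.3, 0.322),
}


def rescale(rows):
    """Min-max rescale each column to [-1, +1] (assumed method)."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append(
            [0.0 if span == 0 else 2 * (v - low) / span - 1 for v in column]
        )
    return [tuple(row) for row in zip(*scaled_columns)]


def proximity_matrix(rows):
    """Squared Euclidean distance between every pair of rows."""
    return [[sum((x - y) ** 2 for x, y in zip(a, b)) for b in rows] for a in rows]


names = list(players)
scaled = rescale([players[name] for name in names])
matrix = proximity_matrix(scaled)

hunter = names.index("Torii Hunter")
ranked = sorted(
    (name for name in names if name != "Torii Hunter"),
    key=lambda name: matrix[hunter][names.index(name)],
)
for name in ranked:
    print(name, round(matrix[hunter][names.index(name)], 3))
```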

You’ll notice that none of those gentlemen are center fielders by trade, which is something that James’s system does take into account, however imprecisely.   It’s my understanding that a categorical variable (primary position) can be entered into the matrix and that can be controlled for.  (I used hierarchical clustering… I believe that would be two-step clustering.)
Now, I picked these four stats because they were easy to calculate and they do a decent enough job of encapsulating a player’s performance over a year, and that was all I needed for a quick example.  I’m fully expecting that the careful reader out there is already thinking “But those aren’t the best 4 stats.  You need to include/take out/replace….”  And that’s fine.  In fact, I’m counting on it.  It’s an interesting question.  What suite of stats would work best in here?  What stats would fully encapsulate a player’s abilities?  In other words, when you compare a player to some other player, what type of criteria do you use to make the comparison?  Does it depend on the question you’re trying to answer?  Pitchers?  Defense?  Hmmm…

The playoffs, The Gambler's fallacy, and The 50-50-90 rule

One of the basic rules of statistics applied to last night’s Game 7 of the ALCS:  The 50-50-90 Rule.  If there’s something that’s a 50/50 shot for the team for which you are cheering, your team will lose 90% of the time.  That is, unless you’re a Red Sox fan.  But I grew up in Cleveland and my first coherent memories are of watching the Cleveland Indians.  (True story.)  This is an iron-clad rule of statistics.  You can look it up.
I work in a hospital, and in the emergency room, they have a measure called “Subjective Units of Discomfort” (SUDs), to measure people’s level of pain when they come in.  It goes from 1 to 10.  Being a practicing Sabermetrician and a psychologist, I felt the best way to cope with this turn of events would be to make a new statistic that would adequately capture the magnitude of what happened.  I thought about calling it Pizza Cutter Depression Probability Added (this was a particularly high leverage game for that particular stat).  Finally, I settled on Subjective Units for Cleveland Knockouts (I’ll let you do the acronym).  The formula is Opponent wins x 2.5.  The scale goes from 0 to 10.
But then again, I should have known it was coming.  My wife, who’s never wrong (and the sentence should end right there, just ask her… although she did wonder out loud why Travis Hafner wasn’t trying to steal third), said this morning that she had a feeling the Indians would win.  She’s had a hot hand on picking these 50/50 shots, but this morning, we found out that one of her picks for the sex of one of the 457234 babies that are being born to people we know within the next few months was wrong.  (She called boy.  They’re having a girl.  She also picked a Cubs-Angels World Series.)  Looks like her hand has gone cold.
With that said, I would warn fans of the Red Sox and Rockies to watch out for a (real) property of statistics: The Gambler’s Fallacy.  Consider the simplest of all games of chance: the flip of a coin.  Suppose that you flip a coin ten times, and ten times in a row it comes up heads.  What are the chances that the next flip will be tails?  Did you say something other than “Fifty percent?”  Did you mumble something about the “Law of Averages?”  Sound like a baseball team about which Dane Cook has been yelling all week?
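If you don't trust the “fifty percent” answer, you can count it out.  The sketch below lists every possible sequence of eleven fair coin flips, keeps only the ones that open with ten heads, and asks what fraction of those end in tails.  It assumes a fair coin and independent flips; the function name is made up.

```python
from fractions import Fraction
from itertools import product


def tails_after_heads_streak(streak):
    """Exact chance of tails on the flip right after `streak` heads in a row."""
    # Every equally likely sequence of streak + 1 flips that opens with all heads.
    matching = [
        flips
        for flips in product("HT", repeat=streak + 1)
        if all(flip == "H" for flip in flips[:streak])
    ]
    tails_next = sum(1 for flips in matching if flips[-1] == "T")
    return Fraction(tails_next, len(matching))


print(tails_after_heads_streak(10))
```

Only two sequences survive the filter, one ending in heads and one in tails, so the streak tells you nothing about the next flip.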
Red Sox fans will probably be saying to themselves that they are sure to win the World Series because the Rockies are “due to lose.”  Rockies fans will probably be saying to themselves that they are “on a roll” and will win the World Series because of momentum.  Of course, one of them will be proven “right” in the next week and a half.  In fact, neither one is right.  The Red Sox had a better regular season record and a better Pythagorean record, plus they have four games at home to the Rockies’ three.  So, the Red Sox are the favorites.  But each game starts at 0-0, so the probabilities of either team winning reset themselves after each game.
Now, the other thing that will be bandied about is that “In a short series, anything can happen.”  This is a nice way of saying that a seven-game series is an inadequate sample size from which to determine the relative quality of the two teams.  Which is true.  If I were to submit something to a scientific journal with an N = 7, I would have the paper sent back to me with a laugh.  In baseball, you get a trophy for your efforts.  Still, there’s a part of me that wishes that the Indians were part of that inadequate sample size of independent events.
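To put a number on “anything can happen,” here is a rough sketch of how often the better team wins a best-of-seven if every game is independent and the favorite wins each one with the same probability.  The 55% figure is purely hypothetical (it is not an estimate for either World Series team), and the sketch ignores home-field advantage.

```python
from math import comb


def series_win_probability(p, wins_needed=4):
    """Chance of winning a first-to-`wins_needed` series with per-game win chance p."""
    q = 1 - p
    # The series is won in the game of the 4th win, after k losses (k = 0..3).
    return sum(
        comb(wins_needed - 1 + k, k) * p**wins_needed * q**k
        for k in range(wins_needed)
    )


# Hypothetical favorite that wins 55% of individual games.
print(round(series_win_probability(0.55), 3))
```

Under those assumptions the 55% team takes the series only about 61% of the time, so the lesser team gets the trophy roughly two years in five.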
After the game, my wife, in an attempt to console me, said that she didn’t think of it so much as losing a series, but re-gaining a husband.