Adjusted W-L: A Study of the Unlucky

If you have read any of my work on Starting Pitchers and SP Effectiveness, it will come as no surprise that I strongly dislike Win-Loss records.
In the 2005 season, Johan Santana posted the following numbers:

  • 16-7 actual W-L
  • 2.87 ERA
  • 7.02 IP/gm
  • 231.2 IP
  • 0.97 WHIP
  • 5.29 K:BB
  • 3 CG/2 SHO
  • 33 Games Started

In 2005, Bartolo Colon won the AL Cy Young Award.  Any idea how many of the above categories, which we all intuitively equate to pitching effectiveness, Colon outranked Santana in?
One.  One category.  Colon beat Santana in only one category in 2005.  Care to venture a guess as to which it was?  Combine my sarcastic tone with the title and first line of this article if you need help.  That's right: the one category he outperformed Santana in was WINS, 21-16.  Santana outperformed Colon in every other statistical category in 2005 and somehow lost the Cy Young.  Not to take anything away from Colon's season, but he clearly did not perform better than Santana in any category other than wins, and they made the same number of starts.  And to say that the Angels made the playoffs strictly because of Colon is well past borderline ridiculous.
For reasons unknown to me, W-L record has become an extremely significant barometer for measuring the quality of a season and of a career.  We invest a ton of stock in a statistic that paints only half of the portrait.  Ask yourself this – what does a W-L record tell us?
Does it provide a ratio of how often someone pitched well to how often he didn’t?  No, because a Win does not always equate to a well-pitched game and a loss does not always equate to a poorly-pitched game.
Does it take into account the fact that some teams score more than others?  No, because you get credited with a win as long as you last at least five innings, leave with the lead, and your team never relinquishes that lead.  It does not matter if you give up six runs in seven innings as long as you meet those criteria.
A few weeks back I introduced my statistic, AQS – Adjusted Quality Start, which refers to when a pitcher either goes 6+ IP while surrendering 3 or fewer earned runs, or 7.2+ IP while surrendering no more than 4 earned runs.  Using the AQS allows us to find the ratio, mentioned in the question above, of how often a pitcher performed well compared to how often he did not.  Regardless of whether you received the deserved decision, or whether you even received a decision, meeting the AQS criteria means you pitched well and, in theory, deserved to win.
Springboarding off of the AQS, I began to separate W-L records into what they really are – a combination of Cheap Wins, Tough Losses, Legitimate Wins, and Legitimate Losses.  The legitimate decisions refer to games in which a pitcher either recorded an AQS and won, or did not record an AQS and lost.  The reverse can be said for the Cheap Wins and Tough Losses: failing to record an AQS and getting a win really should not happen, and the same can be said for garnering a loss while recording an AQS.
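To make the bookkeeping concrete, here is a minimal Python sketch of the AQS test and the four-way decision split.  Expressing innings as outs (7.2 IP = 23 outs) is my own convention for dodging baseball's .1/.2 decimal notation, not something out of my spreadsheets.

```python
def is_aqs(ip_outs, er):
    """Adjusted Quality Start: 6+ IP (18 outs) with 3 or fewer earned
    runs, OR 7.2+ IP (23 outs) with 4 or fewer earned runs."""
    return (ip_outs >= 18 and er <= 3) or (ip_outs >= 23 and er <= 4)

def classify_decision(decision, ip_outs, er):
    """Split a start into the categories above. decision is 'W', 'L',
    or None for a no-decision."""
    aqs = is_aqs(ip_outs, er)
    if decision == 'W':
        return 'legit win' if aqs else 'cheap win'
    if decision == 'L':
        return 'tough loss' if aqs else 'legit loss'
    return 'ND-AQS' if aqs else 'ND'
```

A 7-inning, 2-run loss comes back as a tough loss; a 5-inning, 6-run win comes back as a cheap win.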
I will use the 2007 season of John Smoltz to put this to use.  By all accounts he had a great year but he often gets lost in the Peavy/Webb shuffle when discussing the best in the NL this past season.  Peavy won 19 games, Webb won 18, and Smoltz only won 14.  Something deep down tells us that Smoltz had a better season than his 14-8 record would indicate, but how much better?
Looking more closely at his 14-8, we see that he had 0 Cheap Wins, 5 Tough Losses, 14 Legit Wins, and 3 Legit Losses.
If we take the Cheapies and Toughies out, Smoltz is left with a 14-3 record of legitimate decisions.  I want to go a bit further, though, because he recorded 22 decisions no matter how we look at it.  He legitimately deserved to go 14-3, but there were five games he lost that he pitched well enough to win.
With that in mind, I began to adjust the W-L records of pitchers and see what would happen if they were credited with a Win for every Tough Loss and a Loss for every Cheap Win, on top of the Legit Wins and Legit Losses.
When we apply that to Smoltz, his 2007 Adjusted W-L would be 19-3.  When we do the same to Peavy and Webb we get a 21-4 record for Peavy and a 20-8 record for Webb.
Essentially, Smoltz should have won 19 of his 22 decisions, Peavy should have won 21 of his 25 decisions, and Webb should have won 20 of his 28 decisions. 
If we are going to use W-L record as a barometer of quality, then we should use this Adjusted W-L instead since it actually does give us the ratio of how many times a pitcher performed well relative to the decisions he received.
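In code, the adjustment is a minimal one-liner on top of the split above – flip every Tough Loss to a win and every Cheap Win to a loss:

```python
def adjusted_record(cheap_w, tough_l, legit_w, legit_l):
    """Adjusted W-L: a win for every Tough Loss and a loss for every
    Cheap Win, on top of the legitimate decisions."""
    return legit_w + tough_l, legit_l + cheap_w

# Smoltz 2007: 0 cheap wins, 5 tough losses, 14 legit wins, 3 legit losses
print(adjusted_record(0, 5, 14, 3))   # (19, 3)
```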
Below is a table featuring the Actual W-L records and the Adjusted W-L records of some NL pitchers from 2007.

  • Jake Peavy: 19-6 actual, 21-4 adjusted
  • John Smoltz: 14-8 actual, 19-3 adjusted
  • Cole Hamels: 15-5 actual
  • Brad Penny: 16-4 actual, 19-1 adjusted
  • Tim Hudson: 16-10 actual
  • Ted Lilly: 15-8 actual
  • Matt Cain: 7-16 actual, 16-7 adjusted
  • Ian Snell: 9-12 actual, 13-8 adjusted
  • Dontrelle Willis: 10-15 actual, 15-10 adjusted
  • Adam Eaton: 10-10 actual, 6-14 adjusted

As we can see, Brad Penny had the best Adjusted W-L of any NL pitcher, as he truly deserved to lose only one of his decisions.  If he had received proper run support and been a bit luckier in the games in which he recorded decisions, he would have posted a 19-1 record.  I wonder if the Cy Young picture would have looked different if he had.
Look at the cases of Matt Cain, Dontrelle Willis, and Adam Eaton.  Cain finished the season with an actual W-L of 7-16, even though he deserved to go 16-7.  That means he was unlucky nine times.  Dontrelle Willis should have been 15-10 even though he ended up 10-15, meaning he was unlucky five times.  Yes, by all accounts Dontrelle had a down season, but he really did deserve to win 15 of his decisions; it was how bad his ten deserved losses and his no-decisions were that turned his season upside down.
On the flip side, Adam Eaton finished the season 10-10, even though he deserved to be 6-14.  While Cain and Willis were very unlucky, Eaton turned out to be lucky four times.
When we look at the number of Cheap Wins and Tough Losses, we can take the difference, express it as a + or – number, and detail which pitchers were the luckiest and unluckiest.  This is a bit different from the Pythagorean formulas used to determine what a team's record should be.  The team formulas look at the season as a whole and estimate what an overall record should be based on how many overall runs are scored and given up.
It does not make sense to use that here, because if a pitcher gives up 10 runs in Game 1 and 1 run in Game 2, the averaging smears them into two mediocre starts, even though the starts are completely separate and the damage was done in one game.  The team formulas evaluate the entire forest without looking at each individual tree.
Looking at each individual tree needs to be done to really show which pitchers were luckiest and unluckiest.
In the case of Cain, he had 0 Cheap Wins and 9 Tough Losses.  Net Luck = 0 – 9, meaning that Cain had a Net Luck Rating of -9, or in other words was very unlucky.  There were no recorded Wins that he should have lost but there were nine recorded losses he should have won, or at least not recorded a loss.
Adam Eaton had 5 Cheap Wins and 1 Tough Loss.  5 – 1 = 4.  Eaton’s Net Luck was +4, meaning he was lucky four times.  Positive numbers correspond to being lucky, negative numbers correspond to being unlucky, and 0 corresponds to receiving exactly what you should have received.
Aaron Harang was 16-6 with 0 Cheap Wins and 0 Tough Losses.  He had a great season and deserved to go 16-6 in his decisions.  He would have a Net Luck Rating of 0, since he was not lucky or unlucky.
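The whole calculation fits in one line; a sketch, with the three examples above as sanity checks:

```python
def net_luck(cheap_w, tough_l):
    """Positive = lucky, negative = unlucky, 0 = got what you deserved."""
    return cheap_w - tough_l

print(net_luck(0, 9))   # Cain: -9
print(net_luck(5, 1))   # Eaton: +4
print(net_luck(0, 0))   # Harang: 0
```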
When pitchers tie in either luck or lack of luck, the statistic we should look to is AQS %, the percentage of starts in which a pitcher recorded an AQS.  With lucky pitchers, a lower AQS % tells us they pitched well less often, so the luckiest pitcher is the one who recorded the most Net Luck while pitching well the least often.  For unlucky pitchers we look at the highest percentage, because it tells us that the pitcher was not only unlucky enough to lose games he should have won but also pitched well a higher percentage of the time.
For instance, Scott Olsen, Adam Eaton, and Byung-Hyun Kim all tied with a +4 Net Luck Rating, meaning they were the luckiest NL pitchers.  Olsen had an AQS % of 33.3, Kim 27.3, and Eaton 26.7.  Therefore, Adam Eaton was the luckiest NL pitcher, because he received four unmerited positive decisions and pitched well the least often.
Though Cain, Bronson Arroyo, and Derek Lowe all finished with more negative Net Luck ratings than Dontrelle and Smoltz, the latter two tied at -5.  Dontrelle had an AQS % of 57.1 and Smoltz 84.4.  Therefore, Smoltz was unluckier than Willis, because he received five unmerited negative decisions while pitching well far more often.
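One handy property of this tiebreaker: a single sort key handles both ends of the list.  A hypothetical sketch, with names and numbers taken from the examples above:

```python
# (name, net_luck, aqs_pct)
pitchers = [
    ("Scott Olsen",      +4, 33.3),
    ("Byung-Hyun Kim",   +4, 27.3),
    ("Adam Eaton",       +4, 26.7),
    ("Dontrelle Willis", -5, 57.1),
    ("John Smoltz",      -5, 84.4),
]

# Luckiest first: descending Net Luck, and within a tie the lower
# AQS % (pitched well least often) ranks as luckier. The same key
# pushes the high-AQS%, unlucky pitchers to the very bottom.
ranked = sorted(pitchers, key=lambda p: (-p[1], p[2]))
```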
When we apply Net Luck to every pitcher in 2007, in both the NL and AL, we get the following results –

  • Luckiest NL SP = Adam Eaton (PHI), +4
  • Luckiest AL SP = Odalis Perez (KC), +4
  • Unluckiest NL SP = Matt Cain (SF), -9
  • Unluckiest AL SP = Dan Haren (OAK), -6

Though Haren pitched well and still finished 15-9, he should have been 21-3.  Odalis Perez actually tied Felix Hernandez of the Mariners at +4, but Hernandez's AQS % was 57.1 whereas Perez came in at 30.8.
Honorable Mentions for Luck in 2007 go to:

  • Scott Olsen, +4
  • Byung-Hyun Kim, +4
  • Paul Byrd, +3
  • Boof Bonser, +3
  • Jeremy Bonderman, +3

Honorable Unlucky Mentions in 2007 go to:

  • Bronson Arroyo, -7
  • Derek Lowe, -6
  • John Smoltz, -5
  • Mark Buehrle, -5
  • Gil Meche, -5
  • Dontrelle Willis, -5

Though I do not have all of the data compiled right now, something I am going to investigate over the next few weeks is which pitchers, from 2000-2007, have been the luckiest and unluckiest.
Another usage of Net Luck that fascinates me, and that I am currently researching for my book, involves an application to 300 game-winners, as well as those who are close.  Something tells me that I will find some guys with 300 wins who maybe should not have 300 wins, as well as some guys who are short of 300 that really should have it.  After all, if we are going to use 300 wins as a Hall of Fame barometer, we should at least make sure the wins are deserved.
I am currently involved in conducting this research and if anyone would like to help, please get in touch with me.


2007 American League SP Analysis

A couple of weeks ago, I presented the Seidman SP-Effectiveness Model, which takes the statistics we most associate with effective pitching and weights them with points based on how important or rare they are.  The system is designed to level the field of play between those on good or bad teams, those with or without run support, and those who were called up or injured and those who were just plain bad.
Not surprisingly, Jake Peavy ended up first, five points ahead of his competition, but the order of those who followed him turned out to be a bit more surprising than I expected.  Everything made proper sense, though, because a pitcher cannot be blamed for his team not scoring for him or for not getting decisions in brilliantly pitched games.
Essentially, my SP-Effectiveness Model answers the question – What would happen if a pitcher was rewarded every time he pitched well and negated every time he pitched poorly?
I also introduced my statistic, the AQS, or Adjusted Quality Start, which extends the general rule of 6+ IP and 3 or fewer ER to also include games of 7.2+ IP and 4 or fewer ER.  Based on my analysis of innings pitched by starters and how often they were lifted for relievers, coming just one out short of finishing the eighth inning truly merits being allowed to give up that fourth run.
If you have not yet read the NL article on this same subject, I highly suggest you click the link below – that way you will understand the rubric and reasoning.
To read the NL 2007 SP-Effectiveness article, and see the results, click here
In this article, I am applying my model to 2007 American League pitchers.  Just like the NL, there were some expected results, as well as some initially peculiar results that make sense upon further thought.  Additionally, just like with my NL post, I did not apply this to every American League pitcher.  Instead, I selected 1-3 pitchers from each AL team.  Before the 2008 season begins I will plug every pitcher from both leagues into my system to see who was worst – which is always fun.
I will not explain all of the statistics or points values, since I did that in the previous post on the NL, but I will say that I did consider the fact that AL managers do not have to worry about pinch-hitters.  Because of this, I considered making the IP requirements more stringent for the AL, but even though AL starters do not need to be removed for pinch-hitters, they face an extra offensive player (a hitter instead of a pitcher in the 9th spot).  They should, in theory, give up more runs and have just as good a reason to come out of a game.
Overall, though, only a few more AL pitchers than NL pitchers had over 225 IP, so it was not worth changing.  The biggest difference between the leagues was the average IP/game of the selected pitchers.  AL starting pitchers accounted for 66.2% of their league's total IP in 2007, whereas NL starting pitchers accounted for 63.5%.  Though the numbers are pretty close, when we are dealing with over 23,000 IP in a league, that extra 2.7% equates to approximately 600 IP.

  • To view the raw statistics of all the pitchers used, click here.
  • To view the list of AL SP used in the order of effectiveness points, click here

Again, if you wonder why certain statistics are used and/or why they were assigned certain points, please read the previous NL article linked at the top of the page.
I do not want to post a table of 28-30 pitchers, so you will have to click the link to view the results spreadsheet, but I will list the top ones below.

  1. CC Sabathia, +84
  2. Dan Haren, +76
  3. Fausto Carmona, +74
  4. John Lackey, +72
  5. Roy Halladay, +68
  6. Johan Santana, +60
  7. Mark Buehrle, +59
  8. Josh Beckett, +58
  9. Justin Verlander, +58
  10. James Shields, +57
  11. Javier Vazquez, +57
  12. Kelvim Escobar, +57
  13. Joe Blanton, +57

In the National League, the odd ranking was Chris Young, whose headline statistics suggested he should have ranked higher.  In the AL, Beckett falls into the same category.  The issue has nothing to do with Beckett's numbers, but rather the fact that there were other pitchers who were not as lucky as he was in getting run support or solid bullpen help.
Of the players listed above Beckett, Santana and Haren each had 7 tough losses, Buehrle and Lackey each had 5, Halladay led MLB in IP/gm and CG, and Carmona had more legit wins and fewer legit losses.
Essentially, there is nothing wrong with Beckett's 2007 numbers; there were simply other pitchers who performed better in certain areas than he did.
The Red Sox had a dynamite bullpen, so going to Okajima or Papelbon was something that just about any manager would feel comfortable and justified in doing, whereas some of these other teams needed their starters to last longer. 
No, this system does not take into account any sort of clutch factor, where I am sure Beckett would excel, but it does level the playing field to show which pitchers were the most effective, based on the numbers they individually put up. 
Just like the conclusion drawn in the Snell/Zambrano comparison, this is all about consistency.  The quality of Josh Beckett's AQSs may have been far greater than those of the other pitchers, but they occurred less frequently.  Even though his good-to-great games may have been astounding, when he was having average or bad games, the other pitchers were still having good-to-great ones.
Beckett had an AQS 67% of the time (20 of 30 starts) while those listed were 73% and higher. This is not necessarily a measure of how good a pitcher was in his good games, but rather how often he was good.
One of the major reasons we considered Beckett to have been so good this past season was his record.  If he had gone 15-9, like Dan Haren, there would not have been a Cy Young debate.
That tends to be a problem because, as I will get into below, W-L records do not differentiate between Cheap Wins and Tough Losses.  If we gave every pitcher a Win for each Tough Loss, and a Loss for each Cheap Win, Beckett's record would not have been 20-7.  It would have been 19-8.
There is not a huge difference between his 20-7 and 19-8, but when we do the same for the AL pitchers above him in points, we get the following records: Sabathia (21-5), Carmona (23-4), Santana (19-9), Haren (21-3), Buehrle (15-4), Lackey (22-6). 
If we are going to use W-L record as a barometer, and include these Tough Losses and Cheap Wins, all of those above records are either better than or equivalent to Beckett’s 19-8.
Based on the Adjusted W-L records alone, if we were to use that as the barometer for the Cy Young Award or the best pitcher, the debate would not be between Sabathia and Beckett – it would be between Haren and Carmona.  I am not saying it should have been between Haren and Carmona, but rather that if we are going to use W-L as an "end-all" statistical solution, we should at least use the Adjusted W-L, or the True W-L.
I described the different types of wins in the NL article but I did not mention the statistic “True W-L Record.”  In order to properly evaluate pitchers, W-L records have to be broken down and examined.  Some pitchers will get tremendous run support and win games even if they only last 5.1 innings and give up 4-5 runs. 
Then there are some who will go 6.2-7.1 innings, give up 2-3 runs, and lose.  After separating these Cheap Wins and Tough Losses from a W-L record, we are left with a record of legitimate wins and losses – games that a pitcher deserved to win or lose based on performance. 
A legit win occurs when you record an AQS and win, and a legit loss occurs when you do not record an AQS and lose.
The difference between True W-L and the Adjusted W-L I used in the Beckett comparison is that the True W-L does not include Cheap Wins or Tough Losses.  True W-L only includes games in which the pitcher recorded a win or loss when either decision was merited.
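Reusing the classify_decision sketch from the first article above, the True W-L is just a tally of the merited decisions:

```python
from collections import Counter

def true_record(games):
    """games: iterable of (decision, ip_outs, er) tuples for a season.
    Returns the True W-L: only AQS wins and non-AQS losses count."""
    tally = Counter(classify_decision(d, outs, er) for d, outs, er in games)
    return tally['legit win'], tally['legit loss']
```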
You can see these True W-L Records in the raw statistics spreadsheet, but I have listed the best ones below.  In parenthesis next to the True W-L Records are the Actual W-L Records.

  • Dan Haren, 14-2, (15-9)
  • Kelvim Escobar, 14-3, (18-7)
  • Fausto Carmona, 18-3, (19-8)
  • Josh Beckett, 17-5, (20-7)
  • Chien-Ming Wang, 17-5, (19-7)
  • CC Sabathia, 17-5, (19-7)

Again, we see that if win-loss was to be the “end-all” tool to evaluate a Cy Young Award or the best pitchers, Haren and Carmona would be atop the list.
For fun, I decided to plug some legendary seasons into my system to see what the end results were. Yes, it is impossible to perfectly compare a season from 1966 to one from 1996, but still it is interesting to see how they would rank. To do this, I took the 1968 season of Gibson, the 1995 season of Maddux, and the 2000 season of Martinez. The points results for the three were:

  • Bob Gibson, 1968, +178 pts
  • Pedro Martinez, 2000, +104 pts
  • Greg Maddux, 1995, +97 pts

And there you have it.  By the middle of February I should have a spreadsheet/PDF made up of all NL and AL pitchers plugged into this effectiveness model.  That way we can see who were the absolute worst as I am sure we will find some surprises and unexpected names there.
The biggest surprises to me in both leagues, in a positive turn, were Bronson Arroyo and James Shields.
The most unexpected finishes were Beckett and Chris Young, as I predicted they would be higher.
An interesting thing to look at is how players on the same team ranked next to each other.  In the NL, Zambrano is widely thought of as the Cubs' #1, yet Ted Lilly finished much higher.  In the AL, Kazmir is definitely thought of as the Rays' ace, yet Shields ranked 9th out of the pitchers used here, while Kazmir finished 20th.
And, since the Yankees have to be stubborn, both Pettitte and Wang tied in effectiveness points. 
This model is not the end-all solution for determining who the best pitchers are in a given year, but it is a darn good estimator, since it equalizes the field of play and makes it clear that you do not have to be on a great team to be a great pitcher or to have a very effective year.
This measures a specific season, where some players may be better than others, even if they are nowhere near better in a retrospective look at their careers. 

2007 NL Starting Pitching Analysis

When it comes to analyzing and comparing pitchers, those conducting the comparisons will often find themselves in a tricky situation.  Sure, certain pitchers are better than others, but what are they specifically better at? 

How can we conduct an honest analysis when there are so many variables to consider?  And how can we truly determine which pitchers were better than others when some are on terrible teams with no run support and others are on tremendous teams with tons of run support?
The first step is to determine what we are measuring.  If we want to know who the best strikeout pitcher is, we should look at the raw total for strikeouts and also an average of K/IP, since some guys make fewer starts than others.  To figure out who walks the least, we measure the number of walks each pitcher gives up and a walks-per-IP ratio.
These measurements are contingent on one category, though, and cannot tell us who is better or more effective than the rest.  All of the research and ideas presented in this article are designed to measure the “effectiveness” of a pitcher. 
In order to determine this effectiveness, a whole heck of a lot of numbers need to be measured and properly weighted/scaled so that everybody has a fair shot – whether or not they are on a great team.
I took the 1-3 best pitchers from each National League team and entered their statistics into a database, measuring everything from their raw Innings Pitched totals to their Adjusted Quality Start % (you’ll read more on that below).  After entering all of the statistics, and crunching numbers until my brain turned to mush, I came up with my weighted points system.  I assigned the corresponding point totals and added everything up to determine what I feel is a very accurate measurement of pitching effectiveness amongst the NL’s best. 
This was not applied to every single NL Pitcher in 2007 (I will do that another time) but rather amongst these 30 selected #1, #2, or #3 starters.  For instance, a guy like Jeff Suppan may have been more effective than Jason Bergmann but I wanted to have at least one person from each team.
The system is not 100% perfect and does not take into account every single statistic (do you know how many statistics there are??), but it definitely levels the playing field between those on good or bad teams, those injured/called up or just plain bad, and those who got lucky or unlucky with run support.  The points are assigned based on the areas I, as an intense student of the game, feel are most important to determine true effectiveness. 
The basic idea of this system is to measure the true quality of a pitcher over his season – i.e., what would happen if a pitcher were rewarded every time he pitched well and discredited every time he pitched poorly – something that happens perfectly just about 0% of the time.
We will begin by going over the statistics involved, what their points scale was, and why they are used.  The idea behind the corresponding point totals is to properly weight the areas most people intuitively associate with success and quality.
The points given to each statistical subset are designed to separate the aces from the workhorses and the workhorses from the seemingly replacement-level pitchers.  They may seem arbitrary and could be replaced with different numbers, or fractions/decimals; however, the difference between the points in subsets was based on the number of pitchers who fall into certain categories.
In order to be as effective as possible, a pitcher needs to make as many starts as he can.  How can we say that a pitcher with 14 starts is more effective than one with 34-35, even if his numbers in those 14 starts are tremendous and the numbers of the one with 34-35 are a bit worse?  His numbers may be better, but the latter pitcher was involved in 21 more games, proved durable enough to pitch an entire season, and was solid enough to maintain his SP status for 162 games.
This does not mean that a pitcher with 35 starts is necessarily “better” than one with 14-16, but rather he is more effective because he is involved in more of his team’s season. 
If the pitcher with 14-16 starts posted the same numbers in 32 starts, it would not be a contest.  But, he didn’t – it was only 14-16.  You cannot have as much of an effect on your team (actual play, not motivational or anything) unless you are out there as often as possible.
***What the end result of this effectiveness points system showed is that those with average numbers, over 30+ starts, were equally as effective, or slightly better/worse, than those with good numbers over 16-20 starts.***
If somebody makes only 14 starts in a season, it could be because he was injured for half of the season or was called up from the minors during the season, so he should not be penalized with negative points for that – he just should not be rewarded as highly as someone with 30+ starts.

  • if over 30 starts, +5
  • if 25-29 starts, +3
  • if 20-24 starts, +2
  • if under 20 starts, 0

Just like Games Started, IP can only get you positive numbers, because a low raw number of IP can be attributed to injury or a midseason call-up.  Those with more IP get higher point totals, though.  The reason the lowest tier gets only a token point is that under 100 innings does not necessarily make you a bad pitcher, but the lack of innings (whether due to injury or a call-up) limits your effectiveness.

  • if 230+, +8
  • if 220-229, +7
  • if 200-219, +5
  • if 150-199, +3
  • if 100-149, +2
  • if under 100, +1

This is where negative numbers can begin.  If you were hurt, or called up from the minors, you are not penalized with negatives for the raw number of innings pitched or games started, but if you posted a high number of starts and a low number of innings, this statistic will bite you in the rear.  IP/Game separates the hurt or called up from the downright below average or bad.  It also helps reward those with a couple fewer starts than others but more raw innings pitched; these pitchers were in the same GS range, but some went deeper into games than others.  Nobody averaged over 7 IP/gm, so we start lower.

  • if 6.5-7 IP/gm, +7
  • if 6.0-6.49 IP/gm, +5
  • if 5.5-5.99 IP/gm, +3
  • if 5.0-5.49 IP/gm, 0
  • if below 5.0 IP/gm, -5

If you cannot average over 5 innings per game, or exactly 5 innings per game, you should not be a starting pitcher.  Even Adam Eaton averaged over 5 IP/gm in 2007.
Quality Starts can be an inaccurate statistic because it takes into account games in which a pitcher goes 6+ innings and gives up no more than 3 earned runs… and nothing else.
If a pitcher goes 8.1 innings and gives up 4 runs, it is arguably the same ratio and an equal game in terms of quality, but it does not get counted as a quality start.
With that in mind, I came up with the stat of Adjusted Quality Starts, which takes into account all regular quality starts as well as games in which someone goes 7.2-9 innings and gives up no more than 4 runs.  This measures the true number of games in which a pitcher had a good-great performance.
***If you wonder why it is 7.2 IP, instead of 8, the number was derived from the number of times a pitcher was lifted after 7.2 IP for a specialist, or other sort of reliever, and from the sheer low average of innings pitched per game by a starter this year.  Reaching the 7th inning is now a great feat, let alone coming within one out of finishing the 8th.  Though the previous ratio for a QS was 2:1, due to the data mentioned above, going an extra 1.2 IP to get to 7.2 IP merits being able to give up one more run.***
I used the percentage of AQS to the total number of Games Started to measure effectiveness in this area.  Someone over 75% almost always pitches a good-great game, whereas someone under 50% only pitches a good game less than half of the time – not very effective.

  • if AQS % is 75% or higher, +5
  • if AQS % is 67-74%, +3
  • if AQS % is 50-66%, 0
  • if AQS % is below 50%, -3

If you're keeping score at home, AQS = 6+ IP with ER =< 3, OR 7.2+ IP with ER =< 4, where =< is the blog version of less than or equal to.
In addition to AQS, something that needs to be taken into account is how often a pitcher threw a complete game, since they are so rare.  We also need to take shutouts into account, since they occur even less often.

  • For every CG, +2
  • For every SHO, additional +1

***NOTE: Aaron Harang had two games in 2007, one where he went 9 IP, and one where he went 10 IP, when he did not get a decision.  Even so, I am counting these 2 as a combined 1 CG, since he went 9+ innings.***
W-L records are the most deceiving statistics because they do not take into account the true quality of the games pitched.  Just because a pitcher goes 14-7 does not mean he was necessarily great.  He could have pitched terribly with great run support in 10 of his 14 wins, but brilliantly with terrible run support in his 7 losses.
The whole point of breaking W-L records down around the AQS is that recording one means you pitched well and should be rewarded, even if your team (offense or bullpen) does not help you.
After all, Ian Snell cannot control the Pirates’ offense.  It is not his fault that 4 of his 12 losses were “Tough Losses” and all 11 of his No-Decisions were games in which he pitched brilliantly and had an AQS, yet he received little to no offense to help garner him a ‘W’.
With that in mind, I changed W-L to the following 5 stats:

  • Cheap Wins: wins in which one does not get an AQS (-1)
  • Tough Losses: losses in which one does get an AQS (+2)
  • Legit Wins: wins in which one does get an AQS (+2)
  • Legit Losses: losses in which one does not get an AQS (-2)
  • ND-AQS: no-decisions in which one gets an AQS (+1)

I received some questions about how these numbers came to be.  To keep it simple, the statistics that actually affect the W-L record are valued higher (negatively and positively) than statistics like ND-AQS, which prevent a pitcher from winning but do not hurt him with a loss.
ND-non AQS is not used here for the same reason that Cheap Wins is only minus one: not every Cheap Win or ND-non AQS was a terrible start.  A large chunk of them were games in which a pitcher had a good outing but only went 5 or 5.1 innings.  A Cheap Win loses you a point (not two, only one) because you do not get an AQS but it does affect your win-loss record.  An ND-non AQS means you do not get an AQS but it does not affect your win-loss record, which is why I decided to just leave it out.
Though I am not too fond of this next statistic, and originally tinkered with separately evaluating H/IP and BB/IP, using WHIP just made things easier.  Though it does not tell us which pitchers walk less and give up more hits, or vice versa, or how many "empty innings" a pitcher had (innings where no baserunners reached), it does provide a valid average of baserunners to expect in a given game, since it is expressed per inning rather than per nine innings.

  • if WHIP 1.00-1.15, +3
  • if WHIP 1.16-1.25, +2
  • if WHIP 1.26-1.30, +1
  • if WHIP 1.31-1.40, 0
  • if WHIP above 1.40, -2

Instead of using raw strikeouts, I wanted to use the ratio of strikeouts to walks, since not every pitcher is a strikeout pitcher.  You do not have to be a strikeout pitcher to be an accurate one, and because of this I rewarded those with high K:BB ratios.  Greg Maddux struck out only 104 in 34 starts, but walked just 25 – a K:BB of 4.16.  This meant that Maddux kept runners off base by striking them out and not walking them.

  • if K:BB above 4, +7
  • if K:BB above 3, +5
  • if K:BB above 2, +3
  • if K:BB above 1, 0
  • if K:BB 1 or below, -3
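Pulling all of the scales above into one function, here is a sketch of the whole rubric in Python.  Boundary handling (what happens at exactly 30 starts, or with a WHIP below 1.00) is my reading of the scales rather than anything official, and IP should be passed as true fractional innings (216.1 in baseball notation = 216⅓):

```python
def effectiveness_points(p):
    """Score one pitcher-season. p is a dict of season totals."""
    pts = 0
    gs, ip = p["gs"], p["ip"]
    # Games Started
    if gs > 30:
        pts += 5
    elif gs >= 25:
        pts += 3
    elif gs >= 20:
        pts += 2
    # Raw innings pitched
    if ip >= 230:
        pts += 8
    elif ip >= 220:
        pts += 7
    elif ip >= 200:
        pts += 5
    elif ip >= 150:
        pts += 3
    elif ip >= 100:
        pts += 2
    else:
        pts += 1
    # Decision components
    pts += (-1 * p["cheap_w"] + 2 * p["tough_l"] + 2 * p["legit_w"]
            - 2 * p["legit_l"] + 1 * p["nd_aqs"])
    # AQS percentage (50-66% earns nothing either way)
    aqs_pct = 100.0 * p["aqs"] / gs
    if aqs_pct >= 75:
        pts += 5
    elif aqs_pct >= 67:
        pts += 3
    elif aqs_pct < 50:
        pts -= 3
    # Innings per start (5.0-5.49 earns nothing either way)
    ip_gm = ip / gs
    if ip_gm >= 6.5:
        pts += 7
    elif ip_gm >= 6.0:
        pts += 5
    elif ip_gm >= 5.5:
        pts += 3
    elif ip_gm < 5.0:
        pts -= 5
    # WHIP (1.31-1.40 earns nothing either way)
    whip = p["whip"]
    if whip <= 1.15:
        pts += 3
    elif whip <= 1.25:
        pts += 2
    elif whip <= 1.30:
        pts += 1
    elif whip > 1.40:
        pts -= 2
    # K:BB (between 1 and 2 earns nothing either way)
    kbb = p["kbb"]
    if kbb > 4:
        pts += 7
    elif kbb > 3:
        pts += 5
    elif kbb > 2:
        pts += 3
    elif kbb <= 1:
        pts -= 3
    # Complete games and shutouts
    pts += 2 * p["cg"] + p["sho"]
    return pts

snell = dict(gs=32, ip=208.0, cheap_w=0, tough_l=4, legit_w=9,
             legit_l=8, nd_aqs=11, aqs=24, whip=1.33, kbb=2.60,
             cg=1, sho=0)
print(effectiveness_points(snell))   # 48, matching the table below
```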

Now that we have the points, let’s test it out and put it to use.  We will use Ian Snell and Carlos Zambrano.
The table below shows Ian Snell’s 2007 numbers and points he receives for each in my points system.

Starts 32 +5
Innings 208.0 +5
Cheap W 0 0
Tough L 4 +8
Legit W 9 +18
Legit L 8 -16
ND-AQS 11 +11
AQS % 75% +5
IP/Game 6.52 +7
WHIP 1.33 0
K:BB 2.60 +3
CG 1 +2
SHO 0 0

When we add up all of these numbers, we get Snell's Effectiveness #, which comes to: +48.
Now, let’s look at Carlos Zambrano’s season numbers in the table below and add his point totals up.

Starts 34 +5
Innings 216.1 +5
Cheap W 0 0
Tough L 2 +4
Legit W 18 +36
Legit L 11 -22
ND-AQS 0 0
AQS % 53% 0
IP/Game 6.36 +5
WHIP 1.34 0
K:BB 1.75 0
CG 1 +2
SHO 0 0

We look at his numbers and add up the totals to get his Effectiveness #: +35.
Zambrano had more legit wins but also more legit losses, and of Zambrano's 3 no-decisions, none were ND-AQS, whereas all 11 of Snell's no-decisions were.
That tells us that if each player got a win for every game he pitched well, a loss for every game he did not pitch well (no AQS), and the only no-decisions left were games in which he pitched poorly or did not go a full 6 IP, their records would look like this (the sketch after the list makes the arithmetic explicit) –

  • Carlos Zambrano (18-13) would actually be 20-11
  • Ian Snell (9-12) would actually be 24-8
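A quick hypothetical helper matching that reallocation: every AQS becomes a win, only legit losses count as losses, and poorly pitched no-decisions stay no-decisions.

```python
def full_credit_record(legit_w, tough_l, nd_aqs, legit_l):
    """A win for every AQS, including ND-AQS; only legit losses count."""
    return legit_w + tough_l + nd_aqs, legit_l

print(full_credit_record(9, 4, 11, 8))    # Snell: (24, 8)
print(full_credit_record(18, 2, 0, 11))   # Zambrano: (20, 11)
```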

Snell went further into his games, had a better K:BB ratio, and had that higher AQS %.  It also tells us that of Snell’s 32 starts, 24 of them were of great quality, whereas Zambrano had 18 good-great starts and 16 average-bad starts.
This essentially tells us that while Zambrano’s good-great starts may have been better than Snell’s good-great starts, when Zambrano had his bad starts, Snell was still having good-great ones.
As mentioned before, I used this points system to evaluate 30 National League pitchers.  I compiled a group of spreadsheets, ranking the pitchers in order in different categories to show that certain stats we rely on do a bad job of proving effectiveness.
To view all of my results, click on the links below.  You can use this data in other areas, but please credit my work.

  • To see the list of pitchers and their statistics used to assign points, click here.
  • To see the list of pitchers in order of effectiveness points, click here.

I do not want to post a ridiculously long table in this article, so you will need to look at the linked files to see the full results, but I will list the top 16 pitchers and their effectiveness points.

  1. Jake Peavy, +74
  2. Aaron Harang, +69
  3. John Smoltz, +69
  4. Brandon Webb, +67
  5. Cole Hamels, +65
  6. Brad Penny, +64
  7. Tim Hudson, +63
  8. Ted Lilly, +60
  9. Matt Cain, +52
  10. Roy Oswalt, +50
  11. Ian Snell, +48
  12. Bronson Arroyo, +47
  13. Derek Lowe, +47
  14. Greg Maddux, +45
  15. Adam Wainwright, +45
  16. Jeff Francis, +45

And, again, these points were assigned based on how strongly each statistic correlates with effectiveness.  The points system essentially covers the statistics and averages from all angles.
The most shocking part of this was how low Chris Young of the Padres came out.  Young went 9-8, with a 3.12 ERA, in 30 starts.  He should have been more effective, I thought, based on those numbers.  After looking at his game logs, though, I changed my mind and realized it made sense.
Of his 30 starts, he was essentially two different people.  In the 19 starts in which he went for 6+ innings, he was 9-1 with a 1.64 ERA, averaging 6.6 IP/gm, with a 0.85 WHIP and 129 K’s in 126.1 innings.
In the other 11 starts, he was 0-7, with a 7.14 ERA, only going 4.2 IP/gm, with a 1.76 WHIP, and 38 K to his 36 BB, in 46.2 innings.
After analyzing his situation and the points system, I realized that my effectiveness model favors consistency and low standard deviations (a measure of how far individual starts stray from the average).  To me, that truly defines effectiveness.
I would much rather have a guy who I knew would amass an AQS 67% or more of the time than a guy who might strike out 20 batters and pitch a two-hitter in one game, but give up 5 runs in 6 innings in each of the next three, before again pitching a brilliant game.
As long as the consistency is of a good nature, consistency in this model proves effectiveness.
I know, we’re finally at the end of the article, right?  I apologize for the length but it took this long to get everything across. 
Looking at Jake Peavy, the most effective NL pitcher at +74, we see that the only counted statistic in which he led was AQS.  Peavy had the most good-great starts of any NL pitcher.  While he may not have led in IP, IP/gm, K:BB ratio, or least losses (Brad Penny only had 1 legit loss), he led in consistency and being consistently good-great.
These results also show that Cole Hamels, given the 6 starts he missed due to injury, would likely have challenged Peavy for #1 in effectiveness – however, as my model dictates, the fact that he missed those 6 starts and Peavy did not shows that Peavy was more effective.
Yes, there were more stats we could add to this, and more variables to account for, but I feel this accurately levels the field of play between pitchers in distinctly different playing situations, and levels the difference between 2007 reputation and 2007 actual performance.
I must remind you before I come to a close, though, that this is only a measure of effectiveness, not the end-all solution to determining who the “best” pitchers are.
However, for this Sabermetrician, effectiveness directly correlates with quality and value.

Managers and the Pythagorean Theorem

It’s been a while since I actually played around with some Retrosheet files.  Time for some good old-fashioned research, this time on a question that has bugged Sabermetricians (and some mainstream folks) for a while.  Do Pythagorean residuals tell us much of anything about a manager?
A little bit of setup: a little while ago, I published an article here in which I mathematically showed that pretty much all of the variance in Pythagorean residuals can be explained by three factors: a bias in the formula (about which we can't really do anything), a team's average margin of victory (teams that won a lot of close games outperformed), and a team's average margin of defeat (teams that got blown out a lot also outperformed).  This wasn't anything new.  Anyone who has stared at the formula for more than five minutes could have figured that one out.  But I did find that a team's average margin of victory (calculated only in the games it wins) was largely uncorrelated (r = .2) with its average margin of defeat.  This means that in order to explain Pythagorean residuals, we need to look into why teams win (or lose) close games and why they win (or lose) blowouts.  It's been said that it's the manager who makes the difference in a close game, presumably because he's the one "pushing the buttons."  So, a good manager would be good at winning one-run games, which would translate into a bump in his team's Pythagorean residuals.  Right?
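For reference, a minimal sketch of the quantities involved; the exponent is a parameter (the classic formula uses 2, and I use 1.82 and Pythagenpat variants elsewhere):

```python
def pythag_winpct(rs, ra, exponent=2.0):
    """Pythagorean estimate of winning percentage from runs
    scored (rs) and runs allowed (ra)."""
    return rs**exponent / (rs**exponent + ra**exponent)

def pythag_residual(wins, games, rs, ra, exponent=2.0):
    """Actual W% minus estimated W%; positive means the team
    outperformed its run differential."""
    return wins / games - pythag_winpct(rs, ra, exponent)
```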

Poor Matt Cain

Matt Cain was flat-out awesome in 2007.  He pitched 200 innings, surrendering only 173 hits.  He also struck out 163 batters and posted a 3.65 ERA.  In fact, among pitchers with 200 or more innings, the only one who gave up fewer hits than Cain was Cy Young winner Jake Peavy.  And Cain's 3.65 ERA placed him tenth best in the entire National League.  Reread this paragraph and let the numbers sink in.
Matt Cain’s record in 2007 was 7-16.  7 wins and 16 losses!  Yes, that is correct!
Brad Penny, in 208 innings, gave up 200 hits and struck out only 135. And do you want to know his record?  16-4!
If that is not a clear-cut indicator of how deceiving win-loss records can be, nothing is.  We are going to take a microscopic look at Matt Cain's 2007 campaign.  Afterwards, try to argue that his 7-16 season was not better than that of any NL pitcher not named Peavy or Webb.
The number of hits a pitcher surrenders is an oft-overlooked statistic, because most people want to know about WHIP (walks plus hits per IP).  WHIPs usually average around the 1.4 mark.  Even if we look at Cain's WHIP, something I wanted to avoid because I hate the statistic in this instance, it was 1.25.
I do not hate the WHIP statistic overall, but in a case like Cain's, just examining the hits surrendered really shows how unhittable he was.  Batters may have reached base because of the mistakes he made in walking them, but the fact that he allowed so few hits over the innings he pitched shows that hitters truly had a very tough time with him.
Think about this… in 32 starts, Matt Cain went 7+ innings fifteen times and allowed four hits or fewer ten times.
In April alone, he pitched 35 innings over 5 starts and gave up only 12 hits.  12!!  In 35 innings!  That is roughly one hit for every three innings pitched.  And since he averaged seven innings per start that month, it means that in April he gave up an average of about 2.4 hits per game.
More oft-overlooked numbers are quality starts, tough losses, cheap wins, and blown wins.  You might already know about quality starts and can probably figure out what a blown win is, but tough losses and cheap wins are great devices for determining what a pitcher's TRUE win-loss record should be.
A quality start refers to when a pitcher goes at least six innings and gives up no more than three earned runs.  Quality starts are a useful number because they let us know how often a starting pitcher put his team in a position to win the game.  Since most teams average 4+ runs per game, if a pitcher gives up three or fewer, his team should win the game barring unforeseen circumstances.
Cain twirled 22 quality starts out of a possible 32.  That is a quality start percentage of 69%, meaning that when Matt Cain pitched, the Giants knew roughly 70% of the time that they would be kept in the game.  Nobody had higher than an 80% quality start percentage.
To put those numbers in perspective, the only NL pitchers with a higher quality start percentage were Jake Peavy, Tim Hudson, Brad Penny, and John Smoltz.  That puts a 7-16 pitcher 5th in quality start percentage, meaning only FOUR other pitchers in the NL gave their team a better chance to win.
Next, we have tough losses.  A tough loss refers to when a pitcher makes a great start, or quality start, and ends up getting a losing decision.  As I mentioned before, since most teams average over four runs a game (the Giants averaged 4.2 in 2007), a pitcher who gives up only three runs or less should end up winning.
For Cain, that was not the case.  Of his 16 losses, 9 were tough losses.  Nine times Cain gave up 3 or fewer runs while pitching 6 or more innings, and LOST.  Nine times!
In his 32 starts, the Giants provided him with only 3.3 runs per game.  And if you subtract the two big blowouts, in which they scored 15 and 9 runs, in the other 30 starts they provided him a grand total of 84 runs… or 2.8 runs per game, almost a run and a half fewer per game than their season average.
This would mean that, for Cain to win, he had to give up 2 or fewer runs.  He did exactly that in 18 of his 32 starts, yet amassed only a 5-7 record (with 6 no-decisions) in that span.
This primarily occurred because even when Cain adapted to the lack of run support and kept the other team from scoring more than two runs, the Giants also adapted and forgot how to score.  The Giants were shut out four times during Cain starts, scored only one run in another four starts, and only two runs in another five.  That adds up to thirteen starts in which the Giants gave Cain a maximum of two runs.
Cheap wins refer to when a pitcher does not pitch very well but walks away with a win (SEE: ERIC MILTON).  Of Cain’s 7 wins, 0 were cheap.  All were legitimate wins.
Blown wins refer to when a pitcher leaves the game with a lead but does not get a decision because the bullpen blows the game.  I do not like to count these as regular no-decisions, because extenuating circumstances prevented the pitcher from getting a decision.  The only no-decisions I count as legitimate are when a pitcher leaves a game tied or trailing and his team comes back to tie or win the game after he has left.
Of Cain’s 9 ND’s, 5 were blown wins.
I threw a lot of numbers at you.  Let’s summarize everything and let it sink in. 

  • Matt Cain pitched 22 quality starts out of 32 starts, a percentage topped only by four other guys. 
  • He gave up 2 or fewer runs in 18 different starts but somehow went only 5-7 during that span. 
  • Nine of his sixteen losses were games he should have won, having pitched a quality start or better.
  • Outside of two blowout games, the Giants gave him only 2.8 runs of offense per start.
  • He lost 5 wins thanks to the bullpen.

This is where we will put the numbers to use and generate what Matt Cain’s true 2007 season looked like, since it sure as heck was not a 7-16 season.
Of his 32 starts, 4 were legitimate no-decisions.  He had 9 total NDs, but as mentioned before, I only count games in which a pitcher left while tied or trailing as legitimate NDs.  That means he should have received 28 decisions this year, win or loss.  Of his 7 wins, none were cheap, so all were legitimate and remain counted.  Clearly he could not win cheap, because the Giants never scored for him.
Of his 16 losses, nine were tough losses, and seven were legitimate losses.  I'm not trying to make the guy look like a superhero – there were times (seven, to be exact) when he really deserved to lose due to poor pitching or just not being on top of his game.
The starting pitchers of the 2007 Giants, other than Cain, had a quality start-win percentage of 76%, meaning that when those pitchers took the mound and pitched a quality game, they won three out of four times.  If we apply that rate to Cain, he should have won 6 or 7 of those 9 tough losses.  If you apply his own quality start-win percentage of 33%, he should have won 2 of the 5 blown wins.
So, he has 7 legit losses.  Add the two tough losses that remain losses (we are counting 7 of his 9 tough losses as wins) and he has 9 losses.  Add the 3 no-decisions from the blown wins we are not counting as wins, and he has 9 losses out of 25 possible decisions.
Now, add up his 7 legit wins, the 2 blown wins, and the 7 tough losses being counted as wins, and you get what Cain's 2007 numbers REALLY are – 16-9, 3.65.
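To make that bookkeeping explicit, a quick sanity-check sketch (the reallocation counts are the ones argued for above):

```python
legit_wins, legit_losses = 7, 7
tough_losses = 9   # 7 reallocated to wins, 2 remain losses
blown_wins = 5     # 2 reallocated to wins, 3 become no-decisions
legit_nds = 4

wins = legit_wins + 7 + 2         # 16
losses = legit_losses + 2         # 9
nds = legit_nds + 3               # 7
assert wins + losses + nds == 32  # every Cain start accounted for
print(wins, losses)               # 16 9
```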
16-9, with a 3.65 ERA is good enough for the top ten in Cy Young voting.  I cannot imagine many voters felt comfortable giving votes to a guy with a 7-16 record.
Point blank, the point is that win-loss records are often useless when determining the value or efforts of a pitcher, unless said pitcher shows consistency with it (SEE: GREG MADDUX). 
Look at the 2005 Cy Young Award situation.  Johan Santana went 16-7 with a 2.87 ERA, and led MLB with 238 strikeouts.  Bartolo Colon went 21-8 with a 3.48 ERA, with 157 strikeouts.  Colon won the award.  Voters HAD to have seen his win-loss record and voted accordingly, which makes no sense, since Santana clearly had the better season and would have had a better record if the Twins performed better.
Matt Cain should be thought of as one of the ten or fifteen best pitchers in the National League, 7-16 record or not.  He cannot, and should not, be blamed for his team refusing to score runs when he pitched brilliantly.
In fact, you could honestly make the case that the only NL pitchers who posted better overall seasons were Peavy and Webb, and maybe Hamels.
Seems odd to say that a 7-16 pitcher was potentially the third or fourth best in the whole league, but that is because the Win-Loss record has, for some unjustified reason, become the barometer for measuring the value of a pitcher.
It is just a shame for a guy like Matt Cain, who is not earning the big bucks yet, and pitching leagues better than guys making 15-20 times his salary. 
Win-loss records should not be put on a pedestal anymore unless the stats justify them.  Peavy truly was 19-6 this year.  Beckett truly was 20-7 this year.  Anthony Reyes truly was 2-14 this year.
Matt Cain was not truly 7-16 this year.

Still more Pythagorean musings

Things continue to get interesting on the SABR Statistical Analysis chatlist on the issue of those pesky Pythagorean over-achievers.  No less a luminary than the creator of the formula itself, Bill James, has come up with a little study of his own on the subject of whether teams that under-achieve one year are more likely to under-achieve the next year (and whether over-achievers will over-achieve the next year).
In it, he takes the top 100 over-achievers and the top 100 under-achievers of all time (using the Smyth/Patriot/Pythagenpat formula).  He finds that the top 100 over-achievers continued to over-achieve in the following year, although their level of over-achievement dropped from an average of 8.3 wins to an average of 0.47 wins.  The under-achievers likewise continued to under-achieve on average, falling from 8.68 wins to 0.24 wins.  He comes to the conclusion that while the effect isn't zero, it must be pretty small.  (He also runs a matched-groups design in Parts III and IV of his paper that made me scratch my head.)
What Bill is describing in his paper is a regression to the mean effect that doesn’t quite regress all the way to the mean.  Let me take a look at this using a slightly different and more complete method.  I took the database of all teams from 1901-2005 and calculated their actual and Pythagenpat winning percentages, plus the Pythagenpat residuals.  I did the same for the following year for each team and matched the two up.  This gave me 2084 team-seasons.  The year-to-year correlation for Pythagenpat residuals is .043.  The mean of Pythagenpat residuals is zero.
That means that, knowing nothing else, our best guess for next year’s Pythagenpat residual can be given as:
0.043 * This Year's residual + (1 – 0.043) * mean.  Since the mean is zero, that term drops out.  8.3 wins above expectation in year one would have a year-two expectation of .043 * 8.3 + (1 – .043) * 0.  The answer is .3569.  Bill found that the actual teams checked in at 0.47 the next year.  (And if I had twenty minutes of your time, I'd explain why what I just did was playing really fast and loose with some rules of math to get that number… Suffice it to say, it's good enough for the situation.)  The 100 under-achievers would have a year-two expectation of -.3724.  Bill got -.24.  So, no, the effect size is not zero, just like the chances that I will be hit by a bus today on my walk to work are not zero.  But, since I generally look both ways before crossing the street, those chances aren't anything to worry about.  I wouldn't worry about this effect either.
Bill also brings up another interesting question posed by Mike Emeigh as to which was the better predictor of next year’s actual winning percentage for a team: their current year’s actual winning percentage or their current year’s Pythagorean projection.  Since I had the data set sitting in front of me, it seemed a shame not to ask the question.
Correlation between Year 2’s Actual Winning Percentage and:
Year 1’s Actual Winning Percentage = .603
Year 1’s Pythagenpat Winning Percentage = .626
I even ran Cohen's test for specificity of correlated outcomes, and Pythagenpat really is significantly better (t = 4.35, for the curious) at predicting next year's record.  Not by a lot, but it's the better bet.  Still, score another one for Pythagoras.

The triumph of Pythagoras

On the SABR Statistical Analysis Listserv, there's been a great deal of chatter concerning the good old Pythagorean win estimator.  This year, as seems to happen every year, most teams finished around their estimates.  But there always seems to be one oddity, and this year it's the Arizona Diamondbacks.  The Diamondbacks were outscored this year (712 runs scored, 732 allowed) and had a Pythagorean expectation of around 79 wins, depending on exactly which formula you use.  They won 90 games, good for the best record in the NL.  Huh?
So, are the Arizona Diamondbacks a sub .500 team, like their Pythagorean projection says or are they a 90 win team like their… ummm… actual record says?  It’s an interesting question.  When trying to figure out how “good” a team is, which should we look at?  This is a topic which has been taken up before by Chris Jaffe, specifically with reference to the Diamondbacks, and more theoretically a few years ago by Dan Fox.  Dan found that early in the season, if you want to know what a team’s season-ending winning percentage will be, you’re best to look at their Pythagorean record.  That is, until about 100 games in, when the team’s actual record becomes the better predictor of their season ending record.  (By the end of the year, actual record is a perfect predictor of season-ending actual record.)  But which one better predicts what a team will do in its future games?
In July of this year, Joe Sheehan of Baseball Prospectus made the assertion that “Run differential is a key measure of team quality, and a better predictor of future performance than win-loss record.”  Well now, sounds like something we can test.  I took the Retrosheet Game Logs from 1980-2006.  (666, no kidding, team-seasons)  I took each team’s games in sequence.  After each game, I calculated the team’s actual winning percentage, as of that moment, as well as their Pythagorean projection as of that moment.  So, if a team is 10-10 after 20 games and had scored 93 runs while giving up 91, I ran the numbers.  (Methodological note: I used the David Smyth/Patriot formula and the standard formula with a 1.82 exponent, although they were pretty indistinguishable, so I just reported the Smyth formula)  Then, I calculated the team’s actual winning percentage over the rest of the season.  So, if that team went 72-70 over the last 142 games, I calculated those numbers.  I ran the numbers 162 times, one for each game of the year.  Which of the first two (current actual win percentage or current Pythagorean projection) was a better predictor of performance over the rest of the season from that point forward?
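For anyone who wants to replicate this, here is a sketch of the procedure.  The Pythagenpat exponent constant of 0.287 is the commonly cited Smyth/Patriot value, and the per-game data structure here is hypothetical – the real work was done against the Retrosheet game logs:

```python
import numpy as np

def pythagenpat_winpct(rs, ra, games):
    """Smyth/Patriot Pythagenpat: the exponent floats with the run
    environment (0.287 is the commonly cited constant)."""
    x = ((rs + ra) / games) ** 0.287
    return rs**x / (rs**x + ra**x)

def split_correlations(team_seasons, g):
    """team_seasons: one list per team-season of per-game tuples
    (won, runs_scored, runs_allowed). After game g, correlate each
    predictor with rest-of-season winning percentage."""
    actual, pythag, rest = [], [], []
    for season in team_seasons:
        first, second = season[:g], season[g:]
        if len(first) < g or not second:
            continue
        wins = sum(w for w, _, _ in first)
        rs = sum(s for _, s, _ in first)
        ra = sum(a for _, _, a in first)
        actual.append(wins / g)
        pythag.append(pythagenpat_winpct(rs, ra, g))
        rest.append(sum(w for w, _, _ in second) / len(second))
    return (np.corrcoef(actual, rest)[0, 1],
            np.corrcoef(pythag, rest)[0, 1])
```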
Want to see a pretty graph?
The graph shows correlation coefficients of the two methods to performance the rest of the way.  Coefficients are low at the beginning of the season because after game one, everyone’s either got a winning percentage of 1.000 or .000, and that’s not going to correlate well with much of anything.  At the end of the year, there’s the same problem in the opposite direction.  Focus on the middle part of the graph, where the sample sizes in both halves are roughly equivalent.  That’s where the story is.  You’ll see that the green line, representing the Pythagorean projection (using the Smyth method, although the 1.82 method had the same pattern) at that particular moment is consistently above actual winning percentage.  At the exact midpoint of the season (81 games), Pythagorean projection correlates with winning percentage the rest of the way at .494, while actual winning percentage has a correlation of .464.
(Side note: The weird jump around game 110 is because of the 1981 and 1994 seasons.  Teams played a little less than 110 games in those years, which led to some funky data in those years… just enough to cause a little blip in the data.)
In terms of predictive power, run differential really is the more important information to know when it comes to predicting the future.  What’s the deal with the Diamondbacks?  Well, for what it’s worth, the correlation between Pythagorean projection and future performance at 81 games is .5, which isn’t bad, but it isn’t all that great.  In fact .5 is possibly the most infuriating correlation coefficient out there.  .5 means that about 25% of the variance is explainable by whatever factor you’re using as a predictor.  25% is a quarter of the variance!  But 25% is only a quarter of the variance.  As the season wears on, the gap between Pythagorean and actual win percentage narrows, until they become roughly the same around game 150 or so, where the correlations are around .35.  The thing is that at game 150, the sample size for the “rest of the season” is only 12 games, and by that point, Pythagorean projection and actual winning percentage are usually mirroring one another.
But, there’s evidence here that a team is better described, over the long run, by their run differential than their actual record.  This will certainly come as great news to fans of the Padres and Braves, who finished with the 2nd and 3rd best Pythagorean win percentages in the NL this year, as they watch the Diamondbacks in the playoffs this year.