How do I become a Sabermetrician?

Occasionally, I get e-mail from someone who reads StatSpeak or some of the other writings that I sprinkle into the blogosphere, and my favorite always goes something like “I’ve read a bunch of stuff around and I’m interested in learning how to do my own Sabermetric research.  Can you help me?”  Yes, I can.  I’m a therapist by training, and do you ever need help!
So you wanna be a Sabermetrician, eh?  Well, first you should know that there’s no school for Sabermetrics (well, there is a class out there…)  We’re all self-taught in one way or another, mostly in the form of guys using skills from their day jobs to study baseball.  It’s part of the charm of the field.  Most of us have respectable day jobs and we use this just to pass the time.  Just about anyone can get themselves a free blog and start posting their work.  That’s how I started out.  So if you want to be a Sabermetrician, then by the power vested in me by no one in particular and the state of confusion, I now pronounce you an official Sabermetrician.  The certificate’s in the mail.
Now of course, you don’t want to be just any Sabermetrician.  You want to be one of those cool guys that actually gets hired by an MLB team someday.  You want to publish a book.  You want to be the next big thing.  I suppose I’m not any of those things either, but I can give you a few tips on how to get started.

  1. I can’t stress this enough.  There are far too many junk stats out there.  A junk stat goes something like this.  “I just came up with the formula HR x 15 + RBI x 7 + HBP x 4.5 + SLG x 90 based on how important I thought each one was”  I’ve heard that particular reasoning far too many times.  There are formulae that look like that, but they are developed using a very specific process.  I’ve seen several cases of someone posting one of those, being ignored, and then disappearing never to be heard from again.  I’m guessing that they were frustrated that no one saw their brilliance.  Don’t start with a junk stat and be frustrated.  There is good work to be done and you might be the one who can do it.  Read on.
  2. Spend a few months reading Sabermetric work.  There are plenty of good sites out there.  We all link to each other.  Read their stuff.  Read the comments.  Read Baseball Between the Numbers.  (When you get advanced enough, read The Book: Playing the Percentages in Baseball.)  Go over to the Baseball Fever boards and read the discussions that go on over there.  Participate.
  3. One of the things that can frustrate newcomers is the thought that their brilliant ideas that came to them in the middle of the night… have already been studied by someone else.  We’ve all done studies on the illusion of clutch and why RBIs are a bad stat (and bad grammar).  They’ve been studied to death… unless you can take a little more nuanced look at things.  And to do that, you’ll need a good understanding of what research has come before you.  Probably the biggest mistake that people make is to try to jump into Sabermetrics with both feet, not really knowing what they’re doing.  Slowly, my friend.  Slowly.
  4. You’ve probably already read Moneyball, which should give you a broader idea of what’s going on.  We are not in the business of making baseball more “pure” or more enjoyable or more special or more cosmic or more whatever.  (Do watch Field of Dreams, because it’s a good movie… but understand that’s not what we do here.)  Sabermetrics is the scientific method applied to the goal of winning a baseball game/championship.  I’ll type that again.  Sabermetrics is the scientific method applied to the goal of winning a baseball game/championship.  May I recommend that you have some background in the scientific method before you begin.  I’m not saying that you need to be a Ph.D. level physicist, but simply that you need to understand how science works.  Yes, we spend a lot of time debunking some sacred conventional wisdom.  Be prepared to have some of your basic beliefs about baseball challenged.
  5. It’s good to be a fan.  In fact, I recommend that you watch/listen to/go to as many baseball games as you can.  It’s OK to have a favorite team and to occasionally be irrational in evaluating them, because you love them.  Ask me about growing up with the Cleveland Indians some time.  But, with that said, understand that science is a dispassionate process.  We go into a situation not looking to confirm that so-and-so is the best player in baseball, but we come up with a reasonable definition of things and let the numbers fall where they may.  Sometimes that means realizing that the numbers don’t bear out what you used to think as a kid (or as a fan now).  That’s actually a lot harder to come to terms with than you might imagine.  If you can get past that, you’ll make a fine Sabermetrician.
  6. Are you in college?  (Surprise!  A lot of the guys who travel in these circles are in/barely out of college themselves!)  Sign up for a class in statistics.  Trust me on this one.  Even if you’re an English major, it’ll come in handy both in Sabermetrics and in the rest of life.  Plus, it’ll teach you a little bit of how to use some of the computer programs that Sabermetricians like to use.  And computers make life so much easier.
  7. Draw from your background.  I’m a psychologist by training.  Most of the questions that intrigue me center around “Why did he do that?”  That’s what I’ve been trained to look for in life.  You may think that your chosen field has nothing to do with baseball, but you’re wrong.  Sure, there are a lot of guys who are physics/math majors who look at algorithms for figuring out what a player will do next year, and that’s fine.  I’m personally waiting for a good Sabermetric sociologist to come along to figure out why it is that baseball teams and society in general are so poor at assigning value to baseball players.
  8. You do not need a doctorate in math.  Sure, the more analytical techniques you know, the more complicated questions you can ask.  And you do have to know some statistical/analytical techniques, but some of the biggest discoveries in Sabermetrics involve little more than knowing what a correlation is (e.g., DIPS) and are simple to the point of elegance.  The math can be taught.  The real work in Sabermetrics is perceptual and creative.  It’s in seeing the game in a slightly new way and understanding how that insight can be measured and then tested.  The rest is just an engineering problem.
  9. Keep a running idea list of things that you want to accomplish and ideas that you’ve had.  Any time I have an idea pop into my head, I put it into my special file.  When I need a project, I go back and pick one that sounds fun.  Even if you don’t know exactly how you’d do it, if an interesting question or idea occurs to you, write it down.
  10. You’ll notice that I haven’t specifically pointed you to any how-to guides.  The reason is that you’ll come across those in the process of reading through things.  And you’ll also learn what other statistical tricks that others use by osmosis.  Don’t focus so much on the actual technical details of how Pitch f/x works or what’s available from Retrosheet.  If you really get restless, download some Retrosheet files and play around with them, but you’ll probably learn naturally just by doing some reading.

World Famous StatSpeak Roundtable: June 30

OK, so I lied.  Last week, I said that there would be no roundtable this week.  Through the magic of technology, we were able to gather together a roundtable, although don’t ask exactly how that was accomplished.  It involves the fact that as this is being published, Pizza Cutter doesn’t have internet access.
Anyway, this week, in the ultimate act of nepotism, we welcome as our guest Corey Seidman from MVN’s Phillies blog Phanatic Phollow Up.  Won’t you read on as we discuss set-up guys, division leaders, and Curt Schilling?
Question #1: Of the current division leaders, which ones don’t you expect to be there at the end of the season?  Who do you expect will overtake them?
Corey Seidman: We find ourselves at the halfway point with the Red Sox, White Sox, Angels, Phillies, Cubs, and Diamondbacks in first place. I see all six of these teams winning their respective divisions.
The Red Sox have been the best team in the American League to this point, with their only criticism being their sub-.500 road record. But they haven’t been as bad as they have been unlucky on the road. They were swept in Toronto following their season opening series against the Athletics … in Tokyo. It’s hard to hold a team accountable when they’re given a day to travel from Japan to Canada and start another series. Of their 19 other road losses, 10 were one-run games. This doesn’t show that they can’t win on the road, it merely shows they have been unlucky on the road through the first half of the season.
The White Sox pitching has been great, which is why they find themselves ahead of the surprising Twins and disappointing (yet surging) Tigers. The Sox rank second to only Oakland in ERA (3.43), opponent’s OBP (.307), and WHIP (1.24.) They lead all of baseball with 49 quality starts. Their bullpen is second in reliever’s ERA and features two late-inning guys with 0.84 WHIP’s in Scott Linebrink and Matt Thornton, as well as Bobby Jenks.
Unfortunately, their two most heralded run producers are having the worst seasons of their career, in the same year. Paul Konerko has a .368 slugging percentage (career .490), and Jim Thome has driven in only 38 runs in 73 games. Thome is on pace for 81 RBI, his lowest total in a full season since 1995. Despite Konerko’s and Thome’s struggles, the White Sox are still the best team in the A.L. Central. Carlos Quentin, A.J. Pierzynski and Joe Crede have held them together offensively, and let’s face it, Konerko and Thome couldn’t be any worse in the second half than they were in the first.
The Angels are the best team in the A.L. West. Their 3.5 game lead and 4-3 record against the second place A’s don’t show their dominance, but don’t expect the A’s to continue their winning ways much longer. They’ve pitched out of their minds, and we’re one Rich Harden pulled muscle and one Justin Duchscherer look in the mirror away from seeing them fall fast. The Angels have the best starting staff in baseball from 1-5 with the emergence of Joe Saunders and the return of Ervin Santana. Francisco Rodriguez is on pace to set the single-season record in saves, Scot Shields continues to look like he could close for any other team in baseball, and the back end of their bullpen has only improved this year with the addition of Jose Arredondo (1.40 ERA, 0.72 WHIP, 19 K in 19.1 IP). Add in a collection of little speedy guys (Figgins, Izturis, Aybar, Kendrick), good defense (Hunter and Matthews Jr.) and a slugger returning to form (Vlad), and you’ve got a team that makes the playoffs every year.
For the wild card, I expect the Yankees to make a late push as they have done in recent years to overtake the Rays. Right now, the Rays look like an unstoppable team, but they just strike me as being a year or so away from seriously competing. I could see them winning it but could also see them having a bad September and letting the Yankees slip past, then have a disappointing season in 2009 that leads everyone to say this year was a fluke, before making the playoffs in 2010. Either could happen but neither would surprise me.
The Phillies are the best team in the N.L. East, and will win it, barring a catastrophic injury (Utley or Hamels). They are considerably younger and healthier than the Braves and Mets, and haven’t had nearly as many different lineups as the other two have had. The Marlins were a young team that overachieved for two months and are coming back to reality now. They don’t have the pitching to continue. Tell me all you want about Josh Johnson coming back, but I see a starting staff whose best piece is Scott Olsen and his 4.89 K/9. Andrew Miller is struggling, Mark Hendrickson looks like this year’s Adam Eaton, and only Ricky Nolasco is picking it up lately. The Phils have the 3rd most quality starts in the N.L., the best bullpen ERA in baseball, and a lineup that is finally breaking out of a 10 game slump. Ryan Howard has struggled all season, yet still leads the N.L. with 67 RBI. Imagine if he was hitting .250 instead of .215. He’d have closer to 80.
The Cubs had been awesome all season, but have struggled lately. Regardless, they are the Red Sox of the N.L. this year. They are the best team, have a ridiculous home record of 33-10, and are below .500 on the road. They lead baseball with 442 runs scored, are 4th in the N.L. in runs allowed, and their Pythagorean W/L is a game better than they are. Offensively, they have done it through periods without Alfonso Soriano. Probably because they have 7 regulars hitting above .280. They’ll have home field advantage.
The Diamondbacks will win it because they are in the worst division in baseball. The N.L. West was extremely tight last year, but the Padres and Rockies forgot how to win this season. The Dodgers aren’t good enough to overtake the D-Backs or they already would have. The Diamondbacks have been scuffling for a while and still haven’t lost much ground. The advantage in pitching goes to Arizona and their two aces, as does the division. They aren’t anything spectacular offensively, and Eric Byrnes might have only hustled and gritted his way to a big contract, but nobody else in the West is good enough.
The wildcard will go to the Cardinals here. The Brewers are making a push, but they have shown us over the last season and a half that they are a streaky team. The Cards had been getting it done without Albert Pujols, and despite the numbers suggesting Ryan Ludwick can’t keep this up, he likely won’t need to for the Cards to win the wildcard. (Check who leads the Cardinals in ERA. You won’t regret it.)
Eric Seidman: So, right now we’re looking at the Phillies, Cubs, and Diamondbacks in the National League. If forced to bet money it would be put on all three of these teams winning their division. As a Phillies fan I am still not sold on the division being as easy as it has been; easy as in, the Phillies lose 8 of 11 games and gain ground. I just have a funny, non-saber feeling, that if the Mets or Braves sweep them in an upcoming three-game series, it could rejuvenate their season and propel them toward some relative success.
I don’t see the Cubs dropping off though keep in mind the Cardinals have Wainwright, Carpenter, Mulder, and Clement on the DL. Who knows if any of them will come back and/or be successful, but it is a possibility. Ultimately, though, I really don’t see them posing a significant threat to the Cubs (in the regular season).
Out west, the DBacks should win the division fairly easily but we all saw last year how an insane winning streak at the end of a season can come out of nowhere and potentially skyrocket a team toward the top of the division. Without Rafael Furcal the Dodgers, essentially, have an ugly offense, even going hitless last night (yet still winning!). So, in the NL I will pick the three current winners though if I have to pick a team to potentially overtake the leaders I will go with Mets, Cards, Dodgers.
In the AL, I see the Red Sox, White Sox, and Angels winning their divisions. The Tigers have been on fire lately and the Athletics have performed well this year, too. Oh, and the Rays! And the Yankees! And the Orioles are 41-38! Okay, I’ll calm down a little. I’ll take Red Sox winning the division with the Rays winning the Wild Card and the Yankees finishing 1-2 games behind the Rays. I’m going to take the White Sox to win the Central, and the Angels to, very soon, separate themselves from the As.
Pizza Cutter: As I write this, the AL division leaders are Boston, the White Sox (by half a game over Minnesota), and the Angels.  I think all three are vulnerable.  I’ve sung the praises of Tampa Bay previously, although that one might just be hope on my part.  Boston’s still the better team, but weird things happen in baseball.  The White Sox will win the Central.  If Minnesota is actually leading the division by Monday morning, put them in as my pick to be de-throned.  The Angels are a few games up on the A’s, but the A’s have the far better run differential.  And the A’s will probably make a few moves at the trading deadline.  This could turn into a matter of who adds more at the trading deadline.  In the NL, on the other hand, I don’t see anyone moving up over Philly or the Cubs.  The NL West doesn’t matter because everyone in the division is slouching toward mediocrity.  It’ll probably be Arizona… but that’s only because someone has to win it.

Does Movement Influence BABIP?

A couple weeks back, Pizza Cutter found an interesting oddity in that Troy Percival had consistently posted very, very low BABIPs. In response, Dave Studeman brought up Mariano Rivera–another pitcher with consistently low BABIPs–and how it has been somewhat proven that elite relievers can register atypical results with this statistic. Mentioned on a few other sites was the idea that movement may be a central cause for these lower batting averages on balls in play; due to said movement, the sweet part of the bat would fail to meet the ball as it normally would on more “standard” pitches.
Last week, we explored the relationship between fastballs 92+ mph and BABIP, examining how it differed at each mile per hour interval. 92 mph to 96 mph clocked in between .290-.310–the established general range of BABIP for pitchers–before dipping to .273 at 97 mph and shooting back up to .293 for all thrown 98 mph or higher. The 97 and 98+ groups were too small for the dip to be called significant at the 5% level; we would need around 1,650 balls in play and, combined, had only 1,032. Still, the combo of 97 and 98+ offered a .279 BABIP, perhaps suggesting that the .293 at 98+ was the anomaly, not the .273.
Today we will look at the movement within the same 92+ mph range in order to attempt to answer the question posed in the title. First, though, a pre-requisite of sorts with regards to movement: the relationship between the horizontal and vertical components is not yet well understood, other than some telltale signs that aid in the classification of pitches. For instance, a two-seam fastball will have much higher horizontal movement than vertical movement; four-seam fastballs, however, generally have lower horizontal movement and higher vertical movement.
I queried my database for all fastballs 92+ mph and separated the results into groups by movement rather than velocity intervals. The signs (+-) were reversed so that righties and lefties could be grouped together as well. First, here is a sample size grid of sorts, showing all balls in play for each horizontal group and vertical subgroup; note that the subgroups differ for each horizontal movement grouping so they will be called simply below average or above average as they were essentially determined by the average or a similar type of cutoff point. The reasoning for this is the aforementioned relationship between movement components; for fastballs, lower horizontal movement will usually correlate with higher vertical movement with the inverse also being true.

Horizontal | Below Vert BIP | Above Vert BIP
0-4 in     | 3,735          | 2,456
4-8 in     | 6,823          | 4,718
8-12 in    | 4,355          | 3,227
12+ in     | 408            | 335

BABIP takes a while to stabilize, moreso than many other statistics, so I wanted to have at least 2,000 balls in play for each sub-grouping, preferably more. From 0-12 inches of horizontal movement we have large enough samples to notice discrepancies. Greater than 12 inches, however, offers just 743 balls in play. While I definitely plan to explore this and the velocity articles later in the year when more data is available, for now, I am going to exclude the group with more than 12 horizontal inches.
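
The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the sample values, and the function names are hypothetical rather than taken from the actual database; the magnitude of horizontal movement stands in for the sign reversal so that righties and lefties share bins; and the vertical cutoff is taken as the plain average of all pitches in each horizontal group, which is just one of the cutoff choices the text allows for.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (throws, horizontal_in, vertical_in, in_play, hit).
# "in_play" is assumed to already exclude home runs, as BABIP does.
pitches = [
    ("R", -3.1, 10.2, True, False),
    ("R", -2.5, 7.9, True, True),
    ("L", 3.4, 11.0, True, False),
    ("L", 2.0, 8.1, False, False),
    ("R", -6.7, 9.5, True, True),
    ("L", 5.2, 6.3, True, False),
]

def horizontal_bin(inches):
    """Label the horizontal-movement group for a movement magnitude."""
    if inches < 4:
        return "0-4"
    if inches < 8:
        return "4-8"
    if inches < 12:
        return "8-12"
    return "12+"

def summarize(records):
    """Return {(horizontal bin, 'below'/'above'): (balls in play, BABIP)}."""
    by_bin = defaultdict(list)
    for _throws, horiz, vert, in_play, hit in records:
        # Magnitude only, so left- and right-handers land in the same bins.
        by_bin[horizontal_bin(abs(horiz))].append((vert, in_play, hit))

    summary = {}
    for label, rows in by_bin.items():
        cutoff = mean(vert for vert, _, _ in rows)  # per-group vertical cutoff
        for side in ("below", "above"):
            hits = [
                hit
                for vert, in_play, hit in rows
                if in_play and (vert >= cutoff) == (side == "above")
            ]
            babip = sum(hits) / len(hits) if hits else None
            summary[(label, side)] = (len(hits), babip)
    return summary

result = summarize(pitches)
```

With real data, the first element of each tuple is the figure to check against the 2,000 balls-in-play floor before reading anything into the BABIP beside it.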
Looking at the other three groups and their two subgroupings each, here are the Ball%, Strike%, HR%, and BABIP:

Horiz. | Vert. | B%   | K%   | HR%  | BABIP
0-4    | Below | 35.9 | 45.6 | 0.53 | .289
0-4    | Above | 34.9 | 49.8 | 0.48 | .286
4-8    | Below | 35.8 | 43.7 | 0.64 | .302
4-8    | Above | 35.8 | 48.2 | 0.58 | .292
8-12   | Below | 35.6 | 41.4 | 0.54 | .315
8-12   | Above | 36.5 | 45.6 | 0.58 | .298

The percentage of balls essentially stays in the same general range while the strikes fluctuate. The subgroupings with above average vertical movement have noticeably higher strike percentages than the others. So, before we even get to BABIP, it seems that higher vertical movement in these larger groups results in a higher percentage of strikes.

The BABIPs for horizontal movement groups with below average vertical movement register: .289, .302, and .315. The BABIPs for horizontal movement groups with above average vertical movement clock in at: .286, .292, and .298. Judging from these results it would appear that, yes, movement does have some type of effect on BABIP. Each horizontal group posted a higher BABIP when it had below average vertical movement, and did so at every interval: .289 to .286, .302 to .292, and .315 to .298. Additionally, all pitches 92+ mph with 0-4 inches of horizontal movement, regardless of whether they fell above or below the vertical cutoff point, produced a BABIP lower than .290, which is generally the lower edge of the .290-.310 range we expect it to fall into.

Tomorrow I’ll come right back with the total number of unique pitchers and those comprising at least 1% and at least 5% of the sample, in order to see if the results are skewed in any way. For now, though, it appears that, regardless of your horizontal movement, having above average vertical movement will produce a lower BABIP at each horizontal interval.

Vindicating Derek Jeter's fielding at short (sorta)

Introducing OPA!  OPA! is my new (still in the works) fielding system for use with Retrosheet, one that I’ve been meaning to create for a while now.  Last week, I teased the beginnings of OPA!, at least the ground ball part.  This week, a more full exploration of ways in which we can rate infield play without the benefit of knowing where the ball went.
First, the framework.  You may be wondering what OPA! stands for.  Other than my goal of making it the most festively-named fielding system out there (next time you go to a Greek wedding, they won’t be shouting UZR! or FRAA!), OPA! is short for OPAAA, or out probability added above average.  Consider a ground ball.  Any ground ball will do.  The infielder’s job is to turn it into an out.  He can either succeed or fail at this job, but several things must happen in order for him to succeed.  He must have:

  • Good range: he has to get himself and his glove in the neighborhood of the ball
  • Good hands: he has to actually get the ball into his glove
  • Good arm: he has to then throw the ball to first (or second?) and put it somewhere in the neighborhood of the first baseman’s glove
  • The first baseman has to catch the ball

All of these things must happen in order for a ground ball to become a ground out.  One of the major problems that I see with some of the major fielding systems is that they treat all of these as one giant package.  Either the play was made or it was not.  Sure, the point of the game is to make the play, but let’s think about the following situations.  A ground ball to short where the SS gets to the ball, fields it cleanly, makes a throw right to the first baseman… who drops the ball.  Sure, the 1B will pick up an error for his efforts, but the play not being completed, the SS gets no credit when he did everything right!
One of the things that spawned the new generation of fielding stats was an understanding that fielding percentage, indeed, the entire concept of an “error” was flawed.  An error means that the fielder did something right, namely that he got to the ball.  Yes, he booted it, but we don’t have a debit for those guys who are too slow to even get to the ball to begin with.  So, an error actually penalizes one of the skills that you hope a player has.  But, the type of error given (fielding, throwing) does tell us where things went wrong.  It’s time to develop that line of logic more fully.
The average ground ball to somewhere on the third base side of the infield has an X% chance on average of being turned into an out.  We can play with the parameters around pitcher handedness and batter handedness and if I had more detailed data, hit location, but there will be some number that emerges.  The very act of the fielder ranging to the ball and at least stopping it from going to the outfield adds some additional percentage chance that the ball will become an out.  Letting the ball through destroys what chance there was to make an out.  (I’m sure most of you have figured out by this point, but if anyone’s still lagging, I’m basing this model on the idea of WPA.)  If the third baseman makes the play, we ought to credit him with the out probability he adds based on his range.  If the ball goes through to left field, we should assign the 3B some blame, along with the shortstop.  How to chop up that blame was neatly explored last week. 
But, now let’s take a look at what happens if the third baseman gets to the ball (range), but boots it (hands).  He’ll be charged with a fielding error, and the out probability that he built up by getting to the ball is now gone.  To more accurately reflect what happened though, we can put his range OPA in the “range” basket and debit his “hands” basket.  (And if the first baseman drops the ball, we can debit his “hands” basket, while leaving the third baseman’s contributions alone.)  Now, we have a much more fine-grained idea of where a player’s strengths and weaknesses are. 
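
As a rough sketch of the bookkeeping just described, the play can be treated as a chain of stages, each with its own out probability, and each change can be dropped into the matching basket. Every number below is invented for illustration, and the function and basket names are hypothetical; the sketch also records a ball that gets through as a single range debit and does not attempt to split blame between neighboring fielders.

```python
# Hypothetical out probabilities at each stage of a ground ball.
# These values are made up; they are not results from the OPA! system.
P_BASELINE = 0.40  # average grounder to this side, before anyone reaches it
P_REACHED = 0.85   # the fielder has ranged to the ball
P_FIELDED = 0.95   # the ball is cleanly in his glove
P_THROWN = 0.98    # the throw is on target

def opa_baskets(reached, fielded=False, thrown=False, caught=False):
    """Split one play's change in out probability into skill baskets."""
    baskets = {"range": 0.0, "hands": 0.0, "arm": 0.0, "receiver_hands": 0.0}
    if not reached:
        # Ball gets through: whatever chance existed is destroyed.
        baskets["range"] = -P_BASELINE
        return baskets
    baskets["range"] = P_REACHED - P_BASELINE
    if not fielded:
        # Booted: the range credit stays, the hands basket takes the debit.
        baskets["hands"] = -P_REACHED
        return baskets
    baskets["hands"] = P_FIELDED - P_REACHED
    if not thrown:
        baskets["arm"] = -P_FIELDED
        return baskets
    baskets["arm"] = P_THROWN - P_FIELDED
    # The receiver either finishes the out or erases the chance handed to him.
    baskets["receiver_hands"] = (1.0 if caught else 0.0) - P_THROWN
    return baskets

booted = opa_baskets(reached=True)
dropped_at_first = opa_baskets(reached=True, fielded=True, thrown=True)
completed = opa_baskets(reached=True, fielded=True, thrown=True, caught=True)
```

In each case the baskets sum to the play's total change from the baseline, so the fine-grained view never disagrees with the simple made-it-or-didn't view; it only says where the credit and blame belong.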
That’s the theory.  For the numerical spaghetti and some 2007 results (including a few things about Jeter), keep reading.

World Famous StatSpeak Roundtable: June 23

This week’s roundtable will have to last you two weeks.  Next week (6/30), we will sadly have to interrupt our usual Roundtable service due to the fact that the table and everything else that I own will be in a moving van working its way across a couple of states.  But, this week, we do have the fun of welcoming David Appleman, proprietor of FanGraphs.com, where he serves up all sorts of baseball-related statistical gooeyness for us statistically inclined folks.  Read on as David, Eric, and Pizza talk about whom they want on the mound in Game 7, lacking balance, and the Blue Jays’ rotation.
Question #1: If you had to win a single game of baseball, which active starting pitcher would you most want on the mound?
David Appleman: I think the obvious and almost unanimous choice a year ago would have been Johan Santana, but he’s not quite pitching at such a ridiculous level anymore and I’m hesitant to put him at the very top of any list I’d have to make today. Roy Halladay is one of the few other pitchers that comes to mind and he’s currently having the best season of his career. He leads the majors in K/BB, and is an extreme groundball pitcher to boot, meaning you’re going to keep your home runs to a minimum.
After sifting through the stats a bit more, Josh Beckett really stood out to me. While his ERA (3.84) doesn’t show it, he’s pitching arguably better than he ever has. He’s striking out over a batter an inning and his walks are at a career low. And while I typically don’t put a whole lot of weight on post-season statistics, he does seem to have a knack for the big game, which shows up in both his ERA and his peripherals.
Finally, C.C. Sabathia, who also doesn’t have a great ERA (4.06) has been as good as anyone after April. He’s striking out batters at a career pace and there’s not really anything bad you can say about the guy’s pitching. Since I brought up Beckett’s postseason, I guess it’s only fair I mention Sabathia’s. Yes, he was horrible last year and issued way too many walks, but I’d still give him the benefit of the doubt.
While it’s a tough decision, I think I’d have to go with Roy Halladay as my #1 guy right now, closely followed by C.C. Sabathia, and then I’d have to insert Santana as my #3 choice (because despite his slight decline, he’s still very good), with Beckett left as an alternate.
Eric Seidman: It’s tough because we have to set some parameters.  If we’re talking about one game right now with only active pitchers (meaning nobody on the DL) I would probably pick CC Sabathia or Cole Hamels.  Sabathia’s poor start was mainly attributable to two consecutive early starts in which he allowed 9 runs each.  Since then he has been stellar.  Hamels is Hamels, one of the best pitchers in baseball.  If we’re talking about anyone, I’ve always been a member of the “give me John Smoltz in a must-win” bandwagon.  According to FanGraphs’ clutch stats, Vicente Padilla has been the guy people should want on the mound the most; after watching him for years in Philadelphia I’ll have to disagree there, though.  If I have to pick I’ll say Hamels simply because he has proven himself capable of “stopping” and has the mental makeup required to sustain the confidence level required in a must-win situation.
Pizza Cutter: Brandon Webb keeps the ball on the ground the best and has a quite-good K/BB ratio and has an FIP of 3.00.  Roy Halladay has the best K/BB ratio and a quite-good keep it on the ground ratio, plus an FIP of 2.85.  Halladay throws harder and is less reliant on his fast ball, throwing it only 45% of the time (to Webb’s 70%).  Halladay gets my vote, as of right this moment.  Odd that Halladay probably wouldn’t be the first name off anyone’s lips in the general public.  A guy that good toils in obscurity.  Sad.

Heater Getting Hotter

Yesterday we looked at the averages of fastballs from different velocity groups as a means to compare certain pitchers to their like-throwing peers as opposed to an extremely broad group.  This way, we can compare Matt Cain’s movement to the average movement for all 94 mph fastballs to determine how effective it has been.
In doing so an anomaly surfaced: all velocity groups had a BABIP between .290-.310 except those thrown 97 mph.  Those heaters registered a .273 BABIP, nearly 20 points below the others.  Sure enough, fastballs registering 98 mph or higher jumped back to .293, leading many of us to believe something screwy, flukey, or any other adjective ending with the suffix “-y” slapped on its end, was taking place.  After exploring some logical possibilities, like a split-half reliability test, or a look at BABIP by count and location, the results either stuck or were inconclusive due to small sample sizes at work.
We had a really nice discussion in the comments section wherein more possibilities were tossed around.  The first of these suggestions involved testing the sample size via a Bernoulli Trial.  As was shown by commenter Adam Guetz, for an observed .273 when a .295 was expected, we would need approximately 1,650 balls in play.  For 97 mph pitches there were 707 balls in play, less than half of what is required, and just 325 balls in play for 98+ mph.  While the sample sizes of actual pitches thrown are large enough to conduct certain analyses, those of balls in play for anything 97 mph or higher were not.  Here are the BIP sample sizes:

  • 92 mph, 18.85% BIP and 7,759 total
  • 93 mph, 18.05% BIP and 6,023 total
  • 94 mph, 18.05% BIP and 4,389 total
  • 95 mph, 17.04% BIP and 2,827 total
  • 96 mph, 17.26% BIP and 1,596 total
  • 97 mph, 16.69% BIP and 707 total
  • 98+ mph, 16.11% BIP and 325 total
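
For readers who want to see where a figure like 1,650 can come from, here is a minimal sketch. It assumes the usual normal approximation to a Bernoulli proportion and a two-sided 95% threshold (z = 1.96); the comment thread's exact method is not spelled out here, so treat this as one standard calculation that lands in the same place, not necessarily the one that was used. The function name is hypothetical.

```python
import math

def required_balls_in_play(expected, observed, z=1.96):
    """Balls in play needed before the gap between the observed and expected
    BABIP equals z standard errors (normal approximation to a Bernoulli trial).
    """
    gap = abs(observed - expected)
    variance = expected * (1.0 - expected)  # variance of a single trial
    return math.ceil(z ** 2 * variance / gap ** 2)

needed = required_balls_in_play(expected=0.295, observed=0.273)

# Balls in play actually available, from the list above.
available = {"97 mph": 707, "98+ mph": 325, "97 and 98+ combined": 707 + 325}
shortfall = {group: needed - bip for group, bip in available.items()}
```

Under these assumptions `needed` comes out at roughly 1,650, and every entry in `shortfall` is positive, which is the small-sample problem in numeric form.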

The samples from 92-96 appear large enough, but the combination of 97 and 98+ still comes a good 500 balls in play below 96 mph on its own.  Another suggestion called for the total number of different pitchers at each interval as well as the number of those comprising certain percentages of the samples.  This way, we might be able to deduce that 97 mph pitches were skewed due to a small group representing the whole; for the lower velocities, which are more common, it is much more likely for the pitches to be more evenly divided amongst a larger group of pitchers.  Here is the number of pitchers for each group, along with those comprising at least 1% of the sample and those comprising at least 5% of the sample:

  • 92 mph: 574 total pitchers, 8 at 1%, 0 at 5%
  • 93 mph: 485 total pitchers, 18 at 1%, 0 at 5%
  • 94 mph: 516 total pitchers, 21 at 1%, 0 at 5%
  • 95 mph: 337 total pitchers, 25 at 1%, 0 at 5%
  • 96 mph: 237 total pitchers, 28 at 1%, 1 at 5%
  • 97 mph: 160 total pitchers, 25 at 1%, 4 at 5%
  • >98 mph: 102 total pitchers, 18 at 1%, 8 at 5%

In the 97 mph group, the four pitchers with at least 5% of the sample combine to represent 23% of the total.  For 98+ mph, the eight pitchers with at least 5% of the sample combine to represent 56% of the total.
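For anyone wanting to run the same check on their own data, here is a rough sketch of the counting involved.  The pitcher ids and the sample are invented, and the `concentration` helper is a hypothetical name, not anything from the database used for this post.

```python
from collections import Counter

def concentration(pitcher_ids, thresholds=(0.01, 0.05)):
    """Count distinct pitchers and how many account for at least each
    threshold share of the sample, plus their combined share."""
    counts = Counter(pitcher_ids)
    total = sum(counts.values())
    summary = {"pitchers": len(counts)}
    for t in thresholds:
        heavy = [n for n in counts.values() if n / total >= t]
        summary[t] = {"count": len(heavy), "share": sum(heavy) / total}
    return summary

# Made-up sample: one pitcher id per pitch thrown.
sample = ["A"] * 60 + ["B"] * 30 + ["C"] * 6 + ["D"] * 3 + ["E"] * 1
result = concentration(sample)
print(result)
```

The "share" entry is the combined percentage that the heavy contributors represent, the same kind of number as the 23% and 56% quoted above.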
From these results it seems that 92-96 mph are safe from a drastic case of small sample size syndrome.  Anything at or above 97 mph, though, seems to be the opposite, suffering from a small sample of balls in play as well as skewed results due to a small group of pitchers representing most of the total pitches.
Another commenter, Dave Evans, pointed out that he received a significance of 0.55 when comparing 97 and 98+, meaning their BABIPs were not significantly different in the statistical sense; for significance, that value would need to be equal to or below 0.01.  This led me to group 97 and 98+ together, to enlarge the sample.  The result was 1,032 balls in play, 288 hits in play, and a .279 BABIP.  This suggested the possibility that perhaps it was not 97 mph that deserved the adjective+suffix “-y” treatment but rather 98+ mph pitches.  Granted, it is still a small sample, even more so for BABIP, but perhaps we will find out, as more data becomes available, that 97 mph is the threshold, as Pizza Cutter noted, for “blowing it by the hitter.”
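A comparison along these lines can be sketched with a two-proportion z-test.  This is an assumption on my part, since the exact test Dave Evans ran is not stated, and the hit counts of 193 and 95 are back-calculated from BABIP times balls in play (they do add up to the 288 above).

```python
import math

def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for the difference between two proportions,
    using the pooled normal approximation."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hit counts are back-calculated (assumption): 707 * .273 and 325 * .293.
hits_97, bip_97 = 193, 707
hits_98, bip_98 = 95, 325

p_value = two_proportion_p_value(hits_97, bip_97, hits_98, bip_98)
pooled_babip = (hits_97 + hits_98) / (bip_97 + bip_98)
print(round(p_value, 2), round(pooled_babip, 3))
```

With these inputs the p-value lands in the same neighborhood as the 0.55 reported, nowhere near any usual cutoff, and the pooled BABIP rounds to .279.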
It will require several hundred more pitches in play to determine this with any certainty but I will be keeping very close tabs as the season progresses.  For now, though, we can effectively compare individual pitchers to the average movement components, B%, K%, and BABIP for their specific velocity, not an entire group, at least for heaters 92 mph to 96 mph.

Breaking Down the Heater

Back on December 20th, John Walsh wrote a very interesting article at The Hardball Times, taking everything recorded by the Pitch F/X system in 2007 and, among other things, calculating the average velocity, horizontal movement, and vertical movement for the four major pitches: fastball, curveball, slider, and changeup.  The results showed that the average fastball clocked in at 91 mph with -6.2 inches of horizontal movement and 8.9 inches of vertical movement.  The author acknowledged that he did not differentiate between four-seamers, two-seamers, and cutters, but rather lumped them all together in determining the averages; two-seamers and cutters differ in velocity and movement components from four-seamers.

While I plan on calculating the averages for all different sub-groupings of pitches at some point, what recently piqued my interest was finding the averages for different velocity groupings.  As in, what is the average horizontal movement for all 94 mph fastballs?  Or, the BABIP for 98 mph fastballs? 
With that knowledge we could effectively compare certain pitchers to the means of their velocity grouping rather than overall averages of every grouping.  Instead of comparing, say, Edwin Jackson’s 94 mph fastball to a group including those who throw slower, we can compare him to his “peers.” 
I started at 92 mph and queried my database for groupings (92-92.99, 93-93.99, etc.) all the way up to 98+ mph.  I figured 92 mph would be a solid starting point since the sample size would be extraordinarily large, large enough for four-seamers to overcome the two-seamers and cutters that may inevitably sneak in.  Anything 98 mph or higher was grouped together to ensure a large enough sample since, as you will see below, the higher the velocity, the smaller the sample:

Velocity   Sample    %
92 mph     41,157    31.4
93 mph     33,368    25.5
94 mph     24,315    18.6
95 mph     16,586    12.7
96 mph      9,245     7.1
97 mph      4,236     3.2
>98 mph     2,018     1.5

All of the sample sizes here were large enough for analysis.  Even though the 98+ group appears to be 1/20th the size of the 92 mph group, that speaks more for the latter than against the former.
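The grouping itself is simple to reproduce.  The snippet below is a sketch under assumptions: the velocities are invented stand-ins for rows pulled from a pitch database, and `velocity_bucket` is a hypothetical helper, not the query actually used.

```python
from collections import Counter

def velocity_bucket(mph, low=92, cap=98):
    """Label a pitch by its whole-number velocity, lumping everything at
    or above `cap` together; pitches below `low` are left out (None)."""
    if mph < low:
        return None
    if mph >= cap:
        return f"{cap}+ mph"
    return f"{int(mph)} mph"

# Hypothetical velocities, standing in for rows from a pitch database.
velocities = [91.4, 92.0, 92.99, 93.5, 97.2, 98.0, 100.3]
buckets = Counter(b for b in map(velocity_bucket, velocities) if b is not None)
total = sum(buckets.values())
shares = {b: round(100 * n / total, 1) for b, n in buckets.items()}
print(buckets, shares)
```

The `shares` dictionary plays the role of the % column: each group's count divided by the total across all groups from 92 mph up.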
Next, how do the movement components look for each group?

Velocity   Horiz.    Vert.
92 mph     -6.34      9.24
93 mph     -6.28      9.51
94 mph     -6.16      9.80
95 mph     -5.98     10.07
96 mph     -5.84     10.23
97 mph     -5.89     10.41
>98 mph    -6.03     10.38

It should be fairly apparent that the tendency is for horizontal movement to decrease and vertical movement to increase as the velocity increases, at least through 96 mph.  At 97 mph, both movement components increase.  At 98+ mph, the vertical movement stays stagnant while the horizontal movement jumps quite a bit.
The next area to discuss includes B%, K%, HR%, and BABIP:

Velocity   B%      K%      HR%     BABIP
92 mph     35.9    44.6    0.65    .302
93 mph     36.3    45.1    0.55    .303
94 mph     35.5    45.9    0.55    .292
95 mph     35.8    46.4    0.76    .303
96 mph     35.2    47.0    0.54    .291
97 mph     36.1    46.8    0.41    .273
>98 mph    33.9    49.3    0.69    .293

The percentage of balls doesn’t move too much until its dip of over two percentage points at 98+ mph.  The percentage of strikes, however, seems to increase.  There is no real discernible pattern in the home run percentages; the most came on 95 mph heaters while the fewest came on those registering 97 mph.

Speaking of the 97 mph group, notice anything odd?  Perhaps that their BABIP is .273, a full eighteen points below any other group?  Prior to getting the results I expected each group to fall somewhere in the .290-.310 range; that all of them did except the .273 struck me as very peculiar.

I spoke to several other analysts, all of whom initially mentioned small sample size syndrome, only to retract the assessment after learning the sample sizes in question.  The dropoff in home run percentage was tossed around, as well, since fewer home runs means more balls in play to be counted in the BABIP formula.  This is a “could be,” though, rather than a “definitely why.”  As was mentioned in these discussions, too, it could be nothing; perhaps there were more warning track flyballs that just missed leaving the yard as opposed to weaker hit balls.

Now, while the 4,236 pitches at 97 mph constitute a large enough sample to analyze, the balls in play were not numerous enough yet to break into individual counts or locations.  When they do get big enough this could serve as a means of explanation; perhaps something in either or both does not jibe with the other velocity groups.  Of those with significance, however, there was a .263 BABIP on 0-0 counts, and a .286 BABIP on pitches in the middle of the strike zone.

Pizza Cutter, or “The Master of Statistical Reliability” as I like to call him (yeah, a nickname for a nickname), suggested that BABIP is one of those stats that is super-unreliable, even with my large sample of pitches.  I did a split-half reliability test, randomly splitting the sample in half, and calculating the BABIP of each half.  For those unfamiliar, this serves to test the reliability of the sample; if it truly is large enough then no matter how we cut the sample in half we will have fairly convergent results.  If the results were wildly divergent then we are dealing with an unreliable sample.  The BABIPs of the two groups were .271 and .275, which essentially threw that idea out of the window.
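For the curious, a split-half check of this kind can be sketched in a few lines.  The outcomes below are stand-in data (707 balls in play with 193 hits, a count back-calculated from the .273 BABIP), and the function name and seed are arbitrary choices for illustration.

```python
import random

def split_half_babip(outcomes, seed=None):
    """Shuffle ball-in-play outcomes (1 = hit, 0 = out), cut the list in
    half, and return the BABIP of each half."""
    rng = random.Random(seed)
    shuffled = list(outcomes)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    first, second = shuffled[:mid], shuffled[mid:]
    return sum(first) / len(first), sum(second) / len(second)

# Stand-in data: 707 balls in play with 193 hits (about a .273 BABIP).
outcomes = [1] * 193 + [0] * 514
half_a, half_b = split_half_babip(outcomes, seed=1)
print(round(half_a, 3), round(half_b, 3))
```

Running it over many different seeds, rather than once, gives a feel for how far apart two halves of the same sample can drift by chance alone.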

Something interesting to consider was how, in each of these tables, all patterns seemed to stop when they reached 97 mph or higher.  The horizontal movement increased instead of its decreasing trend; vertical movement decreased after its increase at 97; the percentage of strikes ceased increasing; and home runs reached their low.  Could be something, could be nothing, but interesting nonetheless.

For now I am going to chalk this BABIP drop as an extreme random statistical variation and hope that you loyal readers out there might chime in with some more ideas to investigate.  Otherwise, though, when gauging the movement components, percentage of balls/strikes/home runs, or even BABIP, we can compare individual pitchers to their “like-minded” averages by velocity grouping.  If I get enough feedback involving different aspects to measure regarding these fastballs we will look at that soon, in the next day or two.  Otherwise, next week I have something similar to this, looking at BABIP by movement.

Playing the blame game with ground ball singles

I’m building something.  It’s something that I’ve been meaning to do for a while, which is a defense rating system.  In fact, I once defined defense as “something that every Sabermetrician has a system for measuring that he is ‘working on.’ “  I guess now I’m a proper Sabermetrician.
I’m not exactly the first person to tackle this one.  There’s the Fielding Bible with its lovely data from Baseball Info Solutions, which I use as my gold standard.  The problem is that those data are proprietary (read: expensive), and I’m a graduate student.  There are a few other systems that have caught my eye.  Shane Jensen and friends developed the Spatial Aggregate Fielding Evaluation (SAFE) system and they got mentioned in a few newspapers, mostly dismissively, both for being far too nerdy (because the worst thing you can be in baseball is a nerd) and for showing (again) that Derek Jeter isn’t a very good shortstop.  There are plenty of others, and listing them turns into a lovely alphabet soup (PMR, ZR, UZR, RZR, FRAA, DER, and of course the greatest fielding stat ever, fielding percentage).
But, the ancestry on my system traces back in part to my former colleague Sean Smith, who about a year ago here on StatSpeak introduced TotalZone (and here’s part 2 and his latest on the subject), which was a system based only on what was available from Retrosheet, where the data are the perfect price for a graduate student: free.  Dan Fox, formerly of Baseball Prospectus, now of the Pittsburgh Pirates, also went about the business of creating a Retrosheet-based system for fielding, which he called simple fielding runs.  But Sean’s gone from StatSpeak and Dan’s gone to that big front office in the sky… er, Pittsburgh.
So, here, I pick up the baton.  I need a Retrosheet-compatible system that isn’t just a poor man’s rip-off of the other systems.  (On the second, I fear that I shall fail miserably.)  And so I start with the ground ball.  It always ends up in someone’s glove.  Whether that glove is on the hand of an infielder, an outfielder, or the occasional fan is the question.  Usually, it’s a good thing if the man who fields a ground ball is an infielder rather than an outfielder, but who’s to blame if it gets through the infield?  Both Dan’s and Sean’s systems assume that if a ground ball goes through to the left fielder, we can split the blame half-and-half between the third baseman and the shortstop.  They do similar things for CF (50% the fault of the 2B, 50% the fault of the SS) and RF-bound ground balls.  But, does that stand up to the evidence?  I say no.
The problem, of course, with Retrosheet data is that it doesn’t have hit location data (or at least very much) for recent years, and so anyone wanting to know about fielding in the past few years is reduced to making assumptions like this (or buying the BIS data).  However… there is a little bit of data that can be exploited on Retrosheet.  Because RS bought their 93-98 data from somewhere else (Project Scoresheet?) the 93-98 data have hit locations!  They use the Project Scoresheet location system, which uses a series of vectors to code for where the ball was either fielded or where it went through the infield.  I tossed out all of the balls that didn’t make it to the infield skin.  The infielder will make it to the dribbler and the bunt, no doubt.  Whether or not that will be in time for them to be any use is another issue.  But whether the infielder can get to the ball before it gets to the outfield is an important first question, because it’s the first step in throwing the batter out.
The careful reader will have noted that I’m not talking about completing plays and making outs, only about getting to the ball.  First off, it plays into my system on a larger scale.  Secondly, I’m reminded of the old adage about why errors are a faulty stat in that an error means that the fielder did something good in at least getting to the ball.  An infield hit is better than an outfield hit, and in order to get an out on a ground ball, an infielder needs to get to the ball.  (Yeah, you see the occasional 9-3 putout… about as often as I see my cousins who live in Phoenix.  Hi, Mike and Steve!)  So, here I’m looking at the Retrosheet data which indicates by whom the ball was fielded.  Whether or not the play was completed is irrelevant… for now.
Here’s what I did.  I took the 1993-1998 data and built a huge database of ground balls.  I coded for pitcher and batter handedness (it makes a difference!  This had been noted by the ever-reliable John Walsh some time ago.), and, if the ball went to the outfield for a hit, whether the hit that resulted was a single or an extra base hit.  Then, I looked at the spread of balls hit to each zone and who was fielding the balls where.  I tossed out all bunts and anything that didn’t at least make it to the infield skin.  I had ten zones to work with, which can be seen here on this diagram.  It’s not quite what the Fielding Bible does (they have 17 zones), but the Retrosheet data are free.
Let’s look at a ground ball single that gets through to the left fielder in a righty-righty pitcher-batter matchup.  What zone was it usually hit to?  Most often, and fairly obviously, to the hole between short and third (84.1% of the time), a zone marked “56” by Retrosheet.  But, sometimes (7.0%), it went to the zone marked “5” (because that’s where the third baseman is usually standing), and sometimes (6.0%) to “6” and sometimes (2.2%) to “5L” (down the left field line) and sometimes (0.5%) to “6M” (up the middle, to the shortstop side of second base).  There are some weird entries in there that are probably data entry errors (a hit to left field that went through the hole between first and second?) that account for the rest of the numbers (if you add, that’s only 99.8%).  We can re-create the same database for all handedness-type of hit-fielded by combos.  In fact, I did.  Something to note is that right-handed hitters were more likely to pull the ball toward more third base-ward zones (and lefties to shortstop-ward zones).  The effects weren’t huge, but they’re far enough away from 50-50 to be notable.
Now, who’s in charge of each of those zones?  That’s easy enough to figure out.  When the ball is hit to each of the zones, and it doesn’t scoot through, which infielder usually is the one to field it?  Again, looking at our righty-righty matchup, we get the following splits.
Zone   SS got it   3B got it
5L       1.1%         98.6%
5         0.3%         98.8%
56       41.9%       57.4%
6         97.6%       1.2%
6M     88.1%       0.1%  (the second baseman and pitcher pick up the other 11.8%)
Again, note that a right-handed hitter pulled the ball closer to the third baseman (see zone 56).  The pattern was slightly reversed for lefties, although not as extreme.  Now, it’s a matter of simple multiplication to see what share of the blame each of the two fielders should get for a hit to left field.  Since 84.1% of GB singles to left from righty-righty matchups go to zone “56”, and they are 57.4% the responsibility of the third baseman, then he gets 48.2% of the blame for GB singles to left, plus whatever other responsibilities he gets from the other four zones we’re focusing on.  In fact, he ends up with 54.2% of the blame for a single to left, given a righty-righty matchup.  It’s not 50-50, although in fairness to the other systems, it’s close.  When looking at hits to center field, the pattern becomes a little more pronounced, with more of a 60-40 split to the shortstop for right-handed batters and to the second baseman for left-handed batters.  50-50 isn’t going to cut it.
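The multiplication can be sketched as a weighted sum over zones.  The inputs below are the rounded righty-righty percentages quoted in this post, and the `blame` function is only an illustration of the idea.  Because the inputs are rounded, the totals it prints will not necessarily match the 54.2% figure, which comes from the full database and may involve steps not described here.

```python
# Share of righty-righty ground-ball singles to left field landing in each zone.
zone_share = {"5L": 0.022, "5": 0.070, "56": 0.841, "6": 0.060, "6M": 0.005}

# When a ball in that zone is fielded in the infield, who usually fields it.
fielded_by = {
    "5L": {"SS": 0.011, "3B": 0.986},
    "5":  {"SS": 0.003, "3B": 0.988},
    "56": {"SS": 0.419, "3B": 0.574},
    "6":  {"SS": 0.976, "3B": 0.012},
    "6M": {"SS": 0.881, "3B": 0.001},
}

def blame(zone_share, fielded_by):
    """Weight each fielder's responsibility for a zone by how often
    singles pass through that zone, then sum across zones."""
    totals = {}
    for zone, share in zone_share.items():
        for fielder, responsibility in fielded_by[zone].items():
            totals[fielder] = totals.get(fielder, 0.0) + share * responsibility
    return totals

zone_56_to_3b = zone_share["56"] * fielded_by["56"]["3B"]  # the single-zone step
totals = blame(zone_share, fielded_by)
print(round(zone_56_to_3b, 3), {f: round(v, 3) for f, v in totals.items()})
```

The same function works for any handedness combination or outfield destination once the two lookup tables are swapped out.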
For a full breakdown of who’s to blame given some other combos, click here.
In the Retrosheet years where we don’t have hit location data, and all we know is that a GB single went through to the left fielder, we at least now have a better idea of where to place the blame among the infielders.
A few caveats.  One is the obvious fact that I’m going to be using data from 1993-1998 and assuming that it still holds up 10-15 years later.  Indeed, I’ve shown that baseball players are getting bigger and that they are probably getting slower.  This could certainly affect range and I suppose could in turn affect those numbers.  The other is in extending this system.  Dan Fox, when originally developing SFR, had in mind a system that could be applied to minor league data.  Applying my splits there would assume that minor leaguers hit like major leaguers and have similar spray charts.  That may very well be the case, but without hard data, we have no way to know.

Who the heck is Chris Antonetti?

For those of you who were paying attention to this week’s World Famous StatSpeak Roundtable, Eric asked the question of who would be the first GM to be fired.  Eric’s got an odd knack for these things.  A few weeks ago, he asked the question of who would throw the first no-hitter of the year.  The next day, Jon Lester went out and did that.  This time, on Monday afternoon, right after the Roundtable came out, Bill Bavasi of the Seattle Mariners was told that he should find an alternate line of work.
The next question is who will be the next General Manager of the Seattle Mariners, and the Mariner faithful over at U.S.S. Mariner seem to have chosen their champion in Chris Antonetti.  Now, lest we get ahead of ourselves, no one in the Mariners organization has said anything about him publicly nor has Antonetti said anything about the Seattle job, and this is just one blog’s speculation.  But, the guys over at USSM (including previous roundtable guest Dave Cameron) are usually pretty spot-on with these things… and it does kinda make sense.  Read on.
Who is Antonetti?  He’s an assistant GM in Cleveland, charged mostly with the quantitative analysis and contract negotiations.  He’s one of the reasons that the Indians have been so quick to embrace quantitative analysis (i.e. Sabermetrics) in their decision-making process.  The exact details of his biography aren’t all that important right now, but the ones of greatest relevance are these.  Most GMs are former players.  They may not have been major leaguers, but most of them logged some time in the minors.  Antonetti did not.  In fact, he’s only 32, which makes him younger than some of the players whom he would generally manage.  Antonetti, instead, has an academic background, with an advanced degree in sports management.  Cleveland GM Mark Shapiro leans on him to work the numbers.  Word on the street is that he’s very good at what he does, and would probably make someone an outstanding GM.
If the press reports are to be believed (and I believe everything that the media tells me), Antonetti was heavily considered for the St. Louis Cardinals GM vacancy last winter, as well as the Pittsburgh job (which went to fellow Indians’ assistant GM Neal Huntington, who is also Saber-sympathetic).  Reports were that the Indians lured Antonetti away from taking one of those jobs by making him a well-compensated man and promising that he would eventually succeed current Indians GM Mark Shapiro in a few years.  But, as Derek over at USS Mariner points out in his plea for Antonetti to come to the Great Northwest, there’s a lot to be said for the Seattle position being a good fit for someone of Antonetti’s ilk.  Derek points out that in addition to the lovely Seattle culture (I still have all my Nirvana CDs), he’d be in a relatively low-stress setting media-wise with a big budget and a surrounding community with a lot of high-powered technologically minded people (think: Microsoft lives in Seattle).  I don’t know Chris Antonetti personally, and I don’t know if he has any interest in taking the job (speaking as an Indians fan, I hope not…), but he would seem to be a really good candidate.  He’s been a big part of the Indians taking a mid-market payroll and turning it into a contending team.  Imagine what he could do with a license to rebuild the team from the ground up and ownership that would actually push the payroll into nine digits.
But Antonetti is something more than just a hot assistant GM being mentioned as a possible candidate for a job.  What happens with the Seattle situation and whether or not they approach Antonetti is a measuring stick in how far the Sabermetric movement has come in being accepted in the mainstream of baseball culture.  Would a team that has had Bavasi, considered to be a traditionalist in his methods, as their GM and has stuck by him for as long as they have turn about and pick a guy who’s much more from the Sabermetric school?  It’s not like there aren’t Saber-friendly GMs out there.  (I think I read somewhere that Billy Beane was rather amenable to the idea.)  But, an Antonetti hire would begin to represent a critical mass of acceptance.  Suddenly, there would be a few stat-head GMs around (Beane, Theo Epstein, Shapiro, Huntington) and the last few GM hires in the game would have at least had serious candidates who were statheads.
So, the Sabermetrician in me sees this as a possible defining moment.  Maybe it’s just the fact that I was a skinny, nerdy kid who couldn’t hit, run, throw, or field.  (I think my friends all just thought in unison, “was?”)  But does the fact that those of us who weren’t physically gifted persevered in the game we loved by seeing the game through the prism of reason and intellect mean that we can’t have a seat at the decision-making table?  More and more the answer is becoming that we can have that spot at the table, and I’m happy with how far the movement has come, but this feels like it would be a clincher.  The Mariners could actually send the message that baseball is ready for a statistical revolution, that no longer will they be afraid of guys with calculators who might challenge the accepted wisdom.  Baseball might actually move into the Enlightenment.  An amazing thought.
Whether or not Chris Antonetti gets the job, I hope that the Mariners make a lot of noise about wanting him.  It’s up to him whether he would even be receptive to such overtures, but if the Mariners make it a point to pursue him (and loudly), there’s a message in there.  The Indians fan in me hopes that Chris Antonetti is happy to stay in Cleveland and enjoy some of that lovely Cleveland culture (Rock and Roll Hall of Fame!  Drew Carey!  Midges!  Me!), so that he can use the oodles of talent that he has to keep the Indians contending.  But maybe, just maybe, for the good of dragging baseball, kicking and whining, into something bigger, I can be convinced to let Chris Antonetti go.
