Database Toolbox: Tables

First, I want to commend Colin for Part I and Part II of his series on queries for relational databases. However, a part which I feel has been overlooked is proper design of the tables in your database. Poor design will complicate querying as well as maintenance and data entry for the database.

Currently I have copies of the Lahman database, Sean Forman’s Baseball DataBank, KJOK’s ballparks, Westbay’s Japanese database, RetroSheet and Pitch f/x. All of them have their areas of concentration. None have a complete set of tables. Mostly they are not normalized. I greatly appreciate the efforts of the people who built these databases, but I am looking for a comprehensive design that has the tables and relationships that will allow me to store all the data in one place, and will work for any set of data – majors, minors, college, amateur, you name it. And, I want to be able to update it daily from mlb.com’s GameDay files. It can be done, and I’ve been back working on it the past week. I’m not done yet, but I have made quite a bit of progress. I’m nearing completion of the tables for season level data, then will proceed to the game and pitch level. Meanwhile, I wrote this tutorial on tables to show my understanding of how relational databases should be constructed, which will then explain the logic and necessity behind my design.

Now, on to today’s lesson:

Tables are basically lists of things. I like to refer to two types of tables, objects and events. Objects are things, like people, teams and ballparks. You would want one row for each, then a column for each description of the object. Events happen, and involve combinations of objects, with columns to describe who, what, when, where. Most often, event tables look up data referenced from object tables.

The first thing to do is make a list of the objects and events that will be a part of your database, and then start listing the descriptions of each. Players have a name, date and place of birth, date and place of death (if applicable), batting hand, throwing hand, height and weight. Umpires have a name, date and place of birth, and date and place of death. Those are all fields we just listed for players, although the players have additional fields. What players and umpires have in common is they are all people – so let’s create a Persons table that lists those things common to all persons, regardless of their role on the field. The Players and Umpires tables will link to the Persons table, and can inherit all the columns listed there. Now, what about height and weight? Did Barry Bonds or Kevin Mitchell weigh the same in their 30s as they did when they were rookies? Some data never changes (birthdate, unless someone stole their cousin’s birth certificate) and some data can change every year (weight, or even batting hand). Those things which can be different each season should be put in a table which would contain a record for each combination of year and player, team or ballpark.

Database professionals have devised a set of rules to “normalize” the tables in a database. The main goal is to eliminate redundancy – we don’t want the same piece of data entered more than once.

1st normal form

Don’t have more than one column doing the same function

Don’t have duplicate rows

Today, there are normally four umpires for each Major League game. In the post-season, there are six. We could create six columns, UmpHP, Ump1b, etc, in our Games table to hold the codes for the umpire assigned to that position during that game. During the regular season, the two columns for lf and rf will be left blank. A hundred years ago, there may have been only two umpires, so four columns would be left blank. What happens if there are seven umpires? Instead, normalization requires a new table for umpire assignments, which would have as fields GameID, UmpireID and UmpirePos. With one record for each umpire in each game, there is no limit on the number of umpires, and there are no blank fields.
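The umpire-assignment table described above can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table name UmpireAssignments is an assumption, and the game and umpire codes are made up.

```python
import sqlite3

# In-memory SQLite stands in for MySQL. Column names follow the text
# (GameID, UmpireID, UmpirePos); the table name and sample codes are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE UmpireAssignments (
        GameID    TEXT NOT NULL,
        UmpireID  TEXT NOT NULL,
        UmpirePos TEXT NOT NULL,
        PRIMARY KEY (GameID, UmpirePos)   -- one umpire per position per game
    )
""")

rows = [
    # a regular-season game: four rows, nothing left blank
    ("G001", "ump01", "HP"), ("G001", "ump02", "1B"),
    ("G001", "ump03", "2B"), ("G001", "ump04", "3B"),
    # a post-season game: the same table simply gets two more rows
    ("G002", "ump01", "HP"), ("G002", "ump02", "1B"),
    ("G002", "ump03", "2B"), ("G002", "ump04", "3B"),
    ("G002", "ump05", "LF"), ("G002", "ump06", "RF"),
]
conn.executemany("INSERT INTO UmpireAssignments VALUES (?, ?, ?)", rows)

# Count the crew size for each game.
crew_sizes = dict(conn.execute(
    "SELECT GameID, COUNT(*) FROM UmpireAssignments GROUP BY GameID"
).fetchall())
print(crew_sizes)
```

A seventh umpire would be a seventh row, with no change to the table layout.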

To eliminate duplicate rows, databases will insist on having one or more fields defined as a “primary key”, the value of which will be unique in each record. There will be an option to autonumber the records (first entered has an ID of 1, second 2, etc), or the user can select data fields. For example, RetroSheet has given each Major League team, player and ballpark in history an alpha-numeric code. Because these codes are unique, they can be used as the primary key of the teams, players and ballparks tables.

Using codes instead of auto IDs is most helpful when merging sets of data. Suppose I have different databases for the majors, minors and Japan, all using the same table layout. Then I decide to merge these into one database. If the tables use autonumbers for their primary and foreign keys, team “1” in the major league db might be Altoona Mountain City (ALT), while team “1” in the Japan db might be Chiba Lotte Marines (CHB). The players on the Marines, who have “1” as their team number, may now be listed as being on Altoona. The codes can be designed so that all primary keys of lookup tables have unique values across all leagues, and thus will maintain correct linkages after merges.
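The collision is easy to reproduce. In this toy sketch each "Teams" table is reduced to a Python dictionary keyed by its primary key. The dictionaries are hypothetical stand-ins, not real tables.

```python
# Hypothetical one-row "Teams" tables from two separate databases,
# keyed by autonumber ID.
majors_auto = {1: "Altoona Mountain City"}
japan_auto = {1: "Chiba Lotte Marines"}

# Merging on autonumber keys: both teams claim ID 1, so one replaces the other.
merged_auto = {**majors_auto, **japan_auto}

# The same two tables keyed by unique codes instead.
majors_code = {"ALT": "Altoona Mountain City"}
japan_code = {"CHB": "Chiba Lotte Marines"}
merged_code = {**majors_code, **japan_code}

print(len(merged_auto))  # number of teams that survived the autonumber merge
print(len(merged_code))  # number of teams that survived the code-based merge
```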

2nd normal form

Eliminate repeating values in a column

In our persons table, there’s a field for place of birth. 163 players have been born in Cuba, 15116 in the United States. We could repeat United States, or U.S., or US, or USA, more than fifteen thousand times, or we could put a single record in a new Countries table, and have every person link to that record. If a name is misspelled, it only needs to be corrected in one place. It’s much less of a hassle for the database administrator (you) to verify that there’s only one version of the name being used, so that a query asking for all the players born in “USA” wouldn’t miss the players in “US” or “U.S.” etc. Instead of using the actual text in the CountryBorn field, it would hold a “Foreign Key”, a code or number which matches the Primary Key of the table being used as a lookup. The Countries table can then hold additional detail in its columns, such as Abbreviation and FullName.
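A minimal sketch of the lookup arrangement follows, again with Python's sqlite3 module standing in for MySQL. CountryBorn, Abbreviation and FullName come from the text; the key name Country_CD, the person codes and the surnames are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE Countries (
        Country_CD   TEXT PRIMARY KEY,     -- hypothetical key name
        Abbreviation TEXT NOT NULL,
        FullName     TEXT NOT NULL
    );
    CREATE TABLE Persons (
        Person_CD   TEXT PRIMARY KEY,
        nameLast    TEXT NOT NULL,
        CountryBorn TEXT REFERENCES Countries (Country_CD)
    );
    INSERT INTO Countries VALUES ('USA', 'US', 'United Sates');  -- deliberate typo
    INSERT INTO Countries VALUES ('CUB', 'CU', 'Cuba');
    INSERT INTO Persons VALUES ('p001', 'Doe', 'USA');
    INSERT INTO Persons VALUES ('p002', 'Roe', 'USA');
    INSERT INTO Persons VALUES ('p003', 'Moe', 'CUB');
""")

# The misspelling is corrected once, in the lookup table, for every person.
conn.execute("UPDATE Countries SET FullName = 'United States' WHERE Country_CD = 'USA'")

born_usa = conn.execute("""
    SELECT COUNT(*)
    FROM Persons p
    JOIN Countries c ON c.Country_CD = p.CountryBorn
    WHERE c.FullName = 'United States'
""").fetchone()[0]

# A stray spelling such as 'U.S.' is not a valid key, so the insert is refused.
try:
    conn.execute("INSERT INTO Persons VALUES ('p004', 'Poe', 'U.S.')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(born_usa, rejected)
```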

3rd normal form

No columns calculated from other columns

If H and AB are already in the batting table, there is no need to also include BA as it is derived from the other fields. If there’s an error, and someone’s hits need to be changed, the BA also would need to be changed. If a new record is entered, the BA would have to be calculated before entering the data. Queries are the method of doing calculations based on the data in the tables. Many baseball databases put everything into the tables and have no queries, intending the user to browse the tables. MySQL calls saved queries “Views” for a reason – that is where you are meant to go to view the data.
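As a rough illustration of leaving BA out of the table, the sketch below stores only H and AB and derives the average in a view. It uses Python's sqlite3 module in place of MySQL; the view name BattingAvg and the sample row are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Batting (Player_CD TEXT, Year INTEGER, AB INTEGER, H INTEGER);
    INSERT INTO Batting VALUES ('p001', 2007, 500, 150);

    -- BA is never stored; the view derives it from H and AB on demand.
    -- NULLIF avoids dividing by zero for a player with no at-bats.
    CREATE VIEW BattingAvg AS
        SELECT Player_CD, Year,
               ROUND(1.0 * H / NULLIF(AB, 0), 3) AS BA
        FROM Batting;
""")

before = conn.execute("SELECT BA FROM BattingAvg").fetchone()[0]

# Correcting the hit total is a single update; the average follows automatically.
conn.execute("UPDATE Batting SET H = 155 WHERE Player_CD = 'p001'")
after = conn.execute("SELECT BA FROM BattingAvg").fetchone()[0]

print(before, after)
```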

There are three types of relationships between tables.

One to One

Since the late 1980s, we’ve had basic pitch data available. First pitch was a ball, second was a strike, etc. A Pitches table would have PA_ID, PitchNum, Result. A couple years ago pitch f/x added more than a dozen new fields for each pitch. Instead of going back to the Pitches table and adding all those new fields, all of which will be blank for the years when pitch f/x was not available, we should create a new Pitchfx table, which acts as an extension of the Pitches table. Records in both tables are identified by PA_ID and PitchNum. There can be no more than one Pitchfx record for each Pitches record, and vice versa. One pitch, one pfx record.
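The extension-table idea might look like the sketch below, written with Python's sqlite3 module rather than MySQL. PA_ID, PitchNum and Result are from the text; start_speed is a hypothetical example of a pitch f/x field, and the two sample pitches are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pitches (
        PA_ID    INTEGER NOT NULL,
        PitchNum INTEGER NOT NULL,
        Result   TEXT,
        PRIMARY KEY (PA_ID, PitchNum)
    );
    -- Same primary key as Pitches, so each pitch has at most one Pitchfx row.
    CREATE TABLE Pitchfx (
        PA_ID       INTEGER NOT NULL,
        PitchNum    INTEGER NOT NULL,
        start_speed REAL,              -- hypothetical pitch f/x field
        PRIMARY KEY (PA_ID, PitchNum),
        FOREIGN KEY (PA_ID, PitchNum) REFERENCES Pitches (PA_ID, PitchNum)
    );
    INSERT INTO Pitches VALUES (1, 1, 'ball'), (1, 2, 'strike');
    INSERT INTO Pitchfx VALUES (1, 2, 94.3);   -- only the second pitch was tracked
""")

# LEFT JOIN keeps every pitch; untracked pitches show None for the f/x field.
rows = conn.execute("""
    SELECT p.PA_ID, p.PitchNum, p.Result, f.start_speed
    FROM Pitches p
    LEFT JOIN Pitchfx f
      ON f.PA_ID = p.PA_ID AND f.PitchNum = p.PitchNum
    ORDER BY p.PA_ID, p.PitchNum
""").fetchall()
print(rows)
```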

One to Many

Each plate appearance can have one or more pitches, each with their own record and description in the Pitches table, but each of those pitches can only belong to one plate appearance. One PA, many pitches.

Many to Many

A player can be, at one time or another, a member of many teams. Teams have many players. To show these relationships, a third (event) table is needed to show all the combinations of the two object tables, showing PlayerID, TeamID, year and then whatever detail applies, such as batting or pitching statistics. Many players, many teams.
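A small sketch of the third (event) table, using Python's sqlite3 module as a stand-in for MySQL. The table name PlayerTeamYears, the codes and the games-played figures are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Players (PlayerID TEXT PRIMARY KEY);
    CREATE TABLE Teams   (TeamID   TEXT PRIMARY KEY);

    -- The event table: one row per player, team and year combination.
    CREATE TABLE PlayerTeamYears (
        PlayerID TEXT NOT NULL REFERENCES Players (PlayerID),
        TeamID   TEXT NOT NULL REFERENCES Teams (TeamID),
        Year     INTEGER NOT NULL,
        G        INTEGER,
        PRIMARY KEY (PlayerID, TeamID, Year)
    );

    INSERT INTO Players VALUES ('p001'), ('p002');
    INSERT INTO Teams   VALUES ('AAA'), ('BBB');
    INSERT INTO PlayerTeamYears VALUES
        ('p001', 'AAA', 2006, 120),
        ('p001', 'BBB', 2007, 98),
        ('p002', 'BBB', 2007, 140);
""")

# One player, many teams.
teams_for_p001 = [r[0] for r in conn.execute(
    "SELECT TeamID FROM PlayerTeamYears WHERE PlayerID = 'p001' ORDER BY Year")]

# One team, many players.
players_on_bbb = [r[0] for r in conn.execute(
    "SELECT PlayerID FROM PlayerTeamYears WHERE TeamID = 'BBB' ORDER BY PlayerID")]

print(teams_for_p001, players_on_bbb)
```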

Here are some examples of MySQL code to generate the structure for a few tables. After the tables are created and linked, data can be inserted. At this time, I want to focus on the PRIMARY KEY and FOREIGN KEY statements. Please note that these tables are for illustrations only. They do not represent a complete operational database as some fields have been deleted for brevity, and some other tables are not shown.

The Batting table uses PRIMARY KEY (`Player_CD`,`Year`,`stint`) to identify the three fields that in combination make each record unique; stint numbers a player’s separate spells within the same year. FOREIGN KEY (`Player_CD`,`Year`) REFERENCES Players (`Player_CD`,`Year`) relates the Batting table to the primary key of the Players table, to get information about that player in that year. The Players table in turn uses Player_CD to reference Person_CD in the Persons table, where name and birth information is available.

The Batting table also uses Team_CD and Year to reference the same fields in TeamSeasons. TeamSeasons in turn references tables that can return, for the given year, the team’s league, location, name, and home ballpark. In this way team TBA can be the Tampa Bay Devil Rays in 2007, but then the Tampa Bay Rays in 2008.

-- Note: MyISAM parses FOREIGN KEY clauses but does not enforce them.
DROP TABLE IF EXISTS `Batting`;
CREATE TABLE `Batting` (
  `Player_CD` varchar(10) NOT NULL default '',
  `Year` smallint(4) unsigned NOT NULL default '0',
  `stint` smallint(2) unsigned NOT NULL default '0',
  `Team_CD` char(3) NOT NULL default '',
  `G` smallint(3) unsigned default NULL,
  `AB` smallint(3) unsigned default NULL,
  `R` smallint(3) unsigned default NULL,
  `H` smallint(3) unsigned default NULL,
  `DO` smallint(3) unsigned default NULL,
  `TR` smallint(3) unsigned default NULL,
  `HR` smallint(3) unsigned default NULL,
  `RBI` smallint(3) unsigned default NULL,
  `SB` smallint(3) unsigned default NULL,
  `CS` smallint(3) unsigned default NULL,
  `BB` smallint(3) unsigned default NULL,
  `SO` smallint(3) unsigned default NULL,
  `HBP` smallint(3) unsigned default NULL,
  `IBB` smallint(3) unsigned default NULL,
  `SH` smallint(3) unsigned default NULL,
  `SF` smallint(3) unsigned default NULL,
  `GDP` smallint(3) unsigned default NULL,
  -- one row per player, year and stint
  PRIMARY KEY (`Player_CD`,`Year`,`stint`),
  FOREIGN KEY (`Player_CD`,`Year`) REFERENCES Players (`Player_CD`,`Year`),
  FOREIGN KEY (`Team_CD`,`Year`) REFERENCES TeamSeasons (`Team_CD`,`Year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `Players`;
CREATE TABLE `Players` (
  `Player_CD` varchar(10) NOT NULL default '',
  `Year` smallint(4) unsigned NOT NULL default '0',
  `bats` enum('L','R','B') default NULL,
  `throws` enum('L','R','B') default NULL,
  `weight` int(3) default NULL,
  `height` double(4,1) default NULL,
  -- one row per player per season, for data that can change yearly
  PRIMARY KEY (`Player_CD`,`Year`),
  FOREIGN KEY (`Player_CD`) REFERENCES Persons (`Person_CD`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `Persons`;
CREATE TABLE `Persons` (
  `Person_CD` varchar(10) NOT NULL default '',
  `nameLast` varchar(50) NOT NULL default '',
  `nameFirst` varchar(50) default NULL,
  `nameGiven` varchar(255) default NULL,
  `birthYear` int(4) default NULL,
  `birthMonth` int(2) default NULL,
  `birthDay` int(2) default NULL,
  PRIMARY KEY (`Person_CD`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `TeamSeasons`;
CREATE TABLE `TeamSeasons` (
  `Year` smallint(4) unsigned NOT NULL default '0',
  `Franchise_CD` char(3) NOT NULL default 'UNK',
  `Team_CD` char(3) NOT NULL default '',
  `TeamLocation_CD` char(3) default NULL,
  `TeamSponsor_CD` char(3) default NULL,
  `TeamName_CD` char(3) default NULL,
  `League_CD` char(3) NOT NULL default '',
  `Division_CD` char(1) default NULL,
  `Park_CD` char(5) default NULL,
  PRIMARY KEY (`Team_CD`,`Year`),
  FOREIGN KEY (`Franchise_CD`) REFERENCES Franchises (`Franchise_CD`),
  FOREIGN KEY (`Team_CD`) REFERENCES Teams (`Team_CD`),
  FOREIGN KEY (`TeamLocation_CD`) REFERENCES TeamLocations (`TeamLocation_CD`),
  FOREIGN KEY (`TeamSponsor_CD`) REFERENCES TeamSponsors (`TeamSponsor_CD`),
  FOREIGN KEY (`TeamName_CD`) REFERENCES TeamNames (`TeamName_CD`),
  FOREIGN KEY (`League_CD`) REFERENCES Leagues (`League_CD`),
  FOREIGN KEY (`Division_CD`) REFERENCES Divisions (`Division_CD`),
  FOREIGN KEY (`Park_CD`,`Year`) REFERENCES ParkSeasons (`Park_CD`,`Year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

 

 

Recapping the BIP

Before even getting into the meat of this article, no, the title does not refer to Bip Roberts… so I’ll understand if hardcore fans of his are now turned off.  What the title does refer to, however, is balls in play and how they pertain to the statistics BABIP, FIP, and ERA.  I have written a lot here and on my other stomping grounds of late about how some of these statistics are affected and, seeing as it is a holiday weekend with not much interweb usage, it seemed like the logical time to recap everything into one neat package.  For starters, what are these three statistics?
BABIP: Batting Average on Balls In Play is a statistical spawn of the DIPS theory discovered by Voros McCracken at the turn of the century.  Essentially Voros found that pitchers have next to no control over balls put in play against them, which is why certain pitchers would surrender a ton of hits one year and much less the next.  From a control standpoint, the goal of the pitcher would be to get an out.  Once a ball is put in play, unless it is hit right back to the pitcher many defensive aspects have to coincide for an out to result.  Take a groundball for instance, one between shortstop and third base: both fielders have to understand whose territory the ball occupies and that fielder has to have the proper range in order to field it, all in a very short amount of time. 
There are plenty of other variables as well but what should be clear is that the pitcher has no control over them.  He may have control over sustaining a certain percentage of balls in play each year but the hits that result are almost entirely out of his hands.  In fact, the only aspects of pitching over which he has any type of control are walks, strikeouts, and home runs allowed.  Everything else is dependent on the fielding and luck.
BABIP is calculated by dividing Hits minus Home Runs by At-Bats minus Strikeouts and Home Runs, plus Sacrifice Flies: (H - HR) / (AB - K - HR + SF).  If Player A has 30 hits in 90 at-bats he will post a .333 batting average.  But if 8 of those 30 hits are home runs and 8 of the outs are strikeouts, in BABIP terms he would be 22 for 74, or .297.  This means that, of all balls put in play (any hit or batted out other than a home run), 29.7% fell in for hits.
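The Player A arithmetic can be checked with a short function. This is a sketch that uses the conventional form of the formula, (H - HR) / (AB - SO - HR + SF); the function and parameter names are chosen here for illustration.

```python
def babip(h, hr, ab, so, sf=0):
    """Batting average on balls in play: (H - HR) / (AB - SO - HR + SF)."""
    balls_in_play = ab - so - hr + sf
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (h - hr) / balls_in_play


# Player A from the text: 30 hits in 90 at-bats, 8 home runs, 8 strikeouts.
batting_average = 30 / 90
player_a = babip(h=30, hr=8, ab=90, so=8)   # 22 hits on 74 balls in play

print(f"{batting_average:.3f} {player_a:.3f}")
```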
FIP: a creation of Tom Tango’s, Fielding Independent Pitching takes the three controllable skills of walks, strikeouts, and home runs allowed, properly weights them, and then scales the result similar to the familiar ERA.  The end result explains what a pitcher’s skillset suggests his ERA should be around.  Someone with an ERA much lower than their FIP is usually considered to be lucky while the inverse is also true.  The statistic is kept at Fangraphs and ERA-FIP was recently added as well in order to allow readers a glimpse at those under- or overperforming their controllable skills.
ERA: arguably the most popular pitching barometer, ERA can be calculated by multiplying the earned runs of a pitcher by nine and dividing that product by the total number of innings pitched.  While not a terrible stat it suffers from some pretty drastic noise.  For starters, what are earned runs?  The qualifier ‘earned’ implies there are other runs that can be given up and that these must satisfy specific criteria.  For instance, if a fielder botches a routine play with two outs, and the pitcher then gives up seven runs, none will be earned because the inning was extended by the poor play of the fielder.  This gets into all sorts of questions regarding exactly what an error is and how that factors into a pitcher’s performance.
Earned runs are also a direct result of hits, which have been proven to be largely accrued through chance via the DIPS theory.  So, if pitchers cannot control the percentage of hits they give up on balls in play, then fluctuations in hits can either inflate or deflate an ERA regardless of the pitcher’s skill level.  Therefore the FIP is more indicative of performance level because it only measures the three aspects of pitching he has control over which should not suffer from much fluctuation at all, as Pizza Cutter showed not too long ago that these skills were some of the quickest to stabilize.
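To make the ERA and FIP comparison concrete, here is a hedged sketch. The ERA formula is the one given above. The FIP weights (13, 3, -2) and the additive constant of 3.20 are not stated in this article; they are the commonly published form of the formula and are assumptions here, and the constant in particular varies by season and league. The pitcher's stat line is invented.

```python
def era(earned_runs, innings_pitched):
    """Earned runs times nine, divided by innings pitched."""
    return 9 * earned_runs / innings_pitched


def fip(hr, bb, so, innings_pitched, constant=3.20):
    """Fielding Independent Pitching.

    Assumption: commonly cited weights of 13 (HR), 3 (BB) and -2 (SO),
    plus a league constant that puts the result on the ERA scale.
    """
    return (13 * hr + 3 * bb - 2 * so) / innings_pitched + constant


# Hypothetical pitcher: 200 IP, 70 ER, 20 HR, 60 BB, 180 SO.
pitcher_era = era(70, 200)
pitcher_fip = fip(hr=20, bb=60, so=180, innings_pitched=200)
era_minus_fip = pitcher_era - pitcher_fip

print(f"ERA {pitcher_era:.2f}  FIP {pitcher_fip:.2f}  ERA-FIP {era_minus_fip:+.2f}")
```

A negative ERA-FIP, as in this made-up line, is the pattern the text describes as a pitcher whose ERA is lower than his controllable skills suggest.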
Controlling BABIP
At Fangraphs we occasionally call upon a statistic we titled xBABIP, which refers to what the BABIP of a pitcher can be expected to be given his percentage of line drives.  Dave Studeman found a few years back that the general range of BABIP could be predicted with very good accuracy by adding .12 to the LD%; if a pitcher surrendered 22.1% line drives his xBABIP would be ~.341.  Using this for predictive purposes would not be correct due to the fact that the general baseline for pitchers is .300.  What we can do is evaluate performance at a given time and attribute a rather high or low BABIP to line drives.  For instance, saying that Player B’s BABIP of .275 as of today is primarily due to his ultra-low 14-15% LD rate would be correct; saying that it will continue like this would not.  The line drive percentage may change as the season goes on.  In summation, we can use something like this when evaluating the past for pitchers but not the future.
David Appelman showed not too long ago that, in 2007, 15% of flyballs fell in for hits, 24% of grounders turned into hits, and a whopping 73% of line drives also followed suit.  Due to this, the ideal xBABIP calculation would be .15(FB) + .24(GB) + .73(LD).
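Both xBABIP estimates can be written as one-line functions. The sketch below assumes the batted-ball rates are expressed as shares of balls in play; the function names and the 36/44/20 batted-ball mix are made up for illustration.

```python
def xbabip_from_ld(ld_rate):
    """Quick estimate: line-drive rate plus .12."""
    return ld_rate + 0.12


def xbabip_from_batted_balls(fb, gb, ld):
    """Weighted estimate using the 2007 hit rates quoted above.

    fb, gb and ld are shares of balls in play (assumed to sum to about 1).
    """
    return 0.15 * fb + 0.24 * gb + 0.73 * ld


quick = xbabip_from_ld(0.221)                                   # the 22.1% LD example
weighted = xbabip_from_batted_balls(fb=0.36, gb=0.44, ld=0.20)  # hypothetical mix

print(f"{quick:.3f} {weighted:.3f}")
```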
I have done studies here recently, and Jonathan Hale at Baseball Digest Daily has done others in the past as well, that show how aspects like velocity, movement, and location can all affect the BABIP of a given pitcher.  It has also been shown, again by Studeman, that elite relievers have the ability to consistently post lower BABIPs than others.  More studies have shown that pitchers have, if any, very weak control over their BABIP, but instead of deeming it control I would be more inclined to say that these pitchers are merely taking advantage of “cold spots.” 
If just 15% of flyballs result in hits and such a large number of line drives do, then we could intuitively expect someone with consistently low LD rates and higher FB rates to post lower BABIPs.  From a movement perspective, I found that those with above average vertical movement in different horizontal movement subgroupings post lower BABIPs as well.  Higher vertical movement usually correlates to flyballs, and voila, flyballs have the lowest percentage of hits.
This was just a recap of the three statistics and explanations pertaining to their usage.  Based on this, if we see someone like Carlos Zambrano, whose ERA consistently beats his FIP, based on consistently posting lower BABIPs, we could somewhat safely assume that he might not be controlling anything per se but rather taking advantage of all the aspects proven to result in lower BABIPs.  His controllable skills may not be as good as his ERA would suggest but movement, velocity, and location may have combined to greatly aid his efforts.

Does Movement Influence BABIP?

A couple weeks back, Pizza Cutter found an interesting oddity in that Troy Percival had consistently posted very, very low BABIPs. In response, Dave Studeman brought up Mariano Rivera–another pitcher with consistently low BABIPs–and how it has been somewhat proven that elite relievers can register atypical results with this statistic. Mentioned on a few other sites was the idea that movement may be a central cause for these lower batting averages on balls in play; due to said movement, the sweet part of the bat would fail to meet the ball as it normally would on more “standard” pitches.
Last week, we explored the relationship between fastballs 92+ mph and BABIP, examining how it differed at each mile per hour interval. 92 mph to 96 mph clocked in between .290-.310–the established general range of BABIP for pitchers–before dipping to .273 at 97 mph and shooting back up to .293 for all thrown 98 mph or higher. The 97 and 98+ groups were too small in their sample sizes to definitively fail the 5% hypothesis; we would need around 1,650 balls in play and, combined, had 1,032. Still, the combo of 97 and 98+ offered a .279 BABIP, perhaps suggesting that the .293 at 98+ was the anomaly, not the .273.
Today we will look at the movement within the same 92+ mph range in order to attempt to answer the question posed in the title. First, though, a prerequisite of sorts with regards to movement: the relationship between horizontal and vertical components is not well understood yet, other than some telltale signs aiding in the classification of pitches. For instance, a two-seam fastball will have much higher horizontal movement than vertical movement; however, four-seam fastballs generally have lower horizontal movement and higher vertical movement.
I queried my database for all fastballs 92+ mph and separated the results into groups by movement rather than velocity intervals. The signs (+-) were reversed so that righties and lefties could be grouped together as well. First, here is a sample size grid of sorts, showing all balls in play for each horizontal group and vertical subgroup; note that the subgroups differ for each horizontal movement grouping so they will be called simply below average or above average as they were essentially determined by the average or a similar type of cutoff point. The reasoning for this is the aforementioned relationship between movement components; for fastballs, lower horizontal movement will usually correlate with higher vertical movement with the inverse also being true.

Horizontal   Below Vert BIP   Above Vert BIP
0-4 in       3,735            2,456
4-8 in       6,823            4,718
8-12 in      4,355            3,227
12+ in       408              335

BABIP takes a while to stabilize, more so than many other statistics, so I wanted to have at least 2,000 balls in play for each sub-grouping, preferably more. From 0-12 inches of horizontal movement we have large enough samples to notice discrepancies. Greater than 12 inches, however, offers just 743 balls in play. While I definitely plan to explore this and the velocity articles later in the year when more data is available, for now, I am going to exclude the group with more than 12 horizontal inches.
Looking at the other three groups and their two subgroupings each, here are the Ball%, Strike%, HR%, and BABIP:

Horiz.   Vert.   B%     K%     HR%    BABIP
0-4      Below   35.9   45.6   0.53   .289
0-4      Above   34.9   49.8   0.48   .286
4-8      Below   35.8   43.7   0.64   .302
4-8      Above   35.8   48.2   0.58   .292
8-12     Below   35.6   41.4   0.54   .315
8-12     Above   36.5   45.6   0.58   .298

The percentage of balls essentially stays in the same general range while the strikes fluctuate. The subgroupings with above average vertical movement have much higher strike percentages than the others. So, before we even get to BABIP, it seems that higher vertical movement in these larger groups results in a higher percentage of strikes.

The BABIPs for horizontal movement groups with below average vertical movement register: .289, .302, and .315. The BABIPs for horizontal movement with above average vertical movement clock in at: .286, .292, .298. Judging from these results it would appear that, yes, movement does have some type of effect on BABIP. Each horizontal group posted a higher BABIP when it had below average vertical movement, and at every interval as well: .289 to .286, .302 to .292, and .315 to .298. Additionally, all pitches 92+ mph with 0-4 inches of horizontal movement, regardless of whether or not they fell above or below the vertical cutoff point, produced a BABIP lower than .290, which is generally the lower edge of the .290-.310 range we expect it to fall into.
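The gaps quoted above, along with an overall figure for each vertical subgroup, can be recomputed from the two tables. This is a rough sketch: hit totals are not published in the article, so they are estimated here as BIP times the rounded BABIP, which makes the weighted averages approximate.

```python
# horizontal group -> vertical subgroup -> (balls in play, BABIP), from the tables
groups = {
    "0-4":  {"below": (3735, 0.289), "above": (2456, 0.286)},
    "4-8":  {"below": (6823, 0.302), "above": (4718, 0.292)},
    "8-12": {"below": (4355, 0.315), "above": (3227, 0.298)},
}

# BABIP gap (below-average vertical minus above-average vertical) per group.
gaps = {h: round(v["below"][1] - v["above"][1], 3) for h, v in groups.items()}


def weighted_babip(vert):
    """BIP-weighted BABIP across the three horizontal groups (hits are estimated)."""
    total_bip = sum(v[vert][0] for v in groups.values())
    est_hits = sum(v[vert][0] * v[vert][1] for v in groups.values())
    return est_hits / total_bip


below_overall = weighted_babip("below")
above_overall = weighted_babip("above")

print(gaps)
print(f"{below_overall:.4f} {above_overall:.4f}")
```

The gap widens from 3 points to 10 to 17 as horizontal movement increases, which is the pattern the paragraph describes.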

Tomorrow I’ll come right back with the total number of unique pitchers and those comprising at least 1% and at least 5% of the sample, in order to see if the results are skewed in any way. For now, though, it appears that, regardless of your horizontal movement, having above average vertical movement will produce a lower BABIP at each horizontal interval.

Heater Getting Hotter

Yesterday we looked at the averages of fastballs from different velocity groups as a means to compare certain pitchers to their like-throwing peers as opposed to an extremely broad group.  This way, we can compare Matt Cain’s movement to the average movement for all 94 mph fastballs to determine how effective it has been.
In doing so an anomaly surfaced: all velocity groups had a BABIP between .290-.310 except those thrown 97 mph.  Those heaters registered a .273 BABIP, nearly 20 points below the others.  Sure enough, fastballs registering 98 mph or higher jumped back to .293, leading many of us to believe something screwy, flukey, or any other adjective ending with the suffix “-y” slapped on its end, was taking place.  After exploring some logical possibilities, like a split-half reliability test, or a look at BABIP by count and location, the results either stuck or were inconclusive due to small sample sizes at work.
We had a really nice discussion in the comments section wherein more possibilities were tossed around.  The first of these suggestions involved testing the sample size via a Bernoulli Trial.  As was shown by commenter Adam Guetz, for an observed .273 when a .295 was expected, we would need approximately 1,650 balls in play.  For 97 mph pitches there were 707 balls in play, less than half of what is required, and just 325 balls in play for 98+ mph.  While the sample sizes of actual pitches thrown are large enough to conduct certain analyses, those of balls in play for anything 97 mph or higher were not.  Here are the BIP sample sizes:

  • 92 mph, 18.85% BIP and 7,759 total
  • 93 mph, 18.05% BIP and 6,023 total
  • 94 mph, 18.05% BIP and 4,389 total
  • 95 mph, 17.04% BIP and 2,827 total
  • 96 mph, 17.26% BIP and 1,596 total
  • 97 mph, 16.69% BIP and 707 total
  • 98+ mph, 16.11% BIP and 325 total
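The article does not show how the 1,650 figure was derived. One standard calculation that lands very close to it is the normal-approximation sample size for a single proportion at the 95% level (z = 1.96), sketched below. The choice of z, and the assumption that this matches the commenter's method, are mine.

```python
import math


def required_bip(expected, observed, z=1.96):
    """Balls in play needed for `observed` to differ from `expected` at level z.

    Normal approximation to a run of Bernoulli trials:
        n = z^2 * p * (1 - p) / d^2
    with p the expected rate and d the gap between observed and expected.
    """
    p = expected
    d = abs(observed - expected)
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)


n_needed = required_bip(expected=0.295, observed=0.273)
n_available = 707 + 325   # 97 mph and 98+ mph balls in play combined

print(n_needed, n_available)
```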

The samples from 92-96 appear large enough, but the combination of 97 and 98+ still comes a good 500 balls in play below 96 mph on its own.  Another suggestion called for the total number of different pitchers at each interval as well as the number of those comprising certain percentages of the samples.  This way, we might be able to deduce that 97 mph pitches were skewed due to a small group representing the whole; for the lower velocities, which are more common, it is much more likely for the pitches to be more evenly divided amongst a larger group of pitchers.  Here are the number of pitchers for each group, those comprising 1% of the sample, and those comprising 5% of the sample:

  • 92 mph: 574 total pitchers, 8 at 1%, 0 at 5%
  • 93 mph: 485 total pitchers, 18 at 1%, 0 at 5%
  • 94 mph: 516 total pitchers, 21 at 1%, 0 at 5%
  • 95 mph: 337 total pitchers, 25 at 1%, 0 at 5%
  • 96 mph: 237 total pitchers, 28 at 1%, 1 at 5%
  • 97 mph: 160 total pitchers, 25 at 1%, 4 at 5%
  • 98+ mph: 102 total pitchers, 18 at 1%, 8 at 5%

In the 97 mph group, the four pitchers with at least 5% of the sample combine to represent 23% of the total.  For 98+ mph, the eight pitchers with at least 5% of the sample combine to represent 56% of the total.
From these results it seems that 92-96 mph are safe from a drastic case of small sample size syndrome.  Anything 97 mph or above, though, seems to be the opposite as they suffer from a small sample of balls in play as well as skewed results due to a small group of pitchers representing most of the total pitches. 
Another commenter, Dave Evans, pointed out that he received a significance of 0.55 when comparing 97 and 98+, meaning their BABIPs were not statistically significantly different; for significance, that value would need to be equal to or below 0.01.  This led me to group 97 and 98+ together, to enlarge the sample.  The result was 1,032 balls in play, 288 hits in play, and a .279 BABIP.  This suggested the possibility that perhaps it was not 97 mph that deserved the adjective+suffix “-y” treatment but rather 98+ mph pitches.  Granted, it is still a small sample, even more so for BABIP, but perhaps we will find out, as more data becomes available, that 97 mph is the threshold, as Pizza Cutter noted, for “blowing it by the hitter.”
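A comparison of this kind can be sketched as a two-proportion z-test. The article does not say which test the commenter ran, so this is only one plausible setup and its p-value should not be expected to match 0.55 exactly. The inputs are the rounded BABIPs and ball-in-play counts quoted above.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (pooled rate, z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the normal
    return pooled, z, p_value


# 98+ mph: .293 on 325 BIP; 97 mph: .273 on 707 BIP.
pooled, z, p_value = two_proportion_z(0.293, 325, 0.273, 707)

# The combined line quoted in the text: 288 hits on 1,032 balls in play.
combined_babip = 288 / 1032

print(f"pooled {pooled:.3f}  z {z:.2f}  p {p_value:.2f}  combined {combined_babip:.3f}")
```

Under this setup the p-value is far above any usual significance threshold, in line with the conclusion that the two groups are not significantly different.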
It will require several hundred more pitches in play to determine this with any certainty but I will be keeping very close tabs as the season progresses.  For now, though, we can effectively compare individual pitchers to the average movement components, B%, K%, and BABIP for their specific velocity, not an entire group, at least for heaters 92 mph to 96 mph.

Breaking Down the Heater

Back on December 20th, John Walsh wrote a very interesting article at The Hardball Times, taking everything recorded by the Pitch F/X system in 2007 and, amongst others, calculating the average velocity, horizontal movement, and vertical movement for the four major pitches: fastball, curveball, slider, and changeup.  The results showed that the average fastball clocked in at 91 mph with -6.2 inches of horizontal movement and 8.9 inches of vertical movement.  The author acknowledged that he did not differentiate between four-seamers, two-seamers, and cutters, but rather lumped them all together in determining the averages; two-seamers and cutters differ in velocity and movement components from four-seamers.

While I plan on calculating the averages for all different sub-groupings of pitches at some point, what recently piqued my interest was finding the averages for different velocity groupings.  As in, what is the average horizontal movement for all 94 mph fastballs?  Or, the BABIP for 98 mph fastballs? 
With that knowledge we could effectively compare certain pitchers to the means of their velocity grouping rather than overall averages of every grouping.  Instead of comparing, say, Edwin Jackson’s 94 mph fastball to a group including those who throw slower, we can compare him to his “peers.” 
I started at 92 mph and queried my database for groupings (92-92.99, 93-93.99, etc) all the way up until 98+ mph.  I figured 92 mph would be a solid starting point since the sample size would be extraordinarily large–large enough for four-seamers to overcome the two-seamers and cutters that may inevitably sneak in.  Anything 98 mph or higher was grouped together to ensure a large enough sample since, as you will see below, the higher the velocity, the smaller the sample:

Velocity   Sample   %
92 mph     41,157   31.4
93 mph     33,368   25.5
94 mph     24,315   18.6
95 mph     16,586   12.7
96 mph     9,245    7.1
97 mph     4,236    3.2
98+ mph    2,018    1.5

All of the sample sizes here were large enough for analysis.  Even though the 98+ group appears to be 1/20th the size of the 92 mph group, that speaks more for the latter than against the former.
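The grouping rule described above (92-92.99, 93-93.99, and so on, with everything 98 mph or higher lumped together) can be sketched as follows. The function name and the sample speeds are hypothetical.

```python
import math
from collections import Counter


def velocity_group(mph):
    """Map a pitch speed to its group label, or None if it is below 92 mph."""
    if mph < 92:
        return None
    if mph >= 98:
        return "98+"              # everything 98 mph or higher is one group
    return str(math.floor(mph))   # 92-92.99 -> "92", 93-93.99 -> "93", ...


# Hypothetical fastball speeds.
speeds = [91.8, 92.0, 92.99, 93.4, 96.2, 97.9, 98.0, 100.3]
counts = Counter(g for g in map(velocity_group, speeds) if g is not None)

print(dict(counts))
```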
Next, how do the movement components look for each group?

Velocity   Horiz.   Vert.
92 mph     -6.34    9.24
93 mph     -6.28    9.51
94 mph     -6.16    9.80
95 mph     -5.98    10.07
96 mph     -5.84    10.23
97 mph     -5.89    10.41
98+ mph    -6.03    10.38

It should be fairly apparent that the tendency is for horizontal movement to decrease and vertical movement to increase as the velocity increases, at least through 96 mph.  At 97 mph, both movement components increase.  At 98+ mph, the vertical movement stays stagnant while the horizontal movement jumps quite a bit.
The next area to discuss includes B%, K%, HR%, and BABIP:

Velocity   B%     K%     HR%    BABIP
92 mph     35.9   44.6   0.65   .302
93 mph     36.3   45.1   0.55   .303
94 mph     35.5   45.9   0.55   .292
95 mph     35.8   46.4   0.76   .303
96 mph     35.2   47.0   0.54   .291
97 mph     36.1   46.8   0.41   .273
98+ mph    33.9   49.3   0.69   .293

The percentage of balls doesn’t move too much until its dip of over two percentage points at 98+ mph.  The percentage of strikes, however, climbs fairly steadily.  There is no real discernible pattern in the home run percentages; the most came on 95 mph heaters while the fewest came on those registering 97 mph.

Speaking of the 97 mph group, notice anything odd?  Perhaps that their BABIP is .273, a full eighteen points below any other group?  Prior to getting the results I expected each group to fall somewhere in the .290-.310 range; that all of them did except the .273 struck me as very peculiar.

I spoke to several other analysts, all of whom initially mentioned small sample size syndrome, only to retract the assessment after learning the sample sizes in question.  The dropoff in home run percentage was tossed around, as well, since fewer home runs means more balls in play to be counted in the BABIP formula.  This is a “could be,” though, rather than a “definitely why.”  As was mentioned in these discussions, too, it could be nothing; perhaps there were more warning track flyballs that just missed leaving the yard as opposed to more weakly hit balls.
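
To make the home run argument concrete, here is a small sketch using the standard BABIP formula, (H - HR) / (AB - K - HR + SF). The counts are made-up round numbers, not values from the database; they only show the direction of the effect when a would-be home run dies on the warning track instead.

```python
def babip(h, hr, ab, k, sf=0):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    return (h - hr) / (ab - k - hr + sf)


# Hypothetical line: 100 AB, 30 hits, 5 of them home runs, 20 strikeouts.
baseline = babip(h=30, hr=5, ab=100, k=20)

# Same line, but 2 of the home runs become warning-track outs:
# hits and home runs both drop by 2, at-bats and strikeouts are unchanged.
fewer_hr = babip(h=28, hr=3, ab=100, k=20)

print(f"{baseline:.3f} -> {fewer_hr:.3f}")  # 0.333 -> 0.325
```

The numerator stays at 25 while the denominator grows from 75 to 77, so BABIP falls even though nothing about the quality of contact has changed.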

Now, while the 4,236 pitches at 97 mph constitute a large enough sample to analyze, the number of balls in play was not yet large enough to break into individual counts or locations.  When it does get big enough this could serve as a means of explanation; perhaps something in either or both does not jibe with the other velocity groups.  Of those with significance, however, there was a .263 BABIP on 0-0 counts, and a .286 BABIP on pitches in the middle of the strike zone.

Pizza Cutter, or “The Master of Statistical Reliability” as I like to call him (yeah, a nickname for a nickname), suggested that BABIP is one of those stats that is super-unreliable, even with my large sample of pitches.  I did a split-half reliability test, randomly splitting the sample in half, and calculating the BABIP of each half.  For those unfamiliar, this serves to test the reliability of the sample; if it truly is large enough then no matter how we cut the sample in half we will have fairly convergent results.  If the results were wildly divergent then we are dealing with an unreliable sample.  The BABIPs of the two groups were .271 and .275, which essentially threw that idea out of the window.
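
For anyone who wants to try the same check, a split-half test can be sketched in a few lines. This is a simplified version that treats BABIP as hits divided by balls in play (home runs already excluded); the function name and the 1,000-ball sample are assumptions for illustration, since the actual number of balls in play at 97 mph is not listed here.

```python
import random


def split_half_babip(outcomes, seed=None):
    """Randomly split ball-in-play outcomes in half; return each half's BABIP.

    outcomes: sequence of 1 (hit) or 0 (out), one entry per ball in play.
    """
    rng = random.Random(seed)
    shuffled = list(outcomes)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    first, second = shuffled[:mid], shuffled[mid:]
    return sum(first) / len(first), sum(second) / len(second)


# Hypothetical sample: 1,000 balls in play with 273 hits (a .273 BABIP).
outcomes = [1] * 273 + [0] * 727

half_a, half_b = split_half_babip(outcomes, seed=1)
print(f"{half_a:.3f} {half_b:.3f}")
```

Repeating the split with different seeds and looking at how far apart the two halves land gives a feel for how stable the estimate is.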

Something interesting to consider was how, in each of these tables, all patterns seemed to stop when they reached 97 mph or higher.  The horizontal movement increased instead of its decreasing trend; vertical movement decreased after its increase at 97; the percentage of strikes ceased increasing; and home runs reached their low.  Could be something, could be nothing, but interesting nonetheless.

For now I am going to chalk this BABIP drop up to extreme random statistical variation and hope that you loyal readers out there might chime in with some more ideas to investigate.  Otherwise, though, when gauging the movement components, percentage of balls/strikes/home runs, or even BABIP, we can compare individual pitchers to their “like-minded” averages by velocity grouping.  If I get enough feedback involving different aspects to measure regarding these fastballs we will look at that soon, in the next day or two.  Otherwise, next week I have something similar to this, looking at BABIP by movement.

Juuust A Bit Outside

On Thursday we took a look at the pitchers with the highest percentage of Pitch F/X-recorded pitches right down the middle of the plate.  I listed the top thirty out of the 165 pitchers with significant numbers and found that Ted Lilly of the Cubs has thrown the highest percentage; on top of that, the next pitcher on the list found himself relatively far off.  Today we are going to look at the opposite: The pitchers with the highest percentage of pitches outside the zone.
Now, outside the zone calls for four general parameters: very high, very low, outside to the left, and outside to the right.  I feel like I’m typing the Cha-Cha slide.
For now I am going to focus on the left/right parameters outside the strike zone, and we will explore high/low a bit later in the year as I have other ideas centering around those parameters.  As discussed previously, the strike zone on a general pitch location chart goes from -0.83 to 0.83 on the horizontal axis and 1.6 to 3.5 on the vertical axis.  To track pitches down the middle the axis numbers were set much smaller.  To track pitches outside the zone the horizontal axis numbers branch out in different directions.  For pitches outside to the left or right I set my database to give me all pitches with a PX (horizontal location in the data) less than -1.55 or greater than +1.55.
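
The horizontal cutoffs can be written as a simple predicate. This is a minimal sketch that assumes PX is in the same units as the strike zone bounds quoted above; the constant and function names are illustrative, and the "just off the plate" label for the gap between the two cutoffs is my own.

```python
ZONE_HALF_WIDTH = 0.83   # horizontal edge of the strike zone, from the text
OFF_PLATE_CUTOFF = 1.55  # cutoff used for "outside" pitches, from the text


def horizontal_region(px):
    """Classify a pitch by horizontal location only."""
    if px < -OFF_PLATE_CUTOFF:
        return "outside left"
    if px > OFF_PLATE_CUTOFF:
        return "outside right"
    if -ZONE_HALF_WIDTH <= px <= ZONE_HALF_WIDTH:
        return "over the plate"
    return "just off the plate"


def is_outside(px):
    """True for pitches counted as outside the zone to either side."""
    return abs(px) > OFF_PLATE_CUTOFF


if __name__ == "__main__":
    for px in (-1.8, -1.0, 0.0, 1.6):
        print(px, horizontal_region(px), is_outside(px))
```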
This provided me plenty of pitches to analyze but keep in mind that the data was not insanely consistent last year with regards to who gets recorded and where the recording takes place.  This year it has become more consistent and uniform but there may be data discrepancies due to some players having insufficient data.  For instance, Player A might be known to throw a ton of pitches out of the zone but, because the Pitch F/X system did not track many of his starts, he might not qualify. 
To help ensure the pitchers in the leaderboard below did not suffer from this sampling problem, I set a minimum of 240 raw pitches.  That certainly whittled the list down.  I then recorded the total tracked pitches for all remaining pitchers and sorted them by percentage instead of raw total.  Here are the top ten:
1) Livan Hernandez, 15.83%
2) Derek Lowe, 12.43%
3) Jake Peavy, 12.39%
4) Chad Gaudin, 11.56%
5) Braden Looper, 11.56%
6) John Smoltz, 11.40%
7) Jamie Moyer, 11.20%
8) Justin Germano, 11.01%
9) Jeff Francis, 10.73%
10) A.J. Burnett, 10.71%
I did not necessarily predict that Livan would be atop this leaderboard but, at the same time, it was not very surprising to find his name there, with a significant lead over the next pitcher, no less.  Moyer didn’t surprise me either as he’s a notorious “junkballer.”  Here are 11-20:
11) Jarrod Washburn, 10.59%
12) Carlos Zambrano, 10.05%
13) Shaun Marcum, 10.03%
14) Tim Hudson, 9.78%
15) Javier Vazquez, 9.66%
16) Kevin Millwood, 9.63%
17) Jose Contreras, 9.54%
18) Miguel Batista, 9.33%
19) Roy Halladay, 9.29%
20) Vicente Padilla, 9.14%
Something really interesting here is the emergence of Burnett, Marcum, and Halladay.  I noted in the comments on Thursday that, of pitchers with significant data, Burnett, Marcum, and Halladay were in the bottom ten in percentage of pitches thrown right down the middle; here they are in the top twenty in pitches thrown outside the zone.  I noted at FanGraphs a week or two ago that the Blue Jays rotation, arguably the best in the bigs both last year and this year, consisted of three guys (McGowan, Marcum, Litsch) who threw four or five different pitches at least 10% of the time, somewhat of an extreme rarity.  Additionally, Halladay has a potent three-pitch combo, and Burnett has a plus-fastball and plus-curveball.
Put together it seems like the Blue Jays pitchers are spreading their pitch selections quite liberally, rarely making mistakes in throwing the ball right down the middle, and not worrying about being outside the strike zone.  Perhaps this means nothing with regards to their performance, but it is interesting nonetheless that a rotation like this appears in the leaderboards in three different areas of selection/location.
As we get deeper into the season enough data will be compiled to look at both down the middle and outside pitches solely for 2008, when the data is tracked in each park.  For now, though, we’ll have to settle with Ted Lilly and Livan Hernandez.  If only those two faced each other this year.
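
For readers who want to rebuild a leaderboard like the one in this post, the filter-then-rank step can be sketched as follows. It assumes the 240-pitch minimum applies to the raw count of outside pitches, which is one reading of the description above; the function name and the three placeholder pitchers with made-up counts are purely illustrative.

```python
def outside_leaderboard(counts, min_outside=240):
    """Rank pitchers by percentage of tracked pitches thrown outside the zone.

    counts: {pitcher: (outside_pitches, total_tracked_pitches)}
    min_outside: minimum raw outside pitches needed to qualify (assumed
                 reading of the 240-pitch minimum).
    """
    rows = [
        (name, 100.0 * outside / total)
        for name, (outside, total) in counts.items()
        if outside >= min_outside and total > 0
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up counts for three placeholder pitchers.
counts = {
    "Pitcher A": (300, 2000),  # 15.0%
    "Pitcher B": (250, 2500),  # 10.0%
    "Pitcher C": (100, 500),   # 20.0%, but below the raw minimum
}

for rank, (name, pct) in enumerate(outside_leaderboard(counts), start=1):
    print(f"{rank}) {name}, {pct:.2f}%")
```

Pitcher C has the highest percentage but is dropped by the raw-count minimum, which is the point of the filter: a high rate on very few tracked pitches does not make the list.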

Right Down the Middle

Last week I took a look at the relationship between pitches and home runs, checking to see if there were any noticeable discrepancies between those that sail out of the stadiums and those that do not.  The results showed that fastballs turned into souvenirs when they came in with lesser velocities and movements as well as with poor location; breaking balls were hit out when they hung in the zone.

While conducting these analyses I became very interested in pursuing the idea of mistake pitches and balls thrown not just in the zone but right down the middle.  Of all the balls hit for home runs off the pitchers who have surrendered the most home runs this year, at least 80% were smack dab in the middle of the plate.  Since this piqued my interest I decided to check out which pitchers threw down the middle most often.
The strike zone, in Pitch F/X terms, is generally -0.85 to 0.85 on the horizontal axis and 1.6 to 3.5 on the vertical axis.  I went smaller, looking at pitches in the middle of that zone, as evidenced by this picture:

[Image: strikezone.JPG]


Probing my database for pitches in the smaller box–what I would consider to be down the middle–I found a ton of pitches.  Keep in mind, though, that the results below are from pitches tracked by the Pitch F/X system; there are some pitchers that might have a higher total or percentage but did not have the luxury of having their relevant data recorded.
I found 165 pitchers with a significant number of pitches down the middle.  Luckily, in terms of using neat/even numbers in a list, the top 30 percentages happened to consist of everyone with at least 14% of their pitches thrown down the middle.  Here are the top ten:
1) Ted Lilly, 18.6%
2) Paul Byrd, 16.7%
3) Josh Beckett, 16.3%
4) Micah Owings, 16.1%
5) Tim Lincecum, 15.9%
6) John Danks, 15.8%
7) Felix Hernandez, 15.7%
8) Greg Maddux, 15.5%
9) Joe Blanton, 15.5%
10) Justin Verlander, 15.4%
Lilly threw just about two percentage points more of his pitches down the middle than his closest competitor, whereas #2-#10 were separated by a total of 1.3 percentage points.  Numbers 11-20:
11) Andy Sonnanstine, 15.3%
12) Kevin Millwood, 15.2%
13) Cole Hamels, 15.1%
14) Aaron Harang, 15.0%
15) Brian Bannister, 14.8%
16) Daisuke Matsuzaka, 14.7%
17) Vicente Padilla, 14.7%
18) Matt Cain, 14.7%
19) Javier Vazquez, 14.7%
20) Randy Wolf, 14.6%
And the last group with at least 14% of their pitches down the middle:
21) Brad Penny, 14.5%
22) Roy Oswalt, 14.5%
23) Johan Santana, 14.4%
24) Nate Robertson, 14.3%
25) Ervin Santana, 14.2%
26) Miguel Batista, 14.2%
27) Jon Garland, 14.1%
28) John Lackey, 14.1%
29) CC Sabathia, 14.1%
30) Jarrod Washburn, 14.0%
Unfortunately, just as David Appelman found a couple of years ago, there is not much correlation between pitches thrown down the middle and, well, anything else at all.  I thought there might be something significant between down the middle pitches and line drives–it’s been theorized before that line drives might correlate quite well with mistake pitches–but, alas, there was not; at least not yet.
Additionally, I would like to explore this at the end of this season, or perhaps further into the year, when all pitchers would have the same (or close to it) amount of data recorded.  For now, though, at the very least, it’s somewhat interesting to see which pitchers throw the most down the middle.
On Saturday we will look at the opposite, pitchers who throw the most OUT of the zone and then compare the results (Balls, Called K, Swing K, etc) between pitches down the middle and those out of the zone.
