Building a sabermetrician’s workbench, part I
November 14, 2008 33 Comments
There are a lot of budding sabermetricians out there. (Actually, I’m sure that’s not true. But it sure seems that way. Of course, this is because I tend to hang out in places, like Tango’s blog and Beyond The Boxscore, that are popular hangouts for budding sabermetricians.)
Now let’s face it – life is rough out there on the streets, cadging data from websites and working with spreadsheets that start to look like rat’s nests. You’re tired, cold and hungry. You want to just put away your “Will Work For Data” signs and do some real research.
There is a better way! That way is… the relational database. A relational database offers you all the brilliant promise of tomorrow, and I don’t mean some sort of Total Recall tomorrow but a totally boss tomorrow, like an Epcot Center tomorrow without all the vague creepiness.
What I’m writing here is, or at least is supposed to be, the sort of thing I wished I had about a year ago – a step-by-step explanation of how to get started using a database to do sabermetric research.
I am not a professional DBA admin, or a programmer of any kind – what you are getting is the approach of a dedicated hobbyist but a hobbyist nonetheless. I may not follow “best practices.” I will not be explaining a lot of database theory or lecturing you on the importance of being “relational.” Quite frankly, I don’t know too much about that myself. And if you have your own database setup and workflow already, your setup may differ from mine. I hope you can still derive some value from this.
Getting a database
First, you need a database. I use MySQL. If you do a Google search, you may well find a lot of arguing over whether or not MySQL is any good. (For what it’s worth, a lot of large companies and government agencies use MySQL for serious work.) You’ll hear them argue about things like “transactions” and the ACID model. None of this applies to folks like you and I, trust me. And the best part – it’s absolutely free.
Here’s how to install it:
Download MySQL 5.1 CE Essentials. (It should be the first link under “Windows Downloads.” If you aren’t running Windows… well, I suspect you know more about your Linux/Mac/Amiga machine than I do, so I leave that as an exercise to the enterprising reader.) Then, once you’ve downloaded it, run the MSI file.
Choose a “typical” install.
There will be some ads. Skip over them. Make sure that “Configure the MySQL server now” is checked. Chose “Detailed configuration.”
Then you’ll go through a set of pages asking you questions. Here are the answers:
- Developer Machine
- Non-Transactional Database Only
- Decision Support (DSS)/OLAP
Continue on, using the defaults.
Either add a password or uncheck the box to alter security settings. (This will leave your root password blank.)
That should do it! From here on out, you shouldn’t have to start MySQL at all – it should be ever vigilant, waiting for you to command it to do your sabermetric bidding.
Talking to the database
Let’s face it – your time is valuable, and while you’re not exactly lazy you really don’t want to do any more work than you have to. (Well, maybe you are lazy; I really wouldn’t know.) So as much fun as it sounds to sit there and labor over a dark command prompt, what you really want is a graphical front end to pretty up your interactions with the database.
Again sticking with the theme of “absolutely free,” I use SQLyog. You want the “Windows-Binary” from the Community Edition. Installation is straightforward – just download it and click “Next” a few times. Then run SQLyog.
From there, you’ll be presented with a screen asking you for the settings to connect to the database. Everything should be correct – if you selected a password, type it into the password blank. Then hit “Connect.”
And… voila! You’ve been connected to, well, to an empty database. I suspect you want some actual data in your database.
Putting the “data” in database
There’s obviously a lot of different things you could want to have in your database, depending on your area of interest. And at some point, you may want to dive into the exciting world of play-by-play data ,from Retrosheet and Gameday. But let’s start off modestly, if not exactly small. Let’s start with the Baseball Databank. It’s a record of pretty much every official statistic for every baseball player in the history of the majors. It’s a great resource, and a great starting point.
First, download the ZIP file. Extract the .SQL file from the ZIP, and remember when you put it. Now go into SQLyog.
In SQLyog there will be three panes – the left-hand side should list your databases, the top is where you write queries, and the bottom shows the results. Let’s add a database. Right-click in the pane, and select “Create Database.” Call it “BDB,” leaving off the punctuation. Leave the rest as the defaults, and click okay.
Now, let’s populate the database. Right-click on the database, and select “Restore from SQL Dump.” There should be an button with an elipsis (…) in it. Click that, and select the file we extracted from the ZIP. Select Execute. It’ll ask you if you want to execute the queries in database BDB. You do. This should take a little while.
Once that’s done, go into your database and click on the folder that says Tables. If nothing is there, try hitting “F5” on the top row of your keyboard to refresh the browser.
Go ahead and click on batting, and then slide on over to the right and click on "Table Data.” Scroll around for a bit. If you’ve ever used a spreadsheet or looked at a table before, the layout should feel familiar. Now it’s time to play around with the database a little.
Asking the database questions
The fundamental unit of the database is the table, which is further broken down into the row and the column. Each row is a “record” in the database; you can have as many records as you like. Each column is given a specific name, which is how you address it.
You won’t be entering formulas into cells, like you would a spreadsheet. You interact with the data in the spreadsheet by writing queries. We’ll start off with the simplest query possible.
Slide up into the query editor and type in this:
There is a green play button in the upper left-hand corner; press it. (Or just hit F9.) You should get… well, to be rather anticlimactic, you got the same table we were just looking at. Most of the column headers should be pretty self-explanatory – H is hits, SF sacrifice flies, etc. playerID is each player’s unique identifier, taken from the first five letters of the last name and the first two letters of the first name, plus a number. Some are easier to grok than others.
[A little note: for your convenience, SQL commands will be highlighted in blue in all of the code samples. SQLyog provides even more highlighting for your code.]
- SELECT is your bread-and-butter SQL command; it tells the database to fetch records out of the table.
- * is a wildcard; it tells the database to fetch all columns.
- FROM tells the database what table to fetch from.
- The semicolon tells the database that our instruction is complete.
Let’s expand upon our code sample a bit now. Say we only want to look at the 1990 Royals:
If SELECT is our bread and butter, WHERE is our butterknife. It tells the database to only go and fetch the records we need. What we’re telling it is to only give us records where the yearID column reads 1990 and the teamID column reads KCA. The quotation marks are important – if you don’t use quotation marks, MySQL will assume you’re referring to a column of that name.
There’s a special value here: wherever it says (NULL), it’s telling us that cell is empty. In the batting table, NULLs are generally used to indicate:
- Players who appeared in a game but had no official plate appearance (generally relief pitchers, pinch-runners and defensive replacements).
- Statistics that were not tracked that year (generally sacrifice flies, double plays, etc.).
In this case, we really only want to see players who went to the plate. There’s a way to handle NULL values:
And that should give us the raw counting stats of everyone who played on the Royals that year. Suppose we want to know the “sabermetric triple crown” stats of batting average, on-base percentage and slugging percentage. (Tango’s sabermetric wiki is a great source of formulas to try out, incidently.)
You don’t have to break up the different lines; I do that for legibility. You do want to seperate each thing you want to select by commas. I put the commas at the start of each individual line, but that’s a personal style preference. SQL simply ignores carriage returns (that is to say, line breaks), tabs and extra spaces, so feel free to “pretty up” your code in a way that makes it easy for you to read.
SQL will do any simply arithmetic with the records you provide. (You can also do a lot more complex math with it if you really want to.) The AS command is very helpful – otherwise the column will end up being named “(H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF),” which simply isn’t very convenient.
Let’s say we don’t want every column, we just want a few columns:
Bored of the 1990 Royals yet? Let’s try something different.
ORDER BY tells the database to sort the list; Desc tells it to go from large to small. (Asc is the corresponding command to sort from small to large.) LIMIT tells SQL to only return the first 50 records.
So that’s the top 50 home-run seasons. (Supply your own asterisks where you feel it necessary.) But what if we want career totals?
The SUM command will sum everything the column unless you stop it. GROUP BY tells the database to combine everything based upon player ID.
Now what if want to look at the top 50 career batting averages? We obviously don’t want to see some pitcher who lucked his way into going 2 for 3 and never saw the majors again. So let’s say we want a minimum 3000 career plate appearances.
Because we’re using the SUM function, we have to use the HAVING function instead of the GROUP BY function. Note the >= – that’s greater than or equal to.
So, what’s that COALESCE doing there? Remember, sacrifice flies weren’t officially recorded until 1954. For years prior, those values are NULL in the database. Trying to sum a NULL value chucks everything out. Using the COALESCE function in this way tells the database to substitute 0 in the place of a NULL. Otherwise, players like Ty Cobb would be excluded from the list (precisely the sorts of players we don’t want excluded!)
We’re just scratching the surface here, but alas the rest will have to continue next week. A few quick notes and then I’ll leave you to your own devices for a while.
First, some helpful resources:
- The official MySQL reference manual
- W3Schools SQL Reference
- Tizag’s MySQL tutorial
- FirstSQL’s SQL tutorial
One final tip for SQLyog – if you right-click on a table, you’ll have options to export as a CSV file or copy to the clipboard (perfect for dumping data into an Excel spreadsheet.)
And now, a little challenge for you. In addition to the batting table, there’s a pitching table as well. Try writing a query to list the 50 lowest career earned run averages, minimum 1000 innings pitched. (Here’s a little tip: the Databank doesn’t list IP for pitchers, simply outs. Remember to divide by three.)
See you next week!