Creating A Retrosheet Database, Part I
March 7, 2009 19 Comments
Okay, I’m presuming that you’ve got an SQL database set up along the lines of the sabermetrician’s workbench I wrote a tutorial on. If you don’t, go do that and come back. I’m also assuming that you run Windows of some sort. I know almost nothing about getting DOS programs to work on other platforms and I know of no tools to parse the event files for any other platform. This is adapted from work by Tom Tango and Mat Kovach.
You are also going to be required to unzip files. Windows XP and later comes with the ability to unzip files, but I use 7-Zip. It’s not necessary, but it will save you a lot of time later.
Alright. Ready? Let’s do this.
Warning up front: this may be very time consuming. If your computer is slow, you may be best of running the loaders overnight and then checking on them the next morning.
Start off setting up a directory structure to house the necessary files. Make sure you have a lot of hard drive space free – I’d suggest at least 15gb, maybe more. Here’s the directory structure I use:
There’s more, but that will do for now. If you would prefer to do it another way, that’s fine, but then you’re responsible for changing, oh, 200+ lines of batch processing and loader code.
Now download Chadwick and unzip the contents into:
Now for the grunt work. Retrosheet maintains a page of all of the event files. Each year has its own page (before 1997, in fact, it was broken down by league.) On each page you will find a link that says “Entire _______ League.” Download that file into your C:\Retrosheet\Data\zipped\ folder. Yes, this is time consuming. Yes, it probably could be automated, although the process for doing so has eluded me. I only used the files from 1953 onward.
Now we want to unzip these files. If you downloaded 7-Zip, you can simply open the zipped folder, highlight all the zip files, and right click on them all. There should be a menu that says 7-Zip, which will have an option on it called “Extract Files…” You want to unzip them into the C:\Retrosheet\Data\unzipped\ folder.
Okay, now download
RetroSQL_Loaders.zip and unpack the files inside. There will be three batch files - $cwevent.bat, $cwgame.bat and $cwsub.bat. Place those in the same folder as the extracted files and run them. This will be time consuming, but you should be able to use your computer normally while they run.
Now, take the other files and put them into C:\Retrosheet\loaders\. Use SQLyog to load the files – the keyboard shortcut Ctrl-Shift-Q will let you excutue SQL commands from a file, or right-click in the left-hand pane and select “Restore From SQL Dump.” Run the one called retrosheet_tables.sql first – it doesn’t matter what database you select when running it, as it will create a database called “retrosheet” itself. (Warning – if you already have a database called that, be very careful, as you might lose tables.)
The other three files are loaders for the parsed files you just made. These can take a long time, and may tie up your computer. I suggest running the events query overnight.
These loader queries will populate backup tables for the games and events tables, not the actual tables themselves. Why? Because in order to make these tables work more efficiently, we want to partition them into smaller parts first. (These tables can be huge, and queries on them can be very slow.) To transfer the data from the backups into the real tables, use:
(Again, warning: these queries can take a loooooong time.)
And that should give you a fully armed and operational play-by-play database spanning over 50 years of baseball. Now, what to do with it? I’ve pulled some code I wrote a while back for you to play with and get an idea of what you can do with the SQL database. It creates yearly run expectancy tables:
It’s a bit more complex than what you might see elsewhere – this is to address sample size issues with single-season run expectancy tables. Later on I can show you what you can do with RE tables – make your own linear weights? No problem! Baserunning evaluation? Easy cheesy!
In the meantime, please, think of anything you want to see code for. Any of my articles where I’ve used Retrosheet data is fair game, either here or at THT – if you want to see how I did it, just ask and I’ll see about cleaning up the code.