Creating A Retrosheet Database, Part I

Okay, I’m presuming that you’ve got an SQL database set up along the lines of the sabermetrician’s workbench I wrote a tutorial on. If you don’t, go do that and come back. I’m also assuming that you run Windows of some sort. I know almost nothing about getting DOS programs to work on other platforms and I know of no tools to parse the event files for any other platform. This is adapted from work by Tom Tango and Mat Kovach.

You are also going to be required to unzip files. Windows XP and later comes with the ability to unzip files, but I use 7-Zip. It’s not necessary, but it will save you a lot of time later.

Alright. Ready? Let’s do this.

Warning up front: this may be very time consuming. If your computer is slow, you may be best of running the loaders overnight and then checking on them the next morning.

Start off setting up a directory structure to house the necessary files. Make sure you have a lot of hard drive space free – I’d suggest at least 15gb, maybe more. Here’s the directory structure I use:

  • C:\Retrosheet
  • C:\Retrosheet\Data\
  • C:\Retrosheet\Data\zipped\
  • C:\Retrosheet\Data\unzipped\
  • C:\Retrosheet\Data\parsed\
  • C:\Retrosheet\common\
  • C:\Retrosheet\common\programs\
  • C:\Retrosheet\loaders\

There’s more, but that will do for now. If you would prefer to do it another way, that’s fine, but then you’re responsible for changing, oh, 200+ lines of batch processing and loader code.

Now download Chadwick and unzip the contents into:

C:\Retrosheet\common\programs

Now for the grunt work. Retrosheet maintains a page of all of the event files. Each year has its own page (before 1997, in fact, it was broken down by league.) On each page you will find a link that says “Entire _______ League.” Download that file into your C:\Retrosheet\Data\zipped\ folder. Yes, this is time consuming. Yes, it probably could be automated, although the process for doing so has eluded me. I only used the files from 1953 onward.

Now we want to unzip these files. If you downloaded 7-Zip, you can simply open the zipped folder, highlight all the zip files, and right click on them all. There should be a menu that says 7-Zip, which will have an option on it called “Extract Files…” You want to unzip them into the C:\Retrosheet\Data\unzipped\ folder.

Okay, now download
RetroSQL_Loaders.zip and unpack the files inside. There will be three batch files - $cwevent.bat, $cwgame.bat and $cwsub.bat. Place those in the same folder as the extracted files and run them. This will be time consuming, but you should be able to use your computer normally while they run.

Now, take the other files and put them into C:\Retrosheet\loaders\. Use SQLyog to load the files – the keyboard shortcut Ctrl-Shift-Q will let you excutue SQL commands from a file, or right-click in the left-hand pane and select “Restore From SQL Dump.” Run the one called retrosheet_tables.sql first – it doesn’t matter what database you select when running it, as it will create a database called “retrosheet” itself. (Warning – if you already have a database called that, be very careful, as you might lose tables.)

The other three files are loaders for the parsed files you just made. These can take a long time, and may tie up your computer. I suggest running the events query overnight.

These loader queries will populate backup tables for the games and events tables, not the actual tables themselves. Why? Because in order to make these tables work more efficiently, we want to partition them into smaller parts first. (These tables can be huge, and queries on them can be very slow.) To transfer the data from the backups into the real tables, use:

UPDATE games_bck
SET YEAR_ID = SUBSTR(GAME_ID,4,4);
 
UPDATE events_bck
SET YEAR_ID = SUBSTR(GAME_ID,4,4);
 
INSERT INTO games
SELECT * from games_bck;
 
INSERT INTO events
SELECT * from events_bck;

 

(Again, warning: these queries can take a loooooong time.)

And that should give you a fully armed and operational play-by-play database spanning over 50 years of baseball. Now, what to do with it? I’ve pulled some code I wrote a while back for you to play with and get an idea of what you can do with the SQL database. It creates yearly run expectancy tables:

CREATE TABLE re_zero
AS
SELECT    YEAR_ID
    , OUTS_CT
    , SUM(IF(BAT_FATE_ID>3,1,0))/SUM(IF(BAT_EVENT_FL = "T",1,0)) AS BAT_RE
    , SUM(IF(RUN1_FATE_ID>3,1,0))/SUM(IF(RUN1_ORIGIN_EVENT_ID > 0,1,0)) AS RUN1_RE
    , SUM(IF(RUN2_FATE_ID>3,1,0))/SUM(IF(RUN2_ORIGIN_EVENT_ID > 0,1,0)) AS RUN2_RE
    , SUM(IF(RUN3_FATE_ID>3,1,0))/SUM(IF(RUN3_ORIGIN_EVENT_ID > 0,1,0)) AS RUN3_RE
    , SUM(FATE_RUNS_CT + EVENT_RUNS_CT - IF(RUN3_FATE_ID>3,1,0) - IF(RUN2_FATE_ID>3,1,0) - IF(RUN1_FATE_ID>3,1,0) - IF(BAT_FATE_ID>3,1,0))/COUNT(1) AS FATE_RE
FROM retrosheet.events_copy e
GROUP BY YEAR_ID, OUTS_CT;
 
CREATE TABLE BASES_CD AS
SELECT DISTINCT START_BASES_CD
    , IF(RUN1_ORIGIN_EVENT_ID > 0,1,0) AS R
UN1
    , IF(RUN2_ORIGIN_EVENT_ID > 0,1,0) AS RUN2
    , IF(RUN3_ORIGIN_EVENT_ID > 0,1,0) AS RUN3
FROM retrosheet.events_copy
WHERE YEAR_ID = 2008
ORDER BY START_BASES_CD;
 
CREATE TABLE RE_TEMP AS
SELECT YEAR_ID
    , OUTS_CT
    , START_BASES_CD AS BASES_CD
    , BAT_RE+(RUN1_RE*RUN1)+(RUN2_RE*RUN2)+(RUN3_RE*RUN3)+FATE_RE AS RE
FROM bases_cd, re_zero;
 
CREATE TABLE RE
AS
SELECT * FROM (SELECT * FROM re_temp
UNION ALL
SELECT DISTINCT
YEAR_ID
    , 3 AS OUTS_CT
    , BASES_CD
    , 0 AS RE
FROM re_temp) a
ORDER BY YEAR_ID, OUTS_CT, BASES_CD;
 
CREATE INDEX re_idx
ON re(YEAR_ID,OUTS_CT,BASES_CD);

It’s a bit more complex than what you might see elsewhere – this is to address sample size issues with single-season run expectancy tables. Later on I can show you what you can do with RE tables – make your own linear weights? No problem! Baserunning evaluation? Easy cheesy!

In the meantime, please, think of anything you want to see code for. Any of my articles where I’ve used Retrosheet data is fair game, either here or at THT – if you want to see how I did it, just ask and I’ll see about cleaning up the code.

If you’re looking for additional places to go for info, try the RetroSQL list or the BaSQL wiki/forum.

About these ads

19 Responses to Creating A Retrosheet Database, Part I

  1. Graham Goldbeck says:

    Colin, is the subs.sql file supposed to be blank?

  2. Colin Wyers says:

    No. I’ll get to work on fixing it. That said, I’ve never used the subs table before, so you can get by without it in the meantime.

  3. john says:

    Yeah the sub files are blank and when I run these two queries I dont think I get anything
    UPDATE games_bckSET YEAR_ID = SUBSTR(GAME_ID,4,4);
    UPDATE events_bckSET YEAR_ID = SUBSTR(GAME_ID,4,4);
    Thanks for writing this. I been wanting to have a play by play database for so long now.

  4. Colin Wyers says:

    John,
    Check the _bck tables and see what the YEAR_ID entries look like. If they’re null, those queries didn’t work. If there are years in there, those queries worked.

  5. john says:

    Yeah the YEAR_ID entries are null. I might be doing something wrong. This is what I get when I try to run the update queries
    Error Code : 1064
    You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘= SUBSTR(GAME_ID,4,4)’ at line 1

  6. Colin Wyers says:

    I should have noticed sooner – there’s a missing space between _bck and SET. Something about the way the tables are formatted for the SQL code, I think.

  7. Dave says:

    Does the yearly run expectancy table query work as-is? Is retrosheet.events_copy supposed to be retrosheet.events_bck? The query gives an error using retrosheet.events_copy but runs with retrosheet.events_bck. But then it only creates the bases_cd table and not any of the others (such as re).

  8. Colin Wyers says:

    This is what I get for taking six-month old code and not testing or even reading it before posting. It should run on retrosheet.events. Once you’ve populated retrosheet.events and retrosheet.games, you can delete the _bck tables.
    Unfortunately I’m at work right now, but I’ll try to post an updated RE matrix script tonight (there’s some other things I’m noticing now) as well as something to actually use it on, like LWTS.

  9. Matty D says:

    So I created a table from the by event files, but it seems to be missing most batted ball locations. I’m even just looking at more recent years and it doesn’t seem to be there. Did I do something wrong or is that data fairly rare every year?

  10. Colin Wyers says:

    Hit location data is essentially limited to the Project Scoresheet data, ’89-’99. Most years it’s either spotty or simply absent.

  11. Matty D says:

    Ah, didn’t realize that. Thanks for the info.

  12. ChapelHeel66 says:

    Everything works fine until after I run the loader for the tables. The loaders for the other three files don’t work.
    Subs is empty (explained above)
    Games gives me an immediate error message as follows:
    Error occured at:2009-03-24 21:00:47
    Line no.:85
    Error Code: 29 – File ‘C:\Retrosheet\data\parsed\games1953.csv’ not found (Errcode: 2)
    I get the same error with the events loader, except it is line 149.
    There’s nothing in the /parsed directory.
    Also, what is Chadwick doing? There’s nothing in the instructions that asks us to use that program, so I’m confused about that. Should I be doing something with Chadwick to populate the /parsed directory?

  13. ChapelHeel66 says:

    Nevermind. I re-ran the batch files and it worked this time. Not sure why it didn’t created the parsed files to start with, but oh well.
    Thanks!

  14. ChapelHeel66 says:

    I spoke too soon. The batch files create files in the parsed folder, but the ones they create are empty. Just placeholders.
    Also, all the stuff about it taking a long time? None of that has happened. When I run the event batch file or game batch file, it takes less than a second.
    All the files from retrosheet (from 1953 NL to 2008 ML) have been unzipped and are in the unzipped folder. When I open those files, they look fine.

  15. ChapelHeel66 says:

    For those following instructions above, after you download Chadwick to your disk and move it to C:\Retrosheet\Common\Programs, you then have to run it. Do not use the default directory in the installer. Change the path to C:\Retrosheet\Common\Programs.
    However, the batch files are trying to call programs:
    C:\Retrosheet\common\programs\cwgame.exe
    C:\Retrosheet\common\programs\cwevent
    C:\Retrosheet\common\programs\cwsub
    After installing Chadwick, those programs are not in my C:\Retrosheet\common\programs directory. Chadwick does not install them.
    I even tried running them from the command prompt, and it did not work. They just aren’t there.

  16. ChapelHeel66 says:

    Alright, got it now.
    Folks, when you download from Chadwick, do not download v.0.4.0. Download the v.0.5.0.
    I know the message says the even version numbers are stable and the odd version numbers are unstable. That’s why I downloaded 0.4.0 originally. But if you do that, you get a standalone Chadwick program.
    Only 0.5.0 will give you the separate programs that are called upon by the batch files listed in the instructions above!

  17. Lucas says:

    I used the function ‘wget’ (to download http files automatically) within a bash program to save the zip files. Wget is a standard linux function but can also be found in a cygwin installation on windows.
    For example to get the NL files (run this while in the directory you want the files to be downloaded to):
    ——————————-
    #!/bin/bash
    http_add=http://www.retrosheet.org/year/yearnl.zip
    for ((y=1951;y

  18. Lucas says:

    Sorry. Should be:
    for ((y=1951;y

  19. weskelton says:

    Hi Colin,
    Don’t know if your still monitoring this thread. I recently tried to join BaSQL but was never confirmed. It looks kind of dead over there. Is it no longer an active forum?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: