Retrosheet Game Logs
December 19, 2008 10 Comments
This weekend I was planning on updating my park factors, in preparation of Part II of “Different Factors For Different Folks”. I downloaded all the new event files from the RetroSheet website, not just 2008, but 1999, 1953 and all the updates and corrections for the other seasons. I then spent more than a few minutes waiting for my query to build a table for Split Data, and when it finished I discovered there was no 2008 data to be found.
I forgot to update the Retroheet game logs as well. With no 2008 entries in my Games table, there could be no matching output from the query. I then went back and downloaded all the game logs, but when I opened one to have a look, it didn’t match my table’s data structure.
There are actually two different versions of game logs available from RetroSheet data, and it’s important to know the differences and how to acess them. And I wasn’t using either one of them, as I will explain later.
When you go to RetroSheet’s data download section, they have game logs with an entry for every one of the 193,866 games played in major league history. There are fields for date, site, home team, away team, etc…but no game ID. The other logs are generated from information contained in the header section of each game in the event files. Ted Turoczy’s Chadwick includes cwgame.exe, which will extract the game info from the event files, and create a record for each game for which there are events on file. There is a field for game ID, plus a lot of fields the game logs don’t contain. However, the date field doesn’t have the “19″ or “20″ listed at the beginning of the year field – once event files for games from the 1800′s are available, they will have same dates as games 100 years later in the 1900′s.
Last year, I recognized some of these problems, and stitched the two files together to get the games table I have been using. Now I want to update my table, but neither of the original source files matches my table. Shovel the front walk off for the wife, go for a two mile walk around the neighborhood, then shower up, and I had the answer, which I will share in order that you don’t have to repeat my past mistakes.
When you have files created by others that you periodically wish to import and update, receive them in a table which exactly matches their existing data structure. Then, anytime in the future, you can append records, or overwrite the whole thing when you need to correct or update. Tom Tango has written the code for a blank CWGAME table at his wiki. I couldn’t find an equivalent for the Retro game logs, so I had to do it myself. There are 161 fields, and documentation at the RetroSheet website had a description for each. Where the fields were in common with CWGAME, I used the same name and datatype as Tango. Where the fields were unique, I was on my own, but I tried to follow the same naming conventions that Tango established. Also, there are 139 different files in the game logs, one for each season. If you don’t want to do 139 imports, at a command prompt type “copy gl*.txt gamelogs.txt”. This will copy all the files starting with “gl” and ending in “.txt” into one file, which is then the one you will import.
Once you have any original source files imported as-is, you might decide that they are not arranged in the way you would do it. Stuff isn’t normal. There are fields that aren’t necessary. You want some things from one table, and some from the other. When you want to manipulate the data, use a query. That way, when you come back to do the same thing next year, you only need to rerun the query and overwrite last year’s result, instead of comparing the files and trying to remember what you did the last time.
The Retro game logs contain a record for every game ever played in the major leagues. CWGAME outputs a record for every game where there are events on file. I would prefer to establish a one-to-one relationship between these two tables, so that CWGAME is an extension to the Game Logs, providing additional information where available.
GAME_ID is the primary key for CWGAME, but it does not exist as a seperate field in the Retro game logs. It must be created as HOME_TEAM_ID+GAME_DT+GAME_CT_CD. If this key is created only as part of a query, I do not know how to link the tables in a LEFT JOIN manner which would return all the fields in Game Logs and the matching fields in CWGAME. My solution is to use a make table query to create a modified version of the Game Logs, which is then linked to CWGAME.
Neither of the games tables has a field for year, which makes life much easier when coding many common queries. CWGAME can not be used, as it has a truncated date field which does not have left left two characters of the year. The Game Logs do have a full date. In Access I used the functions Left(GamesLOG.GAME_DT,4) AS [YEAR], Mid(GamesLOG.GAME_DT,5,2) AS [MONTH], Right(GamesLOG.GAME_DT,2) AS [DAY] to parse the field into three.
Once GAME_ID and YEAR have been added to a midified Game Logs table, it is ready to go, with CWGAME linked to it using the GAME_ID as a foreign key. Remember to use LEFT JOIN instead of INNER JOIN, so that you can return all records from Game Logs and the matching recods from CWGAME. If you wish, you can further modify the Game Logs table, mixing and matching the fields from the two input tables in whatever combination you wish, perhaps eliminating the need for a JOIN.
The Data Definition (DDL) code for CWGAME is at Tom Tango’s site (scroll down to the Games table). To use in Access, change “varchar” to “text”. Here is my DDL code for the Game Logs table in MySQL and Access format. To run the DDLs in Access, open a new query in Design View. Do not add any tables. Switch to SQL view. Paste in the DDL code, then save under a name of your choosing. To create a blank table, double click on the DDL query you’ve created.