Database Toolbox: Tables
December 5, 2008 6 Comments
First, I want to commend Colin for Part I and Part II of his series on queries for relational databases. However, a part which I feel has been overlooked is proper design of the tables in your database. Poor design will complicate querying as well as maintenance and data entry for the database.
Currently I have copies of the Lahman database, Sean Foreman’s Baseball DataBank, KJOK’s ballparks, Westbay’s Japanese database, RetroSheet and Pitch f/x. All of them have their areas of concentration. None have a complete set of tables. Mostly they are not normalized. I greatly appreciate the efforts of the people who built these databases, but I am looking for a comprehensive design that has the tables and relationships that will allow me to store all the data in one, and will work for any set of data – majors, minor, college, amateur, you name it. And, I want to be able to update it daily from mlb.com’s GameDay files. It can be done, and I’ve been back working on it the past week. I’m not done yet, but I have made quite a bit of progress. I’m nearing completion of the tables for season level data, then will proceed to the game and pitch level. Meanwhile, I wrote this tutorial on tables to show my understanding of how relational databases should be constructed, which will then explain the logic and necessity behind my design.
Now, on to today’s lesson:
Tables are basically lists of things. I like to refer to two types of tables, objects and events. Objects are things, like people, teams and ballparks. You would want one row for each, then a column for each description of the object. Events happen, and involve combinations of objects, with columns to describe who, what, when, where. Most often, event tables lookup data referenced from object tables.
The first thing to do is make a list of the objects and events that will be a part of your database, and then start listing the descriptions of each. Players have a name, date and place of birth, date and place of death (if applicable), batting hand, throwing hand, height and weight. Umpires have a name, date and place of birth, and date and place of death. Those are all fields we just listed for players, although the players have additional fields. What players and umpires have in common is they are all people – so let’s create a Persons table that lists those things common to all persons, regardless of their role on the field. The Players and Umpires table will link to the Persons table, and can inherit all the columns listed there. Now, what about height and weight? Did Barry Bonds or Kevin Mitchell weigh the same in their 30′s as they did when they were rookies? Some data never changes (birthdate, unless someone stole their cousin’s birth certificate) and some data can change every year (weight, or even batting hand). Those things which can be different each season should be put in a table which would contain a record for each combination of year and player, team or ballpark.
Database professionals have devised a set of rules to “normalize” the tables in a database. The main goal is to eliminate redudancy – we don’t want the same piece of data entered more than once.
1st normal form
Don’t have more than one column doing the same function
Don’t have duplicate rows
Today, there are normally four umpires for each Major League game. In the post-season, there are six. We could create six columns, UmpHP, Ump1b, etc, in our Games table to hold the codes for the umpire assigned to that position during that game. During the regular season, the two columns for lf and rf will be left blanl. A hundred years ago, there may have only been two umpires, four columns will be left blank. What hapens if there are seven umpires? Instead, normalization requires a new table for umpire assignments, which would have as fields GameID, UmpireID and UmpirePos. With one record for each umpire in each game, there is no limit on the number of umpires, and there are no blank records.
To eliminate duplicate rows, databases will insist on having one or more fields defined as a “primary key”, the value of which will be unique in each record. There will be an option to autonumber the records (first entered has an ID of 1, second 2, etc), or the user can select data fields. For example, RetroSheet has given each Major League team, player and ballpark in history an alpha-numeric code. Because these codes are unique, they can be used as the primary key of the teams, players and ballparks tables.
Using codes instead of auto IDs is most helpful when merging sets of data. Suppose I have different databases for the majors, minors and Japan, all using the same table layout. Then I decide to merge these into one database. If the tables use autonumbers for their primary and foreign keys, team “1″ in the major league db might be Altoona Mountain City (ALT), while team “1″ in the Japan db might be Chiba Lotte Marines (CHB). The players on the Marines, who have “1″ as their team number, may now be listed as being on Altoona. The codes can be designed so that all primary keys of lookup tables have unique values across all leagues, and thuis will maintain correct linkages after merges.
2nd normal form
Eliminate repeating values in a column
In our persons table, there’s a field for place of birth. 163 players have been born in Cuba, 15116 in the United States. We could repeat United States, or U.S., or US, or USA, more than fifteen thousand times, or we could put a single record in a new Countries table, and have every person link to that record. If a name is mispelled, it only needs to be corrected in one place. It’s much less of a hassle for the database administrator (you) to verify that there’s only one version of the name being used, so that a query asking for all the players born in “USA” wouldn’t miss the players in “US” or “U.S.” etc. Instead of the using the actual text in the CountryBorn field, it would hold a “Foreign Key”, a code or number which matches the Primary Key of the table being used as a lookup. The Countries table can then hold additional detail in it’s columns, such as Abbreviation and FullName.
3rd normal form
No columns calculated from other columns
If H and AB are already in the batting table, there is no need to also include BA as it is derived from the other fields. If there’s an error, and someone’s hits need to be changed, the BA also would need to be changed. If a new record is entered, the BA would have to be calculated before entering the data. Queries are the method of doing calculations based on the data in the tables. Many baseball databases put everything into the tables and have no queries, intending the user to browse the tables. MySQL calls queries “Views” for a reason – that is where you are meant to go to view the data.
There are three types of relationships between tables.
One to One
Since the late 1980′s, we’ve had basic pitch data available. First pitch was a ball, second was a strike, etc. A Pitches table would have PA_ID, PitchNum, Result. A couple years ago pitch f/x added more than a dozen new fields for each pitch. Instead of going back to the Pitches table and adding all those new fields, all of which will be blank for the years when pitch f/x was not available, we should create a new Pitchfx table, which acts as an extension of the Pitches table. Records in both tables are identified by PA_ID and PitchNum. There can be no more than one Pitchfx record for each Pitches record, and vice versa. One pitch, one pfx record.
One to Many
Each plate appearance can have one or more pitches, each with their own record and description in the Pitches table, but each of those pitches can only belong to one plate appearance. One PA, many pitches.
Many to Many
A player can be, at one time or another, a member of many teams. Teams have many players. To show these relationships, a third (event) table is needed to show all the combinations of the two object tables, showing PlayerID, TeamID, year and then whatever detail applies, such as batting or pitching statistics. Many players, many teams.
Here are some examples of MySQL code to generate the structure for a few tables. After the tables are created and linked, data can be inserted. At this time, I want to focus on the PRIMARY KEY and FOREIGN KEY statements. Please note that these tables are for illustrations only. They do not represent a complete operational database as some fields have been deleted for brevity, and some other tables are not shown.
The Batting table uses PRIMARY KEY (`Player_CD`,`Team_CD`,`Year`) to identify the three fields that in combination make each record unique. FOREIGN KEY (`Player_CD`,`Year`) REFERENCES Players (`Player_CD`,`Year`) relates the Batting table to the primary key of the Players table, to get information about that player in that year. The Players table in turn uses Player_CD to reference to Person_CD in the Persons table, where name and birth information is available.
The Batting table also uses Team_CD and Year to reference the same fields in TeamSeasons. TeamSeasons in turn references tables that can return, for the given year, the team’s league, location, name, and home ballpark. In this way team TBA can be the Tampa Bay Devils Rays in 2007, but then the Tampa Bay Rays in 2008.
DROP TABLE IF EXISTS `Batting`;
CREATE TABLE `Batting` (
`Player_CD` varchar(9) NOT NULL default ”,
`Year` smallint(4) unsigned NOT NULL default ’0′,
`stint` smallint(2) unsigned NOT NULL default ’0′,
`Team_CD` char(3) NOT NULL default ”,
`G` smallint(3) unsigned default NULL,
`AB` smallint(3) unsigned default NULL,
`R` smallint(3) unsigned default NULL,
`H` smallint(3) unsigned default NULL,
`DO` smallint(3) unsigned default NULL,
`TR` smallint(3) unsigned default NULL,
`HR` smallint(3) unsigned default NULL,
`RBI` smallint(3) unsigned default NULL,
`SB` smallint(3) unsigned default NULL,
`CS` smallint(3) unsigned default NULL,
`BB` smallint(3) unsigned default NULL,
`SO` smallint(3) unsigned default NULL,
`HBP` smallint(3) unsigned default NULL,
`IBB` smallint(3) unsigned default NULL,
`SH` smallint(3) unsigned default NULL,
`SF` smallint(3) unsigned default NULL,
`GDP` smallint(3) unsigned default NULL,
PRIMARY KEY (`Player_CD`,`Year`,`stint`),
FOREIGN KEY (`Player_CD`,`Year`) REFERENCES Players (`Player_CD`,`Year`),
FOREIGN KEY (`Team_CD`,`Year`) REFERENCES TeamSeasons (`Team_CD`,`Year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
DROP TABLE IF EXISTS `Players`;
CREATE TABLE `Players` (
`Player_CD` varchar(10) NOT NULL default ”,
`Year` smallint(4) unsigned NOT NULL default ’0′,
`bats` enum(‘L’,'R’,'B’) default NULL,
`throws` enum(‘L’,'R’,'B’) default NULL,
`weight` int(3) default NULL,
`height` double(4,1) default NULL,
PRIMARY KEY (`Player_CD`,`Year`),
FOREIGN KEY (`Player_CD`) REFERENCES PERSONS (`Person_CD`)
) ENGINE=MyISAM AUTO_INCREMENT=18543 DEFAULT CHARSET=latin1;
DROP TABLE IF EXISTS `Persons`;
CREATE TABLE `Persons` (
`Person_CD` varchar(10) NOT NULL default ”,
`nameLast` varchar(50) NOT NULL default ”,
`nameFirst` varchar(50) default NULL,
`nameGiven` varchar(255) default NULL,
`birthYear` int(4) default NULL,
`birthMonth` int(2) default NULL,
`birthDay` int(2) default NULL,
PRIMARY KEY (`Person_CD`),
) ENGINE=MyISAM AUTO_INCREMENT=18543 DEFAULT CHARSET=latin1;
DROP TABLE IF EXISTS `TeamSeasons`;
CREATE TABLE `TeamSeasons` (
`Year` smallint(4) unsigned NOT NULL default ’0′,
`Franchise_CD` char(3) NOT NULL default ‘UNK’,
`Team_CD` char(3) NOT NULL default ”,
`TeamLocation_CD` char(3) default NULL,
`TeamSponsor_CD` char(3) default NULL,
`TeamName_CD` char(3) default NULL,
`League_CD` char(3) NOT NULL default ”,
`Division_CD` char(1) default NULL,
`Park_CD` char(5) default NULL,
PRIMARY KEY (`Team_CD`,`Year`),
FOREIGN KEY (`Franchise_CD`) REFERENCES Franchises (`Franchise_CD`),
FOREIGN KEY (`Team_CD`) REFERENCES Teams (`Team_CD`),
FOREIGN KEY (`TeamLocation_CD`) REFERENCES TeamLocations (`TeamLocation_CD`),
FOREIGN KEY (`TeamSponsor_CD`) REFERENCES TeamSponsors (`TeamSponsor_CD`),
FOREIGN KEY (`TeamName_CD`) REFERENCES TeamNames (`TeamName_CD`),
FOREIGN KEY (`League_CD`) REFERENCES Leagues (`League_CD`),
FOREIGN KEY (`Division_CD`) REFERENCES Divisions (`Division_CD`),
FOREIGN KEY (`Park_CD`,`Year`) REFERENCES ParkSeasons (`Park_CD`,`Year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;