Database Toolbox: Tables

First, I want to commend Colin for Part I and Part II of his series on queries for relational databases. However, a part which I feel has been overlooked is proper design of the tables in your database. Poor design will complicate querying as well as maintenance and data entry for the database.

Currently I have copies of the Lahman database, Sean Foreman’s Baseball DataBank, KJOK’s ballparks, Westbay’s Japanese database, RetroSheet and Pitch f/x. All of them have their areas of concentration. None have a complete set of tables. Mostly they are not normalized. I greatly appreciate the efforts of the people who built these databases, but I am looking for a comprehensive design that has the tables and relationships that will allow me to store all the data in one, and will work for any set of data – majors, minor, college, amateur, you name it. And, I want to be able to update it daily from mlb.com’s GameDay files. It can be done, and I’ve been back working on it the past week. I’m not done yet, but I have made quite a bit of progress. I’m nearing completion of the tables for season level data, then will proceed to the game and pitch level. Meanwhile, I wrote this tutorial on tables to show my understanding of how relational databases should be constructed, which will then explain the logic and necessity behind my design.

Now, on to today’s lesson:

Tables are basically lists of things. I like to refer to two types of tables, objects and events. Objects are things, like people, teams and ballparks. You would want one row for each, then a column for each description of the object. Events happen, and involve combinations of objects, with columns to describe who, what, when, where. Most often, event tables lookup data referenced from object tables.

The first thing to do is make a list of the objects and events that will be a part of your database, and then start listing the descriptions of each. Players have a name, date and place of birth, date and place of death (if applicable), batting hand, throwing hand, height and weight. Umpires have a name, date and place of birth, and date and place of death. Those are all fields we just listed for players, although the players have additional fields. What players and umpires have in common is they are all people – so let’s create a Persons table that lists those things common to all persons, regardless of their role on the field. The Players and Umpires table will link to the Persons table, and can inherit all the columns listed there. Now, what about height and weight? Did Barry Bonds or Kevin Mitchell weigh the same in their 30’s as they did when they were rookies? Some data never changes (birthdate, unless someone stole their cousin’s birth certificate) and some data can change every year (weight, or even batting hand). Those things which can be different each season should be put in a table which would contain a record for each combination of year and player, team or ballpark.

Database professionals have devised a set of rules to “normalize” the tables in a database. The main goal is to eliminate redudancy – we don’t want the same piece of data entered more than once.

1st normal form

Don’t have more than one column doing the same function

Don’t have duplicate rows

Today, there are normally four umpires for each Major League game. In the post-season, there are six. We could create six columns, UmpHP, Ump1b, etc, in our Games table to hold the codes for the umpire assigned to that position during that game. During the regular season, the two columns for lf and rf will be left blanl. A hundred years ago, there may have only been two umpires, four columns will be left blank. What hapens if there are seven umpires? Instead, normalization requires a new table for umpire assignments, which would have as fields GameID, UmpireID and UmpirePos. With one record for each umpire in each game, there is no limit on the number of umpires, and there are no blank records.

To eliminate duplicate rows, databases will insist on having one or more fields defined as a “primary key”, the value of which will be unique in each record. There will be an option to autonumber the records (first entered has an ID of 1, second 2, etc), or the user can select data fields. For example, RetroSheet has given each Major League team, player and ballpark in history an alpha-numeric code. Because these codes are unique, they can be used as the primary key of the teams, players and ballparks tables.

Using codes instead of auto IDs is most helpful when merging sets of data. Suppose I have different databases for the majors, minors and Japan, all using the same table layout. Then I decide to merge these into one database. If the tables use autonumbers for their primary and foreign keys, team “1” in the major league db might be Altoona Mountain City (ALT), while team “1” in the Japan db might be Chiba Lotte Marines (CHB). The players on the Marines, who have “1” as their team number, may now be listed as being on Altoona. The codes can be designed so that all primary keys of lookup tables have unique values across all leagues, and thuis will maintain correct linkages after merges.

2nd normal form

Eliminate repeating values in a column

In our persons table, there’s a field for place of birth. 163 players have been born in Cuba, 15116 in the United States. We could repeat United States, or U.S., or US, or USA, more than fifteen thousand times, or we could put a single record in a new Countries table, and have every person link to that record. If a name is mispelled, it only needs to be corrected in one place. It’s much less of a hassle for the database administrator (you) to verify that there’s only one version of the name being used, so that a query asking for all the players born in “USA” wouldn’t miss the players in “US” or “U.S.” etc. Instead of the using the actual text in the CountryBorn field, it would hold a “Foreign Key”, a code or number which matches the Primary Key of the table being used as a lookup. The Countries table can then hold additional detail in it’s columns, such as Abbreviation and FullName.

3rd normal form

No columns calculated from other columns

If H and AB are already in the batting table, there is no need to also include BA as it is derived from the other fields. If there’s an error, and someone’s hits need to be changed, the BA also would need to be changed. If a new record is entered, the BA would have to be calculated before entering the data. Queries are the method of doing calculations based on the data in the tables. Many baseball databases put everything into the tables and have no queries, intending the user to browse the tables. MySQL calls queries “Views” for a reason – that is where you are meant to go to view the data.

There are three types of relationships between tables.

One to One

Since the late 1980’s, we’ve had basic pitch data available. First pitch was a ball, second was a strike, etc. A Pitches table would have PA_ID, PitchNum, Result. A couple years ago pitch f/x added more than a dozen new fields for each pitch. Instead of going back to the Pitches table and adding all those new fields, all of which will be blank for the years when pitch f/x was not available, we should create a new Pitchfx table, which acts as an extension of the Pitches table. Records in both tables are identified by PA_ID and PitchNum. There can be no more than one Pitchfx record for each Pitches record, and vice versa. One pitch, one pfx record.

One to Many

Each plate appearance can have one or more pitches, each with their own record and description in the Pitches table, but each of those pitches can only belong to one plate appearance. One PA, many pitches.

Many to Many

A player can be, at one time or another, a member of many teams. Teams have many players. To show these relationships, a third (event) table is needed to show all the combinations of the two object tables, showing PlayerID, TeamID, year and then whatever detail applies, such as batting or pitching statistics. Many players, many teams.

Here are some examples of MySQL code to generate the structure for a few tables. After the tables are created and linked, data can be inserted. At this time, I want to focus on the PRIMARY KEY and FOREIGN KEY statements. Please note that these tables are for illustrations only. They do not represent a complete operational database as some fields have been deleted for brevity, and some other tables are not shown.

The Batting table uses PRIMARY KEY (`Player_CD`,`Team_CD`,`Year`) to identify the three fields that in combination make each record unique. FOREIGN KEY (`Player_CD`,`Year`) REFERENCES Players (`Player_CD`,`Year`) relates the Batting table to the primary key of the Players table, to get information about that player in that year. The Players table in turn uses Player_CD to reference to Person_CD in the Persons table, where name and birth information is available.

The Batting table also uses Team_CD and Year to reference the same fields in TeamSeasons. TeamSeasons in turn references tables that can return, for the given year, the team’s league, location, name, and home ballpark. In this way team TBA can be the Tampa Bay Devils Rays in 2007, but then the Tampa Bay Rays in 2008.

DROP TABLE IF EXISTS `Batting`;

CREATE TABLE `Batting` (

`Player_CD` varchar(9) NOT NULL default ”,

`Year` smallint(4) unsigned NOT NULL default ‘0’,

`stint` smallint(2) unsigned NOT NULL default ‘0’,

`Team_CD` char(3) NOT NULL default ”,

`G` smallint(3) unsigned default NULL,

`AB` smallint(3) unsigned default NULL,

`R` smallint(3) unsigned default NULL,

`H` smallint(3) unsigned default NULL,

`DO` smallint(3) unsigned default NULL,

`TR` smallint(3) unsigned default NULL,

`HR` smallint(3) unsigned default NULL,

`RBI` smallint(3) unsigned default NULL,

`SB` smallint(3) unsigned default NULL,

`CS` smallint(3) unsigned default NULL,

`BB` smallint(3) unsigned default NULL,

`SO` smallint(3) unsigned default NULL,

`HBP` smallint(3) unsigned default NULL,

`IBB` smallint(3) unsigned default NULL,

`SH` smallint(3) unsigned default NULL,

`SF` smallint(3) unsigned default NULL,

`GDP` smallint(3) unsigned default NULL,

PRIMARY KEY (`Player_CD`,`Year`,`stint`),

FOREIGN KEY (`Player_CD`,`Year`) REFERENCES Players (`Player_CD`,`Year`),

FOREIGN KEY (`Team_CD`,`Year`) REFERENCES TeamSeasons (`Team_CD`,`Year`)

) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `Players`;

CREATE TABLE `Players` (

`Player_CD` varchar(10) NOT NULL default ”,

`Year` smallint(4) unsigned NOT NULL default ‘0’,

`bats` enum(‘L’,’R’,’B’) default NULL,

`throws` enum(‘L’,’R’,’B’) default NULL,

`weight` int(3) default NULL,

`height` double(4,1) default NULL,

PRIMARY KEY (`Player_CD`,`Year`),

FOREIGN KEY (`Player_CD`) REFERENCES PERSONS (`Person_CD`)

) ENGINE=MyISAM AUTO_INCREMENT=18543 DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `Persons`;

CREATE TABLE `Persons` (

`Person_CD` varchar(10) NOT NULL default ”,

`nameLast` varchar(50) NOT NULL default ”,

`nameFirst` varchar(50) default NULL,

`nameGiven` varchar(255) default NULL,

`birthYear` int(4) default NULL,

`birthMonth` int(2) default NULL,

`birthDay` int(2) default NULL,

PRIMARY KEY (`Person_CD`),

) ENGINE=MyISAM AUTO_INCREMENT=18543 DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `TeamSeasons`;

CREATE TABLE `TeamSeasons` (

`Year` smallint(4) unsigned NOT NULL default ‘0’,

`Franchise_CD` char(3) NOT NULL default ‘UNK’,

`Team_CD` char(3) NOT NULL default ”,

`TeamLocation_CD` char(3) default NULL,

`TeamSponsor_CD` char(3) default NULL,

`TeamName_CD` char(3) default NULL,

`League_CD` char(3) NOT NULL default ”,

`Division_CD` char(1) default NULL,

`Park_CD` char(5) default NULL,

PRIMARY KEY (`Team_CD`,`Year`),

FOREIGN KEY (`Franchise_CD`) REFERENCES Franchises (`Franchise_CD`),

FOREIGN KEY (`Team_CD`) REFERENCES Teams (`Team_CD`),

FOREIGN KEY (`TeamLocation_CD`) REFERENCES TeamLocations (`TeamLocation_CD`),

FOREIGN KEY (`TeamSponsor_CD`) REFERENCES TeamSponsors (`TeamSponsor_CD`),

FOREIGN KEY (`TeamName_CD`) REFERENCES TeamNames (`TeamName_CD`),

FOREIGN KEY (`League_CD`) REFERENCES Leagues (`League_CD`),

FOREIGN KEY (`Division_CD`) REFERENCES Divisions (`Division_CD`),

FOREIGN KEY (`Park_CD`,`Year`) REFERENCES ParkSeasons (`Park_CD`,`Year`)

) ENGINE=MyISAM DEFAULT CHARSET=latin1;

 

 

6 Responses to Database Toolbox: Tables

  1. Colin Wyers says:

    I am not sure the extent to which I agree with this. I understand the point of view being espoused here, but I don’t think it matches up very well with the goals that I, at least, have. I actually think that denormalization of the data is the way to go; I need to read up more on dimensional databases and fact tables.

  2. Brian Cartwright says:

    What is it that you don’t agree with?
    My personal goal is to get a comprehensive db that I can update with GameDay. I will share, and you are free to use it or not.
    I do think there is a problem with many of the existing db’s not being normal. Great sets of data, not so great db design. I plan on doing a comparison of these db’s, showing what they contain and what they don’t, and when they are not normal.
    I will read up on your link.

  3. Colin Wyers says:

    Because I don’t think the problems that normalization were meant to solve and the problems we’re trying to solve are the same problems. I’d hate to try to run queries on a fully-normalized Retrosheet play-by-play table!
    From a performance standpoint, normalization has real and tangible costs. You also see escalating query complexity. Given that few of us are professional DBAs and the overall similarity of what we do to data warehousing, I think less normalization – not more – is the answer.

  4. joe arthur says:

    You’re both right in a sense. This part of database design is a tradeoff process. Generally speaking the goal of database design is to achieve as normalized as possible a design, consistent with performance objectives. Denormalization should be allowed specifically as needed to achieve the desired performance objectives for the types of queries expected to be run. For a database designed with particular interests in mind, it is reasonable to seek some denormalization. But for a general purpose database such as the one Brian is trying to build, it may not be. It’s not rare to cope with this conflict by setting up one or more additional specialized (satellite) databases which are denormalized in the desired ways and which are refreshed as needed from the more fully normalized data warehouse. I’ve done this myself to decompose retrosheet plate appearance records into useful pitch by pitch records repeating the sort of outcome information which we tend to be interested in.
    Brian, in general a very astute and comprehensive write-up. I’ll just comment on a few details …
    1) One issue with several ramifications is data consistency. When I built cross references for different player IDs used by different sources, I was confronted by the issue that sometimes they disagreed about birth dates and heights and so on. I did not attempt to build a single comprehensive player master of my own in which I attempted to reconcile the discrepancies and create my own version of the truth. I kept all the different player masters in their own tables, using one as my primary by default – but therefore I did not insist on foreign key integrity directly to a player master table as you propose – instead the foreign key constraint is to my “master” xref_ids table. Beyond player masters, what do you intend to do when gameday and retrosheet disagree about the outcome of particular pitches or the total number of pitches thrown? This is the same kind of problem – either you commit to reconciling them and creating the Brian Cartwright version of what really happened, or you have to design your tables more flexibly to allow both sources of pitch by pitch information as alternate versions which you can access for different purposes.
    2) “codes instead of auto IDs is most helpful when merging sets of data” – actually I would argue that your use of auto IDs should be a last resort solution regardless of any interest in data merges. Which is more useful, to know that you are looking at a record for team 14, or “TOR”? or player 150484 or “wellsve01” ? There’s an auto_id component in the latter, but you can usually guess that you are looking at data that is supposed to be associated with Vernon Wells. It is an aid to data quality to use natural keys as much as feasible. You have a better idea what you’re looking at, you make any manual data entry errors less likely because you’re typing something familiar and meaningful, and it may not be necessary to write an extra join just to expand an auto ID into something recognizable. Finally with auto IDs as your only unique key you can get unintentional duplication – you wouldn’t get a duplicate key error in your fielding database if you happen to enter information for “Mark McLemore” Oakland 2004 “2B” twice – but if you use these or similar natural values to make at least an alternate unique key, you get that protection.
    3) your heavy use of foreign keys and normalization (and I agree with that – with the stricture to avoid auto increment IDs as much as possible) implies some transactional activity. In your schema, in 2009 when you encounter a new player at mlb.com or milb.com, you will need to insert successfully into your persons table first before inserting “event” records which depend on that table. In all your table creation examples you specify the non-transactional myISAM storage engine; you may want to investigate using InnoDB. To Colin’s original point, some denormalization may be appropriate on the input side as well as for the output (query) side which he brought up. It’s possible you will find it frustratingly slow to insert data into fully normalized tables.

  5. Brian Cartwright says:

    I haven’t finished reading the Data Warehouse article yet, but I get the gist of it so far.
    I’m not insisting on 100% normalization, but I want the readers to understand what the rules are, explained in plain English, and why they are important.
    I’ve been working most recently with the Baseball DataBank. There’s no table with just a list of teams, but in the batting, team and other tables, there’s a field which holds the RetroSheet TeamID. This is good. I can link to someone else’s team table that uses the same coding.
    However, even though RetroSheet has ballpark IDs as well, in the park field the codes are not listed, insteading using the full text of the ballpark name. This is not good. When there’s more than one scheduled home park for a team, the two names are in the same field, seperated by a comma. This is really not good. Strict normalization would have a new table, listing year, team & park. I’m OK with having two fields, park and alt_park. It’s not a field I think I would do any stat querying on. If these are instead filled with the RetroSheet codes, it can link to KJOK’s park and parkseason tables. The parkseason table can then have a field for the name of the ballpark during that season.
    If, at a minimum, we use the RetroSheet codes where they exist, then different db’s can act as modules that can be link to tables from other db’s.
    So now that’s probably half my next article

  6. Brian Cartwright says:

    Joe, re your #3 – the code was pasted (with modifications) from Baseball DataBank. I overlooked the database engine, thanks for pointing that out.
    And yes, when inserting records, correct order is necessary, as you can’t link to a record that hasn’t been created yet. That will be part of the script which parses GameDay into the database.
    Thanks for some excellent points

Leave a reply to Colin Wyers Cancel reply