Building a sabermetrician's workbench, part II

Okay, hopefully you’ve had a chance to play around with the stuff in Part I and know just enough SQL to be dangerous. If you haven’t read Part I yet at all… go get dangerous and come back when you’re ready.

I left everybody with a challenge:

In addition to the batting table, there’s a pitching table as well. Try writing a query to list the 50 lowest career earned run averages, minimum 1000 innings pitched. (Here’s a little tip: the Databank doesn’t list IP for pitchers, simply outs. Remember to divide by three.)

If you haven’t figured it out yet, and are intent on doing it yourself, I’d stop reading right about now. If you did figure it out, or simply want the answer, here’s one way of doing it:

   1: SELECT playerID
   2:     , SUM(ER)/SUM(IPOuts/3)*9 AS ERA
   3:     , SUM(IPOuts/3) AS IP
   4: FROM pitching
   5: GROUP BY playerID
   6: HAVING IP >= 1000
   7: ORDER BY ERA Asc
   8: LIMIT 50;

If you run this query, you’ll have a lot harder time recognizing the top leaders than the top batting average leaders. A lot of these guys were deadball-era pitchers. So let’s find these guys’ names.

Working with joins, a first glance

Often, all of the data you want to use won’t be stored in a single table – you’ll have some data in one table, and some data in another. That’s where joins come into play. A join lets you hook two (or more) tables together for querying.

There are two ways to do a join: explicitly and implicitly. When I say an explicit join, I mean a join where you use the JOIN command. An implicit join is one where MySQL is just supposed to figure out that you want it to join. For right now, we’ll start off with the implicit join.

In this case, what we want are the first and last names out of the master table. In this case, what we’re really after is a player’s full name. So let’s put that together:

   1: SELECT CONCAT(nameFirst," ",nameLast) AS Name, playerID FROM master

CONCAT is a fun command for string manipulation. It lets you take multiple strings and combine them together. In this case, we took the first name, a space (surrounded by quotes) and a last name, separating them with commas. CONCAT does the rest.

Instead of joining to the entire master table, we’re going to join to the results of that query using an alias.

In SQL, you use the period separator to indicate what table you want to take data from, like so:

table.column

where table is the name of the table and column is the name of the column. You can also assign an alias to a query when you use it in the FROM command, like so:

   1: SELECT p.playerID
   2:     , m.Name
   3:     , SUM(p.ER)/SUM(p.IPOuts/3)*9 AS ERA
   4:     , SUM(p.IPOuts/3) AS IP
   5: FROM pitching p
   6:     , (SELECT CONCAT(nameFirst," ",nameLast) AS Name
   7:         , playerID FROM master) m
   8: WHERE p.playerID = m.playerID 
   9: GROUP BY playerID
  10: HAVING IP >= 1000
  11: ORDER BY ERA Asc
  12: LIMIT 50;

So now, instead of seeing that walshed01 is the career ERA leader, we can see that it’s Ed Walsh. It’s an improvement, right?

The key is in the WHERE clause, which matches up a column from one table with the other. The (mostly) equivalent explicit join is:

   1: SELECT * FROM (SELECT playerID
   2:      , SUM(ER)/SUM(IPOuts/3)*9 AS ERA
   3:      , SUM(IPOuts/3) AS IP
   4: FROM pitching
   5: GROUP BY playerID
   6: HAVING IP >= 1000) p
   7: JOIN (SELECT CONCAT(nameFirst," ",nameLast) AS Name
   8:          , playerID FROM master) m
   9: ON p.playerID = m.playerID
  10: ORDER BY ERA Asc
  11: LIMIT 50;

We end up with a second, unwanted ID column, but otherwise it’s pretty much the same thing.

You want to be careful when writing the WHERE or ON clause of a join – if you aren’t careful, you’ll end up doing what’s called a Cartesian join, or essentially joining every row in one table to every row in another. For those of you keeping score at home, that would mean joining 39,016 rows from the pitching table to 17,264 rows from the master table, giving you… 673,572,224 rows as a result.

You do not want that.

We’re far from done with joins – they’re quite possibly the most powerful (and most dangerous!) tool in your SQL arsenal. But for right now let’s move along to some other things. We’ll be back.

Dealing with data types

I want you to see this at least once before it creeps up on you by accident. I do not, on the other hand, want to be the sort of tutorial writer who hands you a code sample and then, when it doesn’t work, shouts “Gotcha!” So I’ll be clear with you: This doesn’t work right. Let’s say you want  the top 50 player seasons with more runs scored than runs batted in.

Try running this:

   1: SELECT playerID, yearID, teamID, R-RBI AS R_RBI
   2: FROM batting
   3: ORDER BY R_RBI Desc
   4: LIMIT 50;

We are being lied to here. Nobody, but nobody, has 18,446,744,073,709,551,615 more R than RBI in a single season. I don’t even know how to say that number.

The problem is with data types. SQL has an awful lot of data types, but we’re only going to look at a few of the more common ones.

Text is generally stored as “varchar,” or “Variable Character Field.” What that means is simply that you can store any length of text in there up to the size of the field. There are some other text formats, but nothing to really be worried about right now.

There’s a much greater diversity of number formats, which can be broken down into three types:

  1. Integers, or whole numbers.
  2. Decimals, for precise fractional values.
  3. Floats, for approximate fractional values.

These can be further subdivided – you’ll see values like “tinyint” or “smallint” thrown around, for example, as ways of describing the size of the number you can store in that column.

There’s also the issue of signs which becomes important here. For SQL, what we mean by signs is the plus/minus sign – whether a number is positive or negative.

You can store numbers in MySQL either as “signed” or “unsigned”. Unsigned values cannot be negative. The Baseball Databank stores all numbers as unsigned.

(Why? Because it saves storage space and time.)

So in order for you to deal with cases where values may be negative, you have to change the datatype before doing any calculations. We do that with a CAST function:

   1: SELECT b.playerID
   2:     , m.Name
   3:     , b.yearID
   4:     , b.teamID
   5:     , CAST(R-RBI AS SIGNED) AS R_RBI
   6: FROM batting b
   7:     , (SELECT CONCAT(nameFirst," ",nameLast) AS Name
   8:         , playerID FROM master) m
   9: WHERE b.playerID = m.playerID
  10: ORDER BY R_RBI Desc
  11: LIMIT 50;

That should make considerably more sense. Now let’s say we’re just interested in the post-WWII years. Simple:

   1: SELECT b.playerID
   2:     , m.Name
   3:     , b.yearID
   4:     , b.teamID
   5:     , CAST(R-RBI AS SIGNED) AS R_RBI
   6: FROM batting b
   7:     , (SELECT CONCAT(nameFirst," ",nameLast) AS Name
   8:         , playerID FROM master) m
   9: WHERE b.yearID > 1943
  10: AND b.playerID = m.playerID
  11: ORDER BY R_RBI Desc
  12: LIMIT 50;

Remember to think about the order of your WHERE statements when using an implicit join. Try to filter out as many rows as possible before you start linking the tables – that will make your joins more efficient.

The heart of the CASE

There are times when you want your query to be able to make some decisions on its own. Let’s say , for instance, that we want to look at pitchers between the ages of 25 and 29, post WW-II. Remember, a player’s age is generally listed as his age come July 1st. So, a simple little query here:

   1: SELECT p.playerID
   2:     , m.Name
   3:     , p.yearID
   4:     , (CASE WHEN m.birthMonth < 7
   5:         THEN ( p.yearID - m.BirthYear )
   6:         ELSE ( p.yearID - m.BirthYear - 1 ) END) AS Age
   7:     , p.ER/p.IPOuts/3*9 AS ERA
   8:     , p.IPOuts/3 AS IP
   9: FROM pitching p
  10:     , (SELECT CONCAT(nameFirst," ",nameLast) AS Name
  11:         , playerID FROM master) m
  12: WHERE p.yearID > 1943
  13: AND p.playerID = m.playerID
  14: GROUP BY playerID
  15: HAVING Age BETWEEN 25 AND 29
  16: ORDER BY ERA Asc
  17: LIMIT 50;

The CASE statement allows you to have different outputs based upon the underlying data. This could have easily been written as an IF statement as well:

   1: , IF(m.birthMonth < 7
   2:     ,( p.yearID - m.BirthYear )
   3:     ,( p.yearID - m.BirthYear - 1 )) AS Age

I like CASE because it feels cleaner, personally. You can also use multiple WHEN statements inside a CASE.

Saving your work

Let’s say you’ve written an amazing, bad-ass query. You know you’re going to want to look at this one later. You have two options:

   1: CREATE VIEW view_name AS
   1: CREATE TABLE table_name AS

Simply put either one of those at the front of your query.

So what’s the difference? A view:

  1. Updates whenever the underlying data does.
  2. Preserves the query used to generate it.

Creating a table:

  1. Does not update to incorporate changes in data.
  2. Does not run the query every time it’s called.

A view is more flexible; a table is quicker.

Also, I recommend creating a new database to store tables in – multiple databases if you have multiple projects. So let’s say we wanted to start a new database for a project that does Marcels projections. Right-click in the pane on the left and select “Create Database.” Call it Marcels and click on it to make it the active database. In order to run queries on tables in the Baseball Databank, simply prefix the names of the tables with bdb., like so:

bdb.batting

bdb.pitching

Well after that little slip, it would hardly be fair to send you on your way without some Marcels projections, wouldn’t it?

First off, I just want to give thanks to Tangotiger, who came up with the Marcels, and Sal Baxamusa, whose spreadsheet versions of the Marcels were invaluable as learning aids for me.

I tried my best to reimplement the Marcels as faithfully as possible, but there are (likely) differences between Tango’s work and mine. All credit goes to him and Sal, all blame for mistakes and errors goes to me.

Here’s the code for hitter projections; pitcher code is still undergoing revisions. Have fun with it, play around with it. Try to make it better, try to break it, try to take it apart and see how it works.

Also, some further reading if you’re interested:

I don’t want to say that wraps up all I have to say on the topic – it doesn’t – but it covers most of the important, have-to-know stuff. I think – but am not promising – I’ll be doing a third installment of this, and so if there’s some topic you want to see covered more or something you feel is missing, now’s the time to let me know.

Advertisements

22 Responses to Building a sabermetrician's workbench, part II

  1. Josh Engleman says:

    Not sure if it’s just me, but I can’t get to the Marcel code. The pastebin link seems to be down.

  2. Colin Wyers says:

    Pastebin isn’t working for me right now, either. Here’s a backup.

  3. lar says:

    i’m a fairly experienced sql guy, but I appreciate your hints on tools to use (I’d been using access on my personal machine, but I much prefer your suggestion of SQLyog and I hadn’t known about the Baseball Databank database before, just the lahman db).
    Anyhow, is there a preferred way of getting at a player’s primary fielding position that i’m missing… the only thing I can think of uses a subquery or two, and that seems like it’d slow some things down… Are there other pieces of data like this that I’m forgetting but might be a little harder to get at than normal?
    thanks again.

  4. Colin Wyers says:

    It’s an ugly query, but it seems to work pretty well, and it doesn’t use any subqueries:
    SELECT playerID
    , yearID
    , teamID
    , (CASE WHEN G_p = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “P”
    WHEN G_c = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “C”
    WHEN G_1b = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “1B”
    WHEN G_2b = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “2B”
    WHEN G_3b = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “3B”
    WHEN G_ss = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “SS”
    WHEN G_lf = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “LF”
    WHEN G_cf = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “CF”
    WHEN G_rf = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “RF”
    WHEN G_dh = GREATEST(G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_dh)
    THEN “DH”
    ELSE “UNK”
    END ) AS POS
    FROM appearances
    ORDER BY POS;
    This only works for years 1973 and up, which is all the appearances table covers. I know Tom Tango had a script for this on his website from a while back to work with the fielding table.

  5. Josh Engleman says:

    I really appreciate what you are doing here, Colin. If you are looking for any suggestions, I think it would be great if you broke down what what going on in the Marcel query. Not even just from a mysql standpoint, but also from a marcel standpoint. I know Tango has the Marcel set-up broken down, but it is also kind of vague to someone who doesn’t really know what they are looking at. Anyway, just a suggestion and I’m thrilled you’ve done this work so far.

  6. Colin Wyers says:

    Here’s a query for primary position that works on the fielding table:
    SELECT playerID
    , yearID
    , teamID
    , MAX(G) AS G
    , POS
    FROM (SELECT * from fielding
    WHERE IF(yearID>1995 AND POS = “OF”,1,0) != 1 ORDER BY G Desc) f
    GROUP BY playerID, yearID, teamID;
    I would definately check those results first – I’ve done some quick validation on the results but nothing too thorough.
    I did a writeup of the Marcels a while back on my old blog (ignore the references to Part II, there never was one.)

  7. lar says:

    thanks, Colin. I think the second one is what I was thinking of, but I was trying to make it too complex (by putting the POS as part of the group by, which complicates things to no end).
    for others who are looking at this, you’ll have to remember to pay attention to players who are traded mid-season, as they show up with multiple “stints”.
    I recommend creating a view for full year batting/pitching/fielding stats. for example, the batting and fielding views I created are:
    create view full_year_batting as
    select playerid, yearid, sum(g) as G, sum(g_batting) as G_batting, sum(ab) as AB, sum(r) as R, sum(h) as H, sum(2b) as 2B,
    sum(3B) as 3B, sum(HR) as HR, sum(RBI) as RBI, sum(sb) as SB, sum(CS) as CS, sum(bb) as BB, sum(so) as SO, sum(IBB) as IBB,
    sum(hbp) as HBP, sum(SH) as SH, sum(SF) as SF, sum(GIDP) as GIDP, sum(g_old) as G_old
    from batting
    group by playerid, yearid
    create view full_year_pitching as
    create view pitch_year_total as
    select playerid, yearid, sum(w) as W, sum(L) as L, sum(G) as G, sum(gs) as GS, sum(cg) as CG, sum(sho) as SHO,
    sum(sv) as SV, sum(ipouts) as IPouts, sum(H) as H, sum(ER) as ER, sum(HR) as HR, sum(BB) as BB, sum(SO) as SO, sum(IBB) as IBB,
    sum(wp) as WP, sum(HBP) as HBP, sum(BK) as BK, sum(BFP) as BFP, sum(GF) as GF, sum(R) as R
    from pitching
    group by playerid, yearid ;
    (when creating these full year views, you’ll have to re-figure the averages, like ERA or Batting Average)

  8. Josh Engleman says:

    OK, I assumed that using the Marcel code by simply copying and pasting it would work, but I don’t think that it is. I don’t see tables created anywhere after I run the query. Besides creating a database called “marcels”, is there a step that I am missing?

  9. Colin Wyers says:

    You’ll have to run each query seperately, Josh. (Next to the green play button is a double green arrow that will batch run several queries at once.) You should end up with two tables and five views (you can’t do subqueries with view in MySQL, so I couldn’t make everything a view unless I wanted to make even MORE views.)

  10. Colin:
    Thanks so much for this series.
    I’m sure I’m being an idiot again, but do I need to be “in” my Marcels database when I run the query? I’m not sure I’m getting it to work right.
    Does F9 run a single query, or all of them?

  11. TucsonRoyal says:

    First of all, thanks for the articles. It is a relief to be off Excel.
    Second, without having to use the play by play of the Retrosheet logs, is there a database of just game logs, basically the box score or player game-by-game data? Thanks.

  12. Colin Wyers says:

    Retrosheet does have game logs in addition to play-by-play data. I haven’t done anything with them.
    You should be in your Marcels database when you run the queries. F9 will only run the currently selected query (I’m at work and can’t tell you what the hot-key is to run all queries at once).

  13. David Olson says:

    Whenever I try to run a query for ages, I get an error that says Unknown column ‘m.birthMonth’ in ‘field list’
    So then I put m.birthmonth and both m.birthYears in quotation marks so it looked like this
    select p.playerID
    , m.Name
    , p.yearID
    , (CASE WHEN “m.birthMonth” 1943
    AND p.playerID=m.playerID
    GROUP BY playerID
    Having Age BETWEEN 25 AND 29
    ORDER BY ERA Asc
    LIMIT 50;
    When I do that nothing shows up at all…what am I doing wrong?

  14. TucsonRoyal says:

    I guess the only idea I would like for you go over if you do a #3 is the importing of retrosheet data (I have tried to follow Tom Tango’s Wiki and get lost when it begins to switch between SQL and Oracle) or having a database available to download in SQL format. If hosting the download is a problem, I would gladly host it.

  15. john says:

    yeah something with retrosheet pbp data, gamelogs or pitch fx info would be great.

  16. Zach says:

    How would I know if players who hit more home runs receive higher salaries than players who hit fewer home runs?
    and also
    How would I know if players who have nick names (nameNick in the Master table) tend to receive higher salaries than players that don’t have nick names?

  17. Sal says:

    what would be the SQL query to find out if players who have nick names (nameNick in the Master table) tend to receive higher salaries than players that don’t have nick names?
    what would be the SQL query to find out if players who hit more home runs receive higher salaries than players who hit fewer home runs?

  18. Montecristo says:

    what would be the SQL query to find out if players who hit more home runs receive higher salaries than players who hit fewer home runs?
    and what would be the SQL query to find out if players who have nick names (nameNick in the Master table) tend to receive higher salaries than players that don’t have nick names?
    Thanks

  19. Montecristo says:

    What would be the SQL query to find out if players who have nick names (nameNick in the Master table) tend to receive higher salaries than players that don’t have nick names?
    what would be the SQL query to find out if players who hit more home runs receive higher salaries than players who hit fewer home runs?
    Thanks

  20. snites9113 says:

    Hey Colin (or anybody else who can answer),
    I’m trying to create a database of Japanese stats. I found a link to them on this site (http://statspeak.net/2008/12/2008-japanese-stats.html) but they’re in .csv format. Do you know how I would convert this to a SQL format so I can put all this into practice??
    I’ve tried this:
    LOAD DATA INFILE ‘C:\\…\\pitching.csv’
    INTO TABLE jstats.pitching
    FIELDS TERMINATED BY ‘,’
    ENCLOSED BY ‘”‘
    ESCAPED BY ‘\\’
    LINES TERMINATED BY ‘\r\n’
    It doesn’t seem to be doing anything. Can anybody help me here?
    Thanks.

  21. anon says:

    I had this same problem, and here’s what ended up working for me. This line of code from the article
    , (SELECT CONCAT(nameFirst,” “,nameLast) AS Name , playerID FROM master) m
    I rewrote like this:
    , (SELECT CONCAT(nameFirst,” “,nameLast) AS Name, playerID, birthmonth, birthyear FROM master) m
    Not sure why, but that seemed to work.

  22. Pingback: And So It Begins… | MSilb Baseball Scouting and Analytics

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: