Building a sabermetrician’s workbench, part I

Introduction

There are a lot of budding sabermetricians out there. (Actually, I’m sure that’s not true. But it sure seems that way. Of course, this is because I tend to hang out in places, like Tango’s blog and Beyond The Boxscore, that are popular hangouts for budding sabermetricians.)

Now let’s face it – life is rough out there on the streets, cadging data from websites and working with spreadsheets that start to look like rat’s nests. You’re tired, cold and hungry. You want to just put away your “Will Work For Data” signs and do some real research.

There is a better way! That way is… the relational database. A relational database offers you all the brilliant promise of tomorrow, and I don’t mean some sort of Total Recall tomorrow but a totally boss tomorrow, like an Epcot Center tomorrow without all the vague creepiness.

What I’m writing here is, or at least is supposed to be, the sort of thing I wished I had about a year ago – a step-by-step explanation of how to get started using a database to do sabermetric research.

I am not a professional DBA admin, or a programmer of any kind – what you are getting is the approach of a dedicated hobbyist but a hobbyist nonetheless. I may not follow “best practices.” I will not be explaining a lot of database theory or lecturing you on the importance of being “relational.” Quite frankly, I don’t know too much about that myself. And if you have your own database setup and workflow already, your setup may differ from mine. I hope you can still derive some value from this.

Getting a database

First, you need a database. I use MySQL. If you do a Google search, you may well find a lot of arguing over whether or not MySQL is any good. (For what it’s worth, a lot of large companies and government agencies use MySQL for serious work.) You’ll hear them argue about things like “transactions” and the ACID model. None of this applies to folks like you and I, trust me. And the best part – it’s absolutely free.

Here’s how to install it:

Download MySQL 5.1 CE Essentials. (It should be the first link under “Windows Downloads.” If you aren’t running Windows… well, I suspect you know more about your Linux/Mac/Amiga machine than I do, so I leave that as an exercise to the enterprising reader.) Then, once you’ve downloaded it, run the MSI file.

Choose a “typical” install.

There will be some ads. Skip over them. Make sure that “Configure the MySQL server now” is checked. Chose “Detailed configuration.”

Then you’ll go through a set of pages asking you questions. Here are the answers:

  • Developer Machine
  • Non-Transactional Database Only
  • Decision Support (DSS)/OLAP

Continue on, using the defaults.

Either add a password or uncheck the box to alter security settings. (This will leave your root password blank.)

That should do it! From here on out, you shouldn’t have to start MySQL at all – it should be ever vigilant, waiting for you to command it to do your sabermetric bidding.

Talking to the database

Let’s face it – your time is valuable, and while you’re not exactly lazy you really don’t want to do any more work than you have to. (Well, maybe you are lazy; I really wouldn’t know.) So as much fun as it sounds to sit there and labor over a dark command prompt, what you really want is a graphical front end to pretty up your interactions with the database.

Again sticking with the theme of “absolutely free,” I use SQLyog. You want the “Windows-Binary” from the Community Edition. Installation is straightforward – just download it and click “Next” a few times. Then run SQLyog.

From there, you’ll be presented with a screen asking you for the settings to connect to the database. Everything should be correct – if you selected a password, type it into the password blank. Then hit “Connect.”

And… voila! You’ve been connected to, well, to an empty database. I suspect you want some actual data in your database.

Putting the “data” in database

There’s obviously a lot of different things you could want to have in your database, depending on your area of interest. And at some point, you may want to dive into the exciting world of play-by-play data ,from Retrosheet and Gameday. But let’s start off modestly, if not exactly small. Let’s start with the Baseball Databank. It’s a record of pretty much every official statistic for every baseball player in the history of the majors. It’s a great resource, and a great starting point.

First, download the ZIP file. Extract the .SQL file from the ZIP, and remember when you put it. Now go into SQLyog.

In SQLyog there will be three panes – the left-hand side should list your databases, the top is where you write queries, and the bottom shows the results. Let’s add a database. Right-click in the pane, and select “Create Database.” Call it “BDB,” leaving off the punctuation. Leave the rest as the defaults, and click okay.

Now, let’s populate the database. Right-click on the database, and select “Restore from SQL Dump.” There should be an button with an elipsis (…) in it. Click that, and select the file we extracted from the ZIP. Select Execute. It’ll ask you if you want to execute the queries in database BDB. You do. This should take a little while.

Once that’s done, go into your database and click on the folder that says Tables. If nothing is there, try hitting “F5” on the top row of your keyboard to refresh the browser.

Here’s a little screenshot of what your screen is supposed to look like now. (Okay, so I do have a few… other databases as well.)

Go ahead and click on batting, and then slide on over to the right and click on "Table Data.” Scroll around for a bit. If you’ve ever used a spreadsheet or looked at a table before, the layout should feel familiar. Now it’s time to play around with the database a little.

Asking the database questions

The fundamental unit of the database is the table, which is further broken down into the row and the column. Each row is a “record” in the database; you can have as many records as you like. Each column is given a specific name, which is how you address it.

You won’t be entering formulas into cells, like you would a spreadsheet. You interact with the data in the spreadsheet by writing queries. We’ll start off with the simplest query possible.

Slide up into the query editor and type in this:

   1: SELECT * FROM batting;

There is a green play button in the upper left-hand corner; press it. (Or just hit F9.) You should get… well, to be rather anticlimactic, you got the same table we were just looking at. Most of the column headers should be pretty self-explanatory – H is hits, SF sacrifice flies, etc. playerID is each player’s unique identifier, taken from the first five letters of the last name and the first two letters of the first name, plus a number. Some are easier to grok than others.

[A little note: for your convenience, SQL commands will be highlighted in blue in all of the code samples. SQLyog provides even more highlighting for your code.]

To explain:

  • SELECT is your bread-and-butter SQL command; it tells the database to fetch records out of the table.
  • * is a wildcard; it tells the database to fetch all columns.
  • FROM tells the database what table to fetch from.
  • The semicolon tells the database that our instruction is complete.

Let’s expand upon our code sample a bit now. Say we only want to look at the 1990 Royals:

   1: SELECT * FROM batting
   2: WHERE yearID = "1990" AND teamID = "KCA";

If SELECT is our bread and butter, WHERE is our butterknife. It tells the database to only go and fetch the records we need. What we’re telling it is to only give us records where the yearID column reads 1990 and the teamID column reads KCA. The quotation marks are important – if you don’t use quotation marks, MySQL will assume you’re referring to a column of that name.

There’s a special value here: wherever it says (NULL), it’s telling us that cell is empty. In the batting table, NULLs are generally used to indicate:

  • Players who appeared in a game but had no official plate appearance (generally relief pitchers, pinch-runners and defensive replacements).
  • Statistics that were not tracked that year (generally sacrifice flies, double plays, etc.).

In this case, we really only want to see players who went to the plate. There’s a way to handle NULL values:

   1: SELECT * FROM batting
   2: WHERE yearID = "1990" AND teamID = "KCA" AND AB IS NOT NULL;

And that should give us the raw counting stats of everyone who played on the Royals that year. Suppose we want to know the “sabermetric triple crown” stats of batting average, on-base percentage and slugging percentage. (Tango’s sabermetric wiki is a great source of formulas to try out, incidently.)

   1: SELECT *, H/AB AS AVG
   2:     , (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF) AS OBP
   3:     , (H+2B+2*3B+3*HR)/AB AS SLG
   4: FROM batting
   5: WHERE yearID = "1990" AND teamID = "KCA" AND AB IS NOT NULL;

You don’t have to break up the different lines; I do that for legibility. You do want to seperate each thing you want to select by commas. I put the commas at the start of each individual line, but that’s a personal style preference. SQL simply ignores carriage returns (that is to say, line breaks), tabs and extra spaces, so feel free to “pretty up” your code in a way that makes it easy for you to read.

SQL will do any simply arithmetic with the records you provide. (You can also do a lot more complex math with it if you really want to.) The AS command is very helpful – otherwise the column will end up being named “(H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF),” which simply isn’t very convenient.

Let’s say we don’t want every column, we just want a few columns:

   1: SELECT playerID, AB, H, 2B, 3B, HR, BB
   2:     , SO, IBB, HBP, SH, SF
   3:     , H/AB AS AVG
   4:     , (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF) AS OBP
   5:     , (H+2B+2*3B+3*HR)/AB AS SLG
   6: FROM batting
   7: WHERE yearID = "1990" AND teamID = "KCA" AND AB IS NOT NULL;

Bored of the 1990 Royals yet? Let’s try something different.

   1: SELECT playerID, yearID, teamID, HR
   2: FROM batting
   3: ORDER BY HR Desc
   4: LIMIT 50;

ORDER BY tells the database to sort the list; Desc tells it to go from large to small. (Asc is the corresponding command to sort from small to large.) LIMIT tells SQL to only return the first 50 records.

So that’s the top 50 home-run seasons. (Supply your own asterisks where you feel it necessary.) But what if we want career totals?

   1: SELECT playerID, SUM(HR) AS HR
   2: FROM batting
   3: GROUP BY playerID
   4: ORDER BY HR Desc
   5: LIMIT 50;

The SUM command will sum everything the column unless you stop it. GROUP BY tells the database to combine everything based upon player ID.

Now what if want to look at the top 50 career batting averages? We obviously don’t want to see some pitcher who lucked his way into going 2 for 3 and never saw the majors again. So let’s say we want a minimum 3000 career plate appearances.

   1: SELECT playerID
   2:     , SUM(H)/SUM(AB) AS AVG
   3:     , SUM(AB+BB+HBP+COALESCE(SF,0)) AS PA
   4: FROM batting
   5: GROUP BY playerID
   6: HAVING PA >= 3000
   7: ORDER BY AVG Desc
   8: LIMIT 50;

Because we’re using the SUM function, we have to use the HAVING function instead of the GROUP BY function. Note the >= – that’s greater than or equal to.

So, what’s that COALESCE doing there? Remember, sacrifice flies weren’t officially recorded until 1954. For years prior, those values are NULL in the database. Trying to sum a NULL value chucks everything out. Using the COALESCE function in this way tells the database to substitute 0 in the place of a NULL. Otherwise, players like Ty Cobb would be excluded from the list (precisely the sorts of players we don’t want excluded!)

We’re just scratching the surface here, but alas the rest will have to continue next week. A few quick notes and then I’ll leave you to your own devices for a while.

First, some helpful resources:

One final tip for SQLyog – if you right-click on a table, you’ll have options to export as a CSV file or copy to the clipboard (perfect for dumping data into an Excel spreadsheet.)

And now, a little challenge for you. In addition to the batting table, there’s a pitching table as well. Try writing a query to list the 50 lowest career earned run averages, minimum 1000 innings pitched. (Here’s a little tip: the Databank doesn’t list IP for pitchers, simply outs. Remember to divide by three.)

See you next week!

Advertisements

33 Responses to Building a sabermetrician’s workbench, part I

  1. Tim Daloisio says:

    Man…I hope you grade “on the curve” 😉
    Seems like I’ve got some homework to do!

  2. john says:

    Good stuff.
    If I have one from 2007, is there a way to just update it when the 2008 data comes out? Or do you just delete the database and make a new one each time?

  3. Sky says:

    This is as good a motivation as any to finally learn to use SQL like a pro. Great idea, Colin, and I’ll thank you in advance for everything that comes out of this.

  4. Sky says:

    I’ll be using a Mac for this, so if anyone wants to chime in on anything that might be different for someone running OSX 10.3.9, it would be appreciated.

  5. Josh Engleman says:

    Colin,
    Thanks a lot for this. I’ve been trying to get myself away from Excel and into mysql, but I just haven’t been able to wrap my head around it. I know you are planning on more parts for this, but I have a quick question.
    In your example where you create AVG, OBP and SLG, I tried to go one step further and have OPS. I keep getting an error when when I add a 4th line saying
    , obp + slg as ops
    Are you not able to use the new labels in the same way as the ones that are already given in the table?

  6. Colin Wyers says:

    I’d just make a new database, John. There are several new tables in this release, and some new fields in the batting table as well. (I may be missing some other differences – the most recent release of the BDB was literally released yesterday. I’m still toying with it myself.)
    Sky, I’m really unfamiliar with MySQL on OSX. I know the front end will be different (I’ve used Navicat on Windows before, and I know that’s available on Macs as well). MySQL should be well-supported on that platform, however.
    And you’re right, Josh – you can’t use aliases like that. There are several solutions, depending on whether or not you’re using GROUP BY. If you’re not, here’s how I would handle it:
    SELECT playerID, AB, H, 2B, 3B, HR, BB
    , SO, IBB, HBP, SH, SF
    , H/AB AS AVG
    , @obp := (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF) AS OBP
    , @slg := (H+2B+2*3B+3*HR)/AB AS SLG
    , @obp + @slg AS OPS
    FROM batting
    WHERE yearID = “1990” AND teamID = “KCA” AND AB IS NOT NULL;
    The @ sign indicates a “user defined variable.” I’ll go into more detail about that later.

  7. Redsauce says:

    Great topic to coverThe book Colin, it is very helpful. For those looking for further eading, ‘Baseball Hacks’ by Joseph Adler (http://oreilly.com/catalog/9780596009427/) also covers this topic. I found it very useful when I started messing around with the numbers. And no, I’m not in any way affiliated with the author or publisher.

  8. john says:

    I agree. Baseball hacks is pretty good.

  9. Josh Engleman says:

    Colin,
    Thanks. That’s exactly was I was looking to do.

  10. Jake Russ says:

    Just great, something else to keep me distracted from my grad school work.
    In all seriousness Colin, thank you for writing this and I’m looking forward to the next one. I’ve been using excel, access and Stata. If I could switch over to just using MySQL to store and play with data & Stata to run the regressions that would eliminate a step.

  11. Colin Wyers says:

    Baseball Hacks is a great book, no doubt. Just as a fair warning, though, to people who are considering it – a lot of the code samples in there are buggy or no longer work (particularly the Gameday stuff).
    Also – and this may just be me – but I dislike Perl heavily. I keep toying with the idea of looking into Python for some of my scripting needs.

  12. Colin Wyers says:

    STATA is a bit (a bit!) beyond my means. I use Excel or GNU R for hardcore number crunching, although I must admit that I’m really a novice at that stuff. I’ve heard some nice things about gretl, and I may try that out soon.

  13. wcw says:

    My opinions, mostly not from baseball but from my real job, primarily as an investment geek, and occasionally as a large-dataset mechanic:
    – Excel for ‘number crunching’ is ever and always a mistake. I have been forced by circumstance to use it for same for a decade and a half now, and all the process ever gets is worse. If you don’t have, don’t.
    – stata is a nice program, but too expensive. Its real advantages over R (assuming you can afford it) are multicore awareness and memory efficiency. Me, I prefer using R and buying more RAM. I like R a lot, despite its manifold warts and the annoying — but supremely useful — user community.
    – once you start hitting decently large data sets (the Census’s ACS, say, with its 3 million rows and million households and many columns), R and Stata will not cut it for certain purposes. Again assuming you can beg borrow or steal it, Sas is good here. Avoid spss like the plague it is. If you can’t afford Sas, I believe pspp (the free-beer/free-speech spss replacement) is now working adequately, if imperfectly. Use it to prune your problems enough to run them in R.
    – mysql sucks, sucks, and sucks some more. That has almost nothing to do with transactions and ACID. It has everything to do with performance (mostly poor, be honest) and standards compliance (mostly absent). If you don’t care about standards and just care about performance, even if you have drunk the mysql koolaid you might as well roll your own with Berkeley DB tables or something — that performance should kick mysql in the nuts in the absolute friendliest of circumstances. Or hell, use flat files. But mysql is a network-effect disaster. Ugh. Me, I favor postgres, but in a world with postgres and sqlite and where Oracle and DB2 and sybase (I think) are free to download for noncommercial use there is no excuse for anyone ever to use mysql again.
    But hey, I’m just an internet crank who’s been running numbers of one sort or another for more than ten minutes. What do I know?

  14. Brian Cartwright says:

    I have MySQL installed, but I haven’t dragged myself away from Access’ GUI. Only problem with Access is that his a maximum size – when working with RetroShet I can only load ten years worth of events at a time. Stuff like Baseball Databank tun fine. In fact, I have DBD, KJOK & Westbay’s Japan db all merged into my Retro db.
    In Access, I can explicitly declare user defined variables by putting them inside braces, such as [OB]+[SA] AS [OPS]

  15. Kanka says:

    Sky,
    I’m running MySQL on a Mac, but I’m such a nerd that I really don’t use a GUI tool. I use MySQLcc in Linux, but it looks like that doesn’t play nicely with a Mac. mysql.com has some GUI tools (specifically the Query Browser) if you want to take a look:
    http://dev.mysql.com/downloads/gui-tools/5.0.html

  16. Kanka says:

    I agree that Baseball Hacks is a good book. But I’d love to see a second edition made, especially if it include newer stuff like PitchF/X.

  17. Colin Wyers says:

    I don’t think there’s an abstract situation where you can say that MySQL’s performance is always worse (or better) than the competition’s, wcw. For my purposes (not necessarily the BDB) I’ve gotten quite far with using MyISAM and partitioning. I don’t need to use transactions, and so having the ability to do away with that overhead is nice.
    As for Oracle, I believe there’s a size limit of 4GB for free use, which is insufficient (the events table alone for my Retrosheet database is 8 GB, not to mention the games and rosters tables.)
    I don’t pretend to be an expert; I don’t even pretend to know that MySQL is the most optimal solution for me. But it seems to work fine.
    I really dislike the “holy war” aspect that seems to be growing up around a lot of software choices these days. I’m at the point in my life where I just want to have software that I can use to get to doing the things that I really care about.

  18. Brian Cartwright says:

    I have MySQL installed, but I haven’t dragged myself away from Access’ GUI. Only problem with Access is that his a maximum size – when working with RetroShet I can only load ten years worth of events at a time. Stuff like Baseball Databank tun fine. In fact, I have DBD, KJOK & Westbay’s Japan db all merged into my Retro db.
    In Access, I can explicitly declare user defined variables by putting them inside braces, such as [OB]+[SA] AS [OPS]

  19. Hyltzn says:

    Colin, after I try to open it to install it, it tells me this:
    “This installation package could not be started. Contact the application vendor to verify that this is a valid Windows Installer package.”
    I downloaded it through the main link, and then I tried a couple of different mirrors. Didn’t work.
    Any help?

  20. Colin Wyers says:

    That usually means you have an older version of Windows Installer. Download the latest version, and that should solve your problems.

  21. Sky says:

    Any thoughts on phpMyAdmin? I already have MAMP set up on my 10.3.9 OS… I read that it’s intended for web database administration. Is that bad? Good?

  22. Sky says:

    If it helps for other Mac users, I have the BDB loaded using MAMP (I think it already comes on a Mac or else you download it from mamp.org) and Navicat as the GUI. Navicat has a free version without some nice features, but I’m currently on the 30-day full version free demo. Importing the .sql format was a bit counter-intuitive — don’t use the import wizard. Instead, right click on on the empty database in the left pane and select “execute batch file”. Select the .sql file to import and let it do its thing.
    *** ***
    I assume you StatSpeak guys are aware that nested comments are screwed up? I.e. out of order…

  23. wcw says:

    The problem is network effects.
    Every additional person who uses a terrible program like mysql, even once and even for a problem for which it is not too ill-suited, adds to the user headcount. That userbase size is what keeps this craptastic item in motion on a grand scale, for every problem, for many of which it is deeply ill-suited.
    As a result, I feel happy waging holy war against mysql even for small problems like this. If even one of the dozens of baseball obsessives who reads this post installs postgres or db2 instead of mysql, that’s a tiny little victory in the fight against abysmal software turned into the default solution by network effects.
    Once the network effect is dead, I’ll happily support mysql for those problems to which it is actually suited. When the network effect is dead, however, nobody will write an intro for newbies in which mysql is suggested as a default.

  24. Colin Wyers says:

    wcw, do you see how perhaps this approach isn’t exactly incentivizing me? What you call a network effect I call a lot of cheap, plentiful documentation, available tools and a support community. Why should I give that up? Why should I redo all my schemas? You can talk about performance but for what I’m doing MyISAM and all transactional processing disabled is probably tops in performance.
    Maybe you’re right in the long run, I’m really not qualified to say. But you really haven’t given me a reason to care. I sure don’t want to have to resort to writing queries in Berkley DB or use flatfiles and write my own querying engine from scratch. Being right is no substitute for being persuasive.

  25. Rally says:

    Any advice for pulling tables that are already in Access to Mysql? I’ve got the retrosheet event files set up in about 10 Access databases.
    I see the option “Restore from SQL dump” which I’ve used to but the baseball databank and gameday files in, but I’m not sure how to bring in tables from another DB or from text files.

  26. Colin Wyers says:

    It’s pretty simple; you just have to do it from Access.
    Install the ODBC connector. Here’s some documentation for using it with Access. To be honest it’s been quite a while since I’ve set it up, so I’m a little murky on the details. But if you need any more help, feel free to ask.

  27. Rally says:

    I’m missing a step. I tried following the documentation, but after I select export and chose ODBC, it gives me a box that says export with 2 buttons, OK and cancel. There is no dialogue box.
    It either does nothing, or exports somewhere I have no clue, as I never was able to give it a file location.

  28. Colin Wyers says:

    You should be able to export directly to MySQL, with no need for a file inbetween.
    What version of Access are you using?

  29. Rally says:

    Access 2003. I don’t think we’re even speaking the same language here. And it’s entirely my fault, I don’t know what I’m doing here.
    When you say export directly to mySQL, where exactly should I be pointing to? On my computer it installed to program files/mysql. Under that are two folders, connector ODBC 5.1 and MySQL server 5.1. There are a lot of subfolders under that.
    A friend is coming over for dinner who does database work. I’ll see if he can help set this up.

  30. Anonymously says:

    I’m trying to find the highest career OPS for in the history of the majors, but I’m receiving some errors. Can you tell me what I’m doing wrong?
    Here’s what I’m typing:
    SELECT playerID, AB, H, 2B, 3B, HR, BB
    , SO, IBB, HBP, SH, SF
    , H/AB AS AVG
    , @obp := (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF) AS OBP
    , @slg := (H+2B+2*3B+3*HR)/AB AS SLG
    , @obp + @slg AS OPS
    , SUM(AB+BB+HBP+COALESCE(SF,0)) AS PA
    FROM batting
    HAVING PA >=3000
    ORDER BY OPS Desc
    LIMIT 50;

  31. wcw says:

    SQL is a standard. You could install any SQL server and avail yourself of every mysql doc/tool/community that conforms to the standards. The only thing unique about the latter is for help when mysql is broken and/or nonstandard.
    I honestly doubt mysql is tops in performance for this sort of single-user work (I’d bet on SQL Server myself, but but who knows?). My point about flat files and BDB tables was that you do not want ‘tops in performance’ here, or you’d be using them. You want standards compliance and non-brokenness.
    But hey, if it works for you, bully. I just wish you and those like you wouldn’t immediately point newbies at it without noting that the program has serious issues, and that there are many alternatives.

  32. aidebra says:

    I it contain hemet car insurance comparison .online car insurance saint petersburg could saint paul car insurance rates newburgh car insurance comparison ,the need for car insurance in la
    about can ma auto insurance The best too car insurance laws .so [url=http://www.edgar-schwarz.de/mm/CarInsurance?action=AttachFile&do=get&target=m]car insurance quote cheapest[/url] the cheapest car insurance in town .

  33. Samg says:

    for sum reason, it doesn’t like my SUM function.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: