Stats 201: Binary logistic regression (or why your team didn’t make the playoffs)

I suppose that it’s the mark of a true nerd that I actually have a couple of “favorite” statistical techniques.  (Nerd pride!)  My training is as a psychologist and I spend a lot of my time at my “real” job studying human behavior, so a lot of the questions that I ask when I am playing around with my Retrosheet files go something like “Why did he do that?”  Why did the third base coach send the runner there?  Why did the voters go for Colon over Santana in 2005? 
Baseball is full of a thousand little decisions, made on the field and off.  Players don’t always make the same decision each time, but some pretty stable patterns do develop over time.  The nature of baseball, though, is such that most decisions can be broken down into a simple form: yes or no?  Should I or shouldn’t I?  Either I swing or I don’t.  (I may check my swing, but the umpire will make a yes-or-no ruling on whether it counts as a swing.)  Many of these decisions are made thousands of times over the course of a season (e.g., swinging or not), so we usually have plenty of data points on which to base our research.
The problem comes in how we study such things.  When looking at questions like “What factors influence X?” we usually reach for a multiple regression technique.  What factors influence HR rates?  Throw in a bunch of predictors and see what shakes out!  But an outcome that is a yes/no question doesn’t quite work like that.  We need a statistical test for just this type of occasion, and thankfully we have one: binary logit regression.
Hardcore math alert ahead.
Let’s take the simplest yes/no question in baseball.  Did my favorite team win the game tonight?  If baseball games were decided by coin flips, then I would expect that the answer would be yes 50% of the time and no 50% of the time.  But, of course, baseball games are not usually 50/50 propositions.  (Don’t believe me?  The Twins are playing the Royals.  It’s Johan Santana against whomever the Royals just called up from AAA.  Whom are you picking?)  The circumstances affect the odds.  But at the end of the game, the Twins will either get 100% of a win or 0% of a win.
Statistically, this is a problem.  Most dependent variables that we use in regressions are continuous in nature (they increase incrementally, like HR, which increase by one every time you hit the ball over the wall), and can be assumed to follow what is called in statistics a normal distribution.  (I should note that often this isn’t really the case, but we go ahead with it anyway.)  Yes/no questions (in statistics, we call these dichotomous variables) follow a different distribution, called the Bernoulli (or, over repeated trials, binomial) distribution.  Fortunately, a probability can be turned into something close enough to normal through a few math tricks.  First, take the probability (p) of an event.  Let’s say you think the Twins have an 80% chance of winning that game.  Now, we need to turn that into what are known as the odds.  The formula is easy: p / (1 – p).  So in this case, the odds for 80% are .8 / .2, which equals 4.0.  If anyone has ever given you 4:1 odds, that’s what they meant.  They believe that the chances of the Twins winning are four times greater than those of the Royals.  (When we fit the regression to real data, p comes from the actual percentage of the time that the Twins won, not from my hunch.)  Next, we take the natural log (the logarithm with a base of e) of the odds.  The result, called the log-odds or logit, can range over all the real numbers instead of being squeezed between 0 and 1, which gives you a quasi-normal distribution that is good enough for government work.  This method is also actually the proper way to work with rate variables, such as strikeouts per plate appearance, in a regression… not that anyone actually does that.
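To make that arithmetic concrete, here is a quick sketch in R (the same language a commenter uses below) of the probability-to-odds-to-log-odds conversion just described:

p <- 0.8              # probability the Twins win
odds <- p / (1 - p)   # odds = 4.0, i.e., 4:1
log_odds <- log(odds) # natural log of the odds, about 1.386
plogis(log_odds)      # plogis() inverts the logit: returns 0.8

The last line shows the round trip: the logit stretches probabilities out over the whole number line, and the logistic function squeezes them back into the 0-to-1 range.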
If you have a program that already does binary logit, it will do all of this for you.  You can then input your predictors the same way you would in any regression (perhaps I need to write a more basic tutorial on regression first?) and see what shakes out.  Like any regression, it takes the data and tries to fit a function that best describes what’s observed.  Let’s take an example.  In this article by my predecessor on this blog, David Gassko takes a look (in part) at how probable it is, given a certain number of wins, that a team will make the playoffs.  (He even uses binary logit!)
Take a look at the second graph in that article.  It shows the basic shape of a binary logit function.  Those of you who have familiarity with regression are likely used to simple linear models, in which one extra unit of the independent variable leads to a specific increase in the dependent variable (the regression coefficient).  In this case, we see that one extra unit of the independent variable (one win) might have a small or a huge effect on the chances of making the playoffs, depending on where in the distribution it happens.  A team with one win won’t make the playoffs, nor will a team with two wins, and so on until about 80 wins (*cough* ’06 Cardinals *cough*).  A team’s 107th win is just overkill.  You’ve clinched, guys.  Ease up.  The 90th win, on the other hand, seems to add about 12 percentage points to the probability of making the playoffs over the 89th win.
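To see that S-shape in code, here is a small R illustration.  The intercept and slope below are made up for the sake of the picture (they are not David’s actual fitted values); they are chosen only to mimic the general shape of his graph:

wins <- 80:107
p_playoffs <- plogis(-45.5 + 0.5 * wins)   # logistic curve; coefficients are hypothetical
round(diff(p_playoffs), 3)                 # how much each extra win adds to the probability

The marginal value of a win is tiny at both tails and largest near the middle of the curve, which is exactly the behavior described above.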
Let’s run a binary logit of our own.  I took all teams from 1995-2005 and coded for whether or not they made the playoffs (a yes/no question).  Then, I entered the number of runs scored and runs allowed into the equation as predictors.  Both factors (as you might expect) were significant predictors.  The final equation looks something like this:
X = .00842 – .00299 * RA + .00291 * RS
This is NOT a formula for winning percentage or, in this case, playoff chances.  This is the formula for an exponent.  Plug in whatever values of RA and RS you like, and call the result X.  To find the probability of making the playoffs given those values, plug X into the logistic function e^X / (1 + e^X).  That will tell you what chance that team has to make the playoffs.  (Side note: You will also notice that runs allowed are weighted slightly more heavily than runs scored.  Perhaps pitching really does win championships.  Perhaps the good teams were winning a good number of home games where they only came to bat eight times, not needing the bottom of the ninth.)  Last year, my beloved Indians scored 870 runs and allowed 782.  Using the formula, we take .00842 – .00299 * 782 + .00291 * 870, which equals 0.20194.  Plugging that into the logistic function, e^.20194 / (1 + e^.20194), this equation says that the Indians had a .5515 chance (55.15%) of making the playoffs last year, given their runs scored and runs allowed.
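In R, that back-of-the-envelope calculation looks like this (plogis() computes e^X / (1 + e^X) for you):

x <- .00842 - .00299 * 782 + .00291 * 870   # the Indians’ runs allowed and runs scored
x                       # about 0.202
exp(x) / (1 + exp(x))   # about .55, the same answer as above
plogis(x)               # identical, using the built-in logistic function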
As with any regression equation, a good researcher should post an R-squared value.  The problem is that there isn’t a real R-squared here, because R-squared is based on Pearson’s model of bivariate (and eventually multiple) correlation, which assumes bivariate normality.  (If you have no idea what that means, don’t worry.)  To get around this, statisticians most often report the Nagelkerke R-squared, which isn’t a true R-squared value but can be roughly read the same way.  R-squared tells you how much of the variance in the outcome variable is explained by your predictors.  In this case, the Nagelkerke R-squared was .655, or 65.5%.  So, 65.5% of the variance in the odds that a team will reach the playoffs can be explained by the two factors, runs scored and runs allowed.  Of all the things that could push a team’s chances of making the playoffs up or down, I’ve figured out 65.5% of the recipe!  (David’s example using wins has an R-squared of 74.3%.  It is, after all, wins, not runs, that get you into the playoffs.)
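For completeness, here is a hedged sketch of how the whole exercise might look in R, from fitting the model to computing the Nagelkerke R-squared by hand.  The data frame teams and its columns (playoffs coded 1/0, RS, RA) are my invention for illustration, not the original Retrosheet file:

# fit the binary logit: playoffs as a function of runs allowed and runs scored
fit <- glm(playoffs ~ RA + RS, family = binomial(), data = teams)
summary(fit)   # coefficients (as in the equation above) and significance tests

# pseudo-R-squared values, computed from the null and residual deviances
n <- nobs(fit)
r2_cox_snell <- 1 - exp((fit$deviance - fit$null.deviance) / n)
r2_nagelkerke <- r2_cox_snell / (1 - exp(-fit$null.deviance / n))   # rescaled so its maximum is 1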
I like to use binary logistic regression to study the decision-making process in baseball (everything from the decision to pick one player over another to waving the runner home as he rounds third base), but it can be used to study anything where the outcome is one thing or the other, never both.  In baseball, you’re either safe or out.  Now, we can study why.


10 Responses to Stats 201: Binary logistic regression (or why your team didn’t make the playoffs)

  1. […] Did I mention numbers-based geekery already? Statistically Speaking’s infamous Pizza Cutter sends more your way, letting you know why your team didn’t make the playoffs using a post that’ll make you wish you stayed awake in calculus. […]

  2. John Beamer says:

    PC — what is the best software package to do this in? Keep up the good work!

  3. Pizza Cutter says:

    John, I use SPSS personally, although that’s a specialty program used in the social sciences for research (psychology, sociology). I actually don’t know much about other stats programs.

  4. Ryne says:

    For those OK with a non-GUI interface, R (r-project.org) is a free statistical program useful for all kinds of models. For those interested, I believe binary logit regression can be accessed with the ‘glm’ command with the addition of “family=binomial()” after the model statement, as follows:
    glm(playoff ~ RS + RA, family = binomial())
    SPSS, SAS and the like are great if you have access, usually through a university. Great work, PC.

  5. “I suppose that it’s the mark of a true nerd that I actually have a couple of “favorite” statistical techniques.”
    That’s nothing to be ashamed of… Welcome to the club!
    Regarding the equation (e^x / (1 + e^x)): this can be more simply computed as (1 / (1 – e^x)), which produces exactly the same result.
    “…a good researcher should post his R-squared value.”
    I will vote against this one, in favor of assessing model performance on holdout data.
    “…what is the best software package to do this in?”
    My personal preference is MATLAB, but any decent statistics package will perform logistic regression.

  6. Sorry, that should be:
    (1 / (1 + e^-x))
    Good work, by the way.

  7. Jim A says:

    I’ve found R to be a good free package for someone dabbling in statistical analysis without access to commercial software. It may take a while for a novice to figure out how to do something in it, but there are lots of docs and help forums. The book Baseball Hacks has a bunch of good introductory examples using R, also.

  8. Kyle J says:

    For someone who had some basic training in statistics but has forgotten a great deal of it, these articles are great–both for baseball purposes and professional purposes. Keep up the good work!

  9. Bhagwan says:

    Hi
    I am a student at Lincoln University, New Zealand, and I am analyzing my research data. However, I am having a problem with the R2: in binary logistic regression, my Cox and Snell R2 is very low. How can I increase the R2? Could you please advise as soon as possible? Thanks

  10. Jason says:

    Actually, binary dependent variables follow a Bernoulli (binomial) distribution, not the Poisson; in logistic regression, the logit (or, in probit regression, the probit) is the link function that connects that distribution to the predictors. The Poisson distribution describes counts, and it approximates normality only as lambda, the mean rate of occurrence of the phenomenon, grows large. Modeling a binary dependent variable with Poisson regression would impose unacceptable constraints, among them tying the variance to the mean in a way that logistic regression does not assume.
