Stats 201: Binary logistic regression (or why your team didn’t make the playoffs)
April 29, 2007
I suppose that it’s the mark of a true nerd that I actually have a couple of “favorite” statistical techniques. (Nerd pride!) My training is as a psychologist and I spend a lot of my time at my “real” job studying human behavior, so a lot of the questions that I ask when I am playing around with my Retrosheet files go something like “Why did he do that?” Why did the third base coach send the runner there? Why did the voters go for Colon over Santana in 2005?
Baseball is full of a thousand little decisions, made on the field and off. Players don’t always make the same decision each time, but there do seem to be some pretty stable patterns that develop over time. The nature of baseball, though, is such that the decisions can usually be broken down into a simple form: yes or no? Should I or shouldn’t I? Either I swing or I don’t. (I may check my swing, but the umpire will determine, yes or no, whether it counts as a swing.) Now, many of these decisions are made thousands of times over the course of a season (e.g., swinging or not), so we usually have plenty of data points on which to base our research.
The problem is that when looking at questions like “What factors influence X?” we usually reach for a multiple regression technique. What factors influence HR rates? Throw in a bunch of predictors and see what shakes out! The trouble is that an outcome that is a yes/no question doesn’t quite work like that. We need a statistical test for just this type of occasion. Thankfully, we have one: binary logit regression.
Hardcore math alert ahead.
Let’s take the simplest yes/no question in baseball. Did my favorite team win the game tonight? If baseball games were decided by coin flips, then I would expect that the answer would be yes 50% of the time and no 50% of the time. But, of course, baseball games are not usually 50/50 propositions. (Don’t believe me?: The Twins are playing the Royals. It’s Johan Santana against whomever the Royals just called up from AAA. Whom are you picking?) The circumstances affect the odds. But at the end of the game, the Twins will either get 100% of a win or 0% of a win.
Statistically, this is a problem. Most dependent variables that we use in regressions are continuous in nature (they increase incrementally, like HR increase by one every time you hit the ball over the wall), and can be assumed to follow what is called in statistics a normal distribution. (I should note that oftentimes this isn’t really the case, but we go ahead with it anyway.) Yes/no questions (in statistics, we call these dichotomous questions) follow a different distribution, the binomial distribution. Fortunately, the binomial distribution can be turned into something close enough to a normal distribution through a few math tricks. First, take the probability (p) of an event. Let’s say you think the Twins have an 80% chance of winning that game. Now, we need to turn that into what are known as the odds. The formula is easy: p / (1 – p). So in this case, the odds for 80% are .8 / .2, which equals 4.0. If anyone has ever given you 4:1 odds, that’s what they meant. They believe that the chances of the Twins winning are 4 times greater than those of the Royals. (In an actual regression, we’d take the observed percentage of the time that the Twins won the game, not a belief.) Next, we take the natural log (logarithm with a base of e) of the odds. Now you’ve got a quasi-normal distribution, which is good enough for government work. This method is also actually the proper way to work with rate variables, such as strikeouts per plate appearance, in a regression… not that anyone actually does that.
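The probability-to-log-odds trick above is simple enough to sketch in a couple of lines of Python (the function name is mine, not from any particular stats package):

```python
import math

def log_odds(p):
    """Convert a probability into log-odds (the 'logit'):
    first p / (1 - p), then the natural log of that."""
    odds = p / (1 - p)      # e.g., 0.8 -> 4.0, the familiar "4:1"
    return math.log(odds)   # natural log of the odds

# The Twins example: an 80% chance of winning
# odds = 0.8 / 0.2 = 4.0, and log_odds(0.8) = ln(4) ≈ 1.386
```

Note that a 50% probability gives odds of 1.0 and log-odds of exactly zero, which is why the transformed scale is centered the way a normal distribution likes.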
If you have a program that already does binary logit, it will do all of this for you. You can then input your predictors into the regression the same way you would in any other (perhaps I need to write a more basic tutorial on regression first?) and see what shakes out. Like any regression, it takes the data and tries to fit a function that best describes what’s observed. Let’s take an example. In this article by my predecessor on this blog, David Gassko takes a look (in part) at how probable it is, given a certain number of wins, that a team will make the playoffs. (He even uses binary logit!)
Take a look at the second graph in that article. It shows the basic nature of a binary logit function. Those of you who have familiarity with regression are likely used to simple linear models in which one extra unit of the independent variable leads to a specific increase in the dependent variable (the regression coefficient). In this case, we see that one extra unit of the independent variable (one win) might have a small or a huge effect on the chances of making the playoffs, depending on where in the distribution it happens. A team with one win won’t make the playoffs, nor will a team with two wins, and so on until about 80 wins (*cough* ’06 Cardinals *cough*). A team’s 107th win is just overkill. You’ve clinched, guys. Ease up. The 90th win, on the other hand, seems to add about 12 percentage points to the probability of making the playoffs over the 89th win.
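To make that S-curve shape concrete, here is a toy logistic curve in Python. The intercept (-45) and slope (0.5) are numbers I made up purely for illustration; they are not David’s actual coefficients:

```python
import math

def logistic(x):
    """The logistic function: e^x / (1 + e^x)."""
    return math.exp(x) / (1 + math.exp(x))

def p_playoffs(wins):
    """Toy wins-to-playoffs curve with invented coefficients,
    just to show how the marginal value of a win changes."""
    return logistic(-45 + 0.5 * wins)

# Near the middle of the curve, one win moves the needle a lot:
# p_playoffs(90) - p_playoffs(89) is about 0.12 (12 percentage points).
# Out on the flat part, it's overkill:
# p_playoffs(107) - p_playoffs(106) is a tiny fraction of a point.
```

The lesson of the curve: the same one-unit change in the predictor buys very different amounts of probability depending on where you start.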
Let’s run a binary logit of our own. I took all teams from 1995-2005 and coded for whether or not they made the playoffs (a yes/no question). Then, I input number of runs scored and runs allowed into the equation as predictors. Both factors (as you might expect) were significant predictors in the equation. The final equation looks something like this:
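Any stats package (SPSS, R, and friends) will do the actual fitting for you, usually with a fast algorithm like Newton-Raphson. For the curious, though, here is a bare-bones sketch of what “fitting” means: climbing the log-likelihood by gradient ascent. To keep it short I use one invented predictor (a made-up team-quality score) rather than the two real ones:

```python
import math

def fit_logit(xs, ys, lr=0.05, steps=5000):
    """Fit p(y=1) = 1 / (1 + e^-(b0 + b1*x)) by gradient ascent on
    the log-likelihood. A toy stand-in for what real software does
    with faster methods; the model being fit is the same."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x      # w.r.t. intercept and slope
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Made-up data: x is a team-quality score, y is made-playoffs (1) or not (0)
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -0.25]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
b0, b1 = fit_logit(xs, ys)
# b1 comes out positive: better teams, better playoff odds
```

With two predictors (runs scored and runs allowed), the idea is identical; there is just one more coefficient to climb on.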
X = .00842 – .00299 * RA + .00291 * RS
This is NOT a formula for winning percentage or, in this case, playoff chances. This is the formula for an exponent. Solve the equation for whatever values of RA and RS you like, and call that number X. To find the probability of making the playoffs given those parameters, plug X into the equation e^X / (1 + e^X). That will tell you what chance that team has to make the playoffs. (Side note: you will also notice that runs allowed are weighted slightly more heavily than runs scored. Perhaps pitching really does win championships. Perhaps the good teams were winning a good number of home games where they only came to bat eight times, not needing the bottom of the ninth.) Last year, my beloved Indians scored 870 runs and allowed 782. Using the formula, we take .00842 – .00299 * 782 + .00291 * 870, which equals 0.20194. Plugging that into the equation, e^0.20194 / (1 + e^0.20194), this equation says that the Indians had a .5503 chance (55.03%) of making the playoffs last year, given their runs scored and runs allowed.
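The two-step recipe, exponent first and then the e^X / (1 + e^X) conversion, looks like this in Python (using the coefficients from the regression above; the function name is mine):

```python
import math

# Coefficients from the regression above: intercept, runs allowed, runs scored
B0, B_RA, B_RS = 0.00842, -0.00299, 0.00291

def playoff_prob(rs, ra):
    """Step 1: solve the fitted equation for the exponent X.
    Step 2: convert X to a probability via e^X / (1 + e^X)."""
    x = B0 + B_RA * ra + B_RS * rs
    return math.exp(x) / (1 + math.exp(x))

# The Indians example: 870 scored, 782 allowed -> X ≈ 0.20194, p ≈ 0.55
```

The second step is just the log-odds trick from earlier run in reverse: the regression works on the log-odds scale, and this maps the answer back to a probability.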
As with any regression equation, a good researcher should post his R-squared value. The problem is that there isn’t a real R-squared here, because R-squared is usually based on Pearson’s model of bivariate (and eventually multiple) correlation, which assumes bivariate normality. (If you have no idea what that means, don’t worry.) To get around this, statisticians most often report the Nagelkerke R-squared, which really isn’t a true R-squared value, but can be roughly read the same way. R-squared tells you how much of the variance in the outcome variable is explained by your predictors. In this case, the Nagelkerke R-squared was .655, or 65.5%. So, 65.5% of the variance in the odds that a team will reach the playoffs can be explained by the two factors, runs scored and runs allowed. So, of all the things that could push a team’s chances of making the playoffs up or down, I’ve figured out 65.5% of the recipe! (David’s example using wins has an R-squared of 74.3%. It is, after all, wins, not runs, that get you into the playoffs.)
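For the curious, the Nagelkerke value can be computed from two numbers any logit program reports: the log-likelihood of the intercept-only (null) model and of the fitted model. It is Cox and Snell’s R-squared divided by the maximum value that statistic can attain, which is what rescales it to top out at 1. A sketch:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's R-squared from the log-likelihoods of the
    intercept-only (null) and fitted models, with n observations.
    Cox & Snell: 1 - exp(2*(ll_null - ll_model)/n), divided by
    its ceiling, 1 - exp(2*ll_null/n)."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    ceiling = 1 - math.exp(2 * ll_null / n)
    return cox_snell / ceiling
```

A model no better than the null comes out at 0, and a model that fits the data perfectly (likelihood of 1, log-likelihood of 0) comes out at exactly 1, which is the whole point of the rescaling.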
I like to use binary logistic regression to study the decision-making process in baseball (everything from the decision to pick one player over another to waving the runner home as he rounds third base), and it can be used to study anything where the outcome is either one thing or the other, never both. In baseball, you’re either safe or out. Now, we can study why.