Stats 101: The very basics of statistical analyses
April 25, 2007
One of my “real jobs” is teaching college classes in statistics and research methods at a large university somewhere in the Midwest. If you’ve ever wondered why I use a pseudonym, there’s one reason. Like a lot of Sabermetricians, I got into “the business” because I love baseball and my day job required me to know a lot about statistics. Then again, I probably owe a great deal of my familiarity and comfort with numbers and probabilities (the same thing that allows me to function in this job) to reading box scores when I was six.
The problem that a lot of would-be Sabermetricians cite as the reason they feel they can't contribute their own work, or have a hard time reading others', is a lack of comfort with the statistical techniques involved. Let's be honest: you avoided taking stats in school if you could. If you had to take it, you held your breath until it was over, were thankful for the C+ you got, and then promptly forgot everything. I know; I teach the class. But we are baseball researchers, and, as I often tell my students, statistics is the language of research.
Over the next few weeks, I'll be posting a series of pieces on some basic and not-so-basic statistical techniques useful in Sabermetrics, especially the ones for which I have a special affinity. This first post is something of a baseline for beginners, aimed at those who need a refresher on the very basics or who have no formal training in statistics. If that describes you, start here. If you're already familiar with the difference between descriptive and inferential statistics, standard deviations, and correlations, you can get off the boat here and not miss anything.
Statistics is a game of probability. Statisticians are in the business not of fortune-telling, but of determining the likelihood of certain events happening given certain conditions. The most likely outcome isn't always the one that actually happens in a given situation, but it is the one that happens most often over time under similar conditions. If I told you that I had placed a bowl behind a screen containing 70 red marbles and 30 blue marbles, and that I had picked one marble at random (without looking) out of the bowl, you would tell me that the most likely outcome is that I pick a red marble. Is it possible that I'll pick a blue marble? Of course; in fact, I have a 30% chance of doing so. Red is the most likely outcome and will happen most often if I repeat the process, but blue will happen some of the time.
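The marble example is easy to check with a short simulation. Here's a quick Python sketch (my own illustration, not part of the original example) that sets up the 70/30 bowl and draws from it many times:

```python
import random

# The hypothetical bowl from the example: 70 red marbles, 30 blue.
bowl = ["red"] * 70 + ["blue"] * 30

random.seed(42)  # fixed seed so the sketch is reproducible

# Draw one marble at random, many times over, and tally the results.
draws = 100_000
reds = sum(1 for _ in range(draws) if random.choice(bowl) == "red")

# Any single draw can come up blue, but over many repetitions
# the red fraction settles near 70%.
red_fraction = reds / draws
print(round(red_fraction, 2))
```

Each individual draw is unpredictable; only the long-run proportion is.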
If there is one thing about Sabermetrics that you must understand, it is this concept: if I say that the Red Sox have a 70% chance of winning this game against the Blue Jays, yet the Blue Jays win, it does not necessarily mean that my methods are flawed. It simply means that a lower-probability event happened. (Even Rafael Belliard occasionally hit a home run!) One of the primary objections to Sabermetrics is that we often say things such as "Johnson will have a breakout year," and he doesn't; "Stevens has the best throwing arm of any catcher in the league," just before a runner steals second off him; and "The Tigers will likely finish in the middle of the AL Central," and they end up winning the pennant.
The disconnect between Sabermetricians and fans is that fans place a lot of meaning on specific events (such as your favorite team winning the World Series), and all that matters to them is whether that event happens right then and there. I can run a computer simulation of the 2006 season a million times and tell you that (your team here) had a (insert number here) chance of winning the World Series, but all that matters is that the one marble picked out of the bowl last year belonged to the St. Louis Cardinals. We are not fortune-tellers, merely researchers. But, more often than not, we're right. I know... you don't want to hear that right after your team, which we said was the favorite, just lost the series.
There are generally two branches of statistics: descriptive and inferential. Descriptive statistics are all over baseball. To say that Ryan Howard hit 58 HR last year is a descriptive statistic (specifically, a raw event frequency). To say that his batting average was .313 is also descriptive. We are simply measuring certain properties of Howard's performance last year. Descriptive isn't a synonym for simple, mind you. More complex formulae like Runs Created are also descriptive statistics, by virtue of the fact that they count how often a player produced each outcome and weight those outcomes accordingly. I could also say that the average major league hitter batted .269 last year, which is still descriptive: even though I've moved away from talking about just one player, I'm simply describing the whole sample of Major League Baseball players from 2006.
Inferential statistics, in contrast, test whether two measurements are related to each other across a set of observations, and whether the probability of that relationship arising by chance is low. When you hear that something is "significantly different," or that there is a "significant relationship" between two things (and not in the sense that Kevin Federline and Britney Spears once had a significant relationship), or when you want to say that one thing caused another, you have entered the realm of inferential statistics.
A quick, intuitive example of inferential statistics: let's say that I'm standing in front of you holding a United States quarter, and I swear that it's neither a trick two-headed coin nor a weighted coin. The actual odds of heads coming up are 50%, and the same for tails. You, being the skeptical sort, politely request that I prove it. I flip the coin 100 times, and it comes up 51 heads and 49 tails. Do you still believe me that the coin is "fair"? You probably do, because you intuitively realize that while it's not exactly 50/50, it's close enough. But when do you stop believing me? 55/45? 57/43? 60/40? 75/25? Surely there is a point at which you do stop believing me, but why there? After all, it is possible for a coin to land heads 90 times out of 100 with no trickery at all.
Well, we can calculate the probability that, given a fair coin (that is, a "true" heads/tails split of 50/50), such a discrepancy would show up. It's generally accepted that if that chance drops below 5%, we say that the effect must be due to something other than chance. In this situation, you are saying to me that the coin's actual tendencies for heads and tails are significantly different from 50/50.
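That probability can be worked out exactly. Here's a minimal Python sketch (the function names are mine, not a standard API) that computes the chance of a fair coin producing a split at least as lopsided as the one observed:

```python
from math import comb

def binom_pmf(n, k, p=0.5):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p(n, k, p=0.5):
    """Chance of a split at least as far from n*p as k heads, under a fair coin."""
    dev = abs(k - n * p)
    return sum(binom_pmf(n, i, p) for i in range(n + 1) if abs(i - n * p) >= dev)

# How surprising are various splits out of 100 flips?
for heads in (51, 55, 60, 65):
    print(heads, round(two_sided_p(100, heads), 4))
```

A 51/49 split is entirely unremarkable (probability near 1 of seeing something at least that lopsided), while by 60/40 the probability is down around the 5% threshold and by 65/35 it is well below it, which is roughly where you'd stop believing my "fair coin" story.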
In the remainder of this post, I’ll cover two commonly used methods, one descriptive (standard deviation) and one inferential (correlation). I don’t have the space to teach how to calculate these things, either by hand or computer, but those resources are out there. The more important thing is that you understand the idea.
Standard deviation is a measure of consistency (or, on the other side of the coin, variability) within a set of numbers. Any set of numbers has an average (in statistics, we call this the "mean"), but it also has a standard deviation. I like to tell my students that if you report the mean, you need to report the standard deviation. To use a baseball example, let's say a player gets exactly one hit in every game he plays, for a total of 162 hits over the season. His average hits per game is 1, and his standard deviation (SD) is 0. He is remarkably consistent over time. The lower the SD, the more consistent the performance and the closer together the numbers in the list are: his game-by-game hit totals would look like 1, 1, 1, 1, 1, 1, 1, 1. On the flip side, if a player has a pattern of getting 4 hits in some of his games but 0-fers in the rest, then his performance is highly variable, and his SD will be high. In case you were wondering, last year's most consistent hitters in hits per game (lowest SD) were Andy Green, Ricky Ledee, Orlando Palmeiro, Rob Bowen, and Doug Mirabelli. The most variable were Ryan Theriot, Jose Reyes, Ichiro, Freddy Sanchez, and Mike Sweeney.
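Python's standard library will do this arithmetic for you. The sketch below contrasts the two hypothetical hitters described above; the streaky hitter's game log is invented purely for illustration:

```python
from statistics import mean, pstdev

# Perfectly consistent hitter: exactly one hit in each of 162 games.
steady = [1] * 162
print(mean(steady), pstdev(steady))  # mean of 1, SD of 0

# Streaky hitter with similar total production: 4-hit games and 0-fers.
# (An invented game log, not any real player's.)
streaky = [4] * 40 + [0] * 122
print(round(mean(streaky), 2), round(pstdev(streaky), 2))
```

Both hitters average about one hit per game, but the streaky hitter's standard deviation is far higher, which is exactly the difference the mean alone hides.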
The neat thing about standard deviation is that it also allows for comparisons across time. For example, how amazing was Babe Ruth's 1919 season, in which he hit 29 HR? In 2006, 29 HR would have put him in a tie for 34th in the majors with Pat Burrell, Jeff Francoeur, and Garrett Atkins, with only half as many as Ryan Howard's 58. But home runs were much rarer then. In 1919, Ruth's 29 home runs were 28 over the league average (of 1 HR!), and the league-wide standard deviation was 3.55 HR. So Babe Ruth, in 1919, was roughly 7.9 standard deviations better than the average player in terms of HR hit. (This is called a z-score in statistical terms.)
In 2006, the average player hit 10 HR, and the standard deviation was 10.59; home run totals were much more spread out in 2006 than in 1919 (the standard deviation is higher). Howard hit 48 more home runs than average, which is 4.53 SD above the mean. In other words, Howard had a great year compared to everyone else in the league in 2006, but once you adjust for what everyone else was doing, Babe Ruth had an even better year in 1919, even though he hit half as many home runs.
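The z-score arithmetic is a one-liner. This sketch plugs in the rounded league figures quoted above (so the results reflect those rounded inputs, not unrounded league totals):

```python
def z_score(value, league_mean, league_sd):
    """How many standard deviations a value sits above the league average."""
    return (value - league_mean) / league_sd

# Rounded league figures from the text:
# 1919: mean 1 HR, SD 3.55.  2006: mean 10 HR, SD 10.59.
ruth_1919 = z_score(29, 1, 3.55)
howard_2006 = z_score(58, 10, 10.59)
print(round(ruth_1919, 2), round(howard_2006, 2))
```

Ruth comes out around 7.9 standard deviations above his league's average; Howard, around 4.5 above his. Same formula, two very different run environments.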
Correlation is a technique that allows us to test whether two things are related to one another. (If you'd like to know how it's calculated, look here.) Correlation can't tell us whether one causes the other, just that they are related. Correlation values always fall between -1.0 and +1.0; if you see one outside this range, you're dealing with someone who doesn't know what he's doing. In a correlation, you want to look at two things separately: the sign and the number. The sign tells you whether the correlation is positive (as one thing goes up, the other does as well) or negative (as one goes up, the other goes down). The number tells you how strongly the two are correlated: something close to 1 is very strongly related (whether positive or negative), while something close to zero is very weakly related.
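For the curious, the correlation coefficient can be computed from scratch in a few lines of Python. The data here are invented purely to show the mechanics:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient; always falls between -1 and +1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of co-deviations on top; the product of each variable's spread below.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented data where y tends to rise with x: a strong positive correlation.
print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]), 2))  # 0.85
```

Flip the sign of one variable and the correlation flips to -0.85; the strength stays the same, only the direction changes.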
In this post on the blog, I talked about how you can use correlation to your advantage in fantasy baseball. Knowing what's related to what can help you find good players for your team, but Sabermetricians also use it (and misuse it) to study which skills in a player seem to "go together." There is a way to test whether a correlation is significant or not, but it's heavily influenced by sample size: even weak relationships can look very "significant" if you have 1,000 people in your sample.
Responsible Sabermetricians (and researchers of any kind) should provide their "R-squared value." R is the abbreviation for correlation, and R-squared is just R raised to the second power (r * r). It tells you what percentage of their variance the two variables share in common; if they are strongly related, that will be a lot. Most data sets have some variability in them, and the question is why there are differences among players on the measure. To ask it another way: why do some batters hit 50 HR while others hit none? What is it that makes these hitters different?
A correlation (r) of .70 has an R-squared value of .49, or 49%. Let's say that I found this correlation between a player's weight and his HR totals. (Note: I did not run this analysis, and it probably isn't true; I pulled these numbers completely out of the air.) It would tell me that I can explain about half of the differences in overall home run totals by players being heavier or lighter. The reason R-squared is so necessary is that correlations in the neighborhood of .10 can sometimes be statistically significant, yet an r of .10 explains only about 1% of the variability. That could be interesting to know, or it could be meaningless, depending on the variables involved.
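The r-to-R-squared conversion is trivial to verify, as this short Python check shows:

```python
def r_squared(r):
    """Share of variance two variables have in common, given their correlation r."""
    return r * r

# r = .70 explains about half the variance...
print(round(r_squared(0.70), 2))  # 0.49
# ...while a "significant" r = .10 explains only about 1% of it.
print(round(r_squared(0.10), 2))  # 0.01
```

Squaring shrinks small correlations dramatically, which is exactly why a "significant" r of .10 can still be practically meaningless.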