# Control Groups and Selection Bias

A couple of weeks back we discussed standard deviations, t-tests, and z-scores in primer fashion to help clear up any confusion stemming from the usage of certain terms in our articles. I got a very nice reaction from readers asking for more, so I’m going to continue this “series” at least twice a month. Today, the topics at hand are control groups and selection bias.
A control group is the group in an experiment or analysis that does not receive whatever is being tested, be it a drug or, in baseball, participation in some event. It gives us a baseline for deciding whether the thing being tested actually causes the observed effect. Recall my post on whether or not home run derby participants improve or decline following their participation: after comparing the actual second half numbers to the projected second half, we found that the actual statistics, save for HR/AB, exceeded the projections for those who took part. This would lead many to assume that derby participants lost a bit of power but ultimately performed better over the remainder of the season.
The problem is that if the study ends there, we have no way of knowing whether the derby itself is the cause. We know what happened to the participants, but without a control group we cannot say for certain why it happened. To really answer that question we would need a control group: a group of players with similar HR/AB rates who didn’t participate in the derby. This way we could see whether the increase in BA/OBP/SLG and decrease in HR/AB has nothing to do with the derby (if both groups behave the same way) or whether the derby truly is the source (if the control group posted different or opposite results).
Control groups are usually thought of as placebo groups because, in medical tests, they are given a placebo instead of the drug; this way those conducting the experiment can see whether the drug itself is actually having an effect. If you are conducting an analysis that attempts to pin down the reason for a change in the data, like the home run derby example, you need a control group to gauge whether that change is directly related to your drug, skill, or whatever else is being tested. Without the control group, we don’t know what the test results mean because we only have one side of the story.
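
To make that comparison concrete, here is a minimal Python sketch of the idea. The function names (`mean_gap`, `effect_vs_control`) and every slugging number below are made up for illustration; none of it is data from the derby study.

```python
from statistics import mean


def mean_gap(actual, projected):
    """Average of (actual - projected) across a group of players."""
    return mean(a - p for a, p in zip(actual, projected))


def effect_vs_control(test_actual, test_projected, ctrl_actual, ctrl_projected):
    """How much more the tested group beat its projections than the control did."""
    return mean_gap(test_actual, test_projected) - mean_gap(ctrl_actual, ctrl_projected)


# Hypothetical second half slugging percentages (invented numbers).
derby_actual = [0.520, 0.495, 0.540, 0.510]
derby_projected = [0.500, 0.490, 0.515, 0.500]

# Hypothetical control group: similar power hitters who skipped the derby.
control_actual = [0.505, 0.480, 0.530, 0.500]
control_projected = [0.490, 0.470, 0.510, 0.485]

derby_gap = mean_gap(derby_actual, derby_projected)
control_gap = mean_gap(control_actual, control_projected)
effect = effect_vs_control(
    derby_actual, derby_projected, control_actual, control_projected
)

print(f"derby gap:   {derby_gap:+.3f}")
print(f"control gap: {control_gap:+.3f}")
print(f"difference:  {effect:+.3f}")
```

In this invented example both groups beat their projections by about the same amount, so the difference between them is roughly zero and the derby would get no credit for the improvement. Looking at the derby group alone, we would never have known that.
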
Now, a selection bias deals more with how the sample of data to analyze gets selected. If your study suffers from a selection bias, it means that the results may not be meaningful because the data assembled was “tainted.” Returning to the derby study, the initial attempt of many, including myself, involved a straight up comparison of first half vs. second half numbers. This was incorrect because a selection bias was present.
See, I was setting out to test whether players who were in the derby got worse in the second half, but what I wasn’t taking into account was that those in the derby or All-Star Game were likely overperforming their true talent level in the first half; a hot first half is a big part of how they got picked. Naturally, they would be due for a second half regression. Is this the case for all players? No, but it could seriously taint our results. In that regard, the group of players I selected was the right one to study, but their first half numbers were a biased baseline; to fix it, the test needed to compare actual second half numbers to projected second half numbers based on the three previous years as well as the first half.
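
A quick simulation shows why a straight first half vs. second half comparison is rigged from the start. This is only a sketch under made-up assumptions: the talent and noise spreads, the player counts, and the `simulate` function are all hypothetical, and there is no derby effect anywhere in the code.

```python
import random
from statistics import mean


def simulate(n_players=2000, top_n=50, noise=0.030, seed=1):
    """Pick the hottest first half hitters and see what their second half looks like."""
    rng = random.Random(seed)
    # Each player's "true talent" slugging percentage (invented distribution).
    talent = [rng.gauss(0.450, 0.040) for _ in range(n_players)]
    # Each observed half is true talent plus independent luck.
    first = [t + rng.gauss(0, noise) for t in talent]
    second = [t + rng.gauss(0, noise) for t in talent]
    # "All-Stars": the players with the best first half numbers.
    picked = sorted(range(n_players), key=lambda i: first[i], reverse=True)[:top_n]
    return {
        "first": mean(first[i] for i in picked),
        "second": mean(second[i] for i in picked),
        "talent": mean(talent[i] for i in picked),
    }


result = simulate()
print(f"first half:  {result['first']:.3f}")
print(f"second half: {result['second']:.3f}")
print(f"true talent: {result['talent']:.3f}")
```

The picked players decline in the second half even though nothing happened to them at the break; they were selected partly for good luck, and the luck does not carry over. Comparing the second half to a projection, as described above, removes that built-in drop.
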
Once that step was taken, the selection bias lived no more. These biases can also stem from the actual sample chosen. For instance, if you wanted to measure how popular Kenny Chesney was in Philadelphia, and chose to run your test at a concert of another country music singer, your results would suffer from a selection bias. You wouldn’t be testing a fair, random representation of the population. Instead, you would be testing his popularity amongst those already predisposed or partial to the musical genre. The results from this crowd would likely differ immensely from those of a fairer, random sample of the city.
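
The concert example can be sketched the same way. Everything here is an assumption for illustration: the share of country fans in the city, the approval rates for fans and non-fans, and the helper names `build_city` and `approval` are all invented.

```python
import random


def build_city(n=100_000, fan_share=0.15, seed=7):
    """Build a made-up city of (is_country_fan, likes_artist) pairs."""
    rng = random.Random(seed)
    city = []
    for _ in range(n):
        country_fan = rng.random() < fan_share
        # Assumed approval rates: 70% among country fans, 10% among everyone else.
        likes_artist = rng.random() < (0.70 if country_fan else 0.10)
        city.append((country_fan, likes_artist))
    return city


def approval(people):
    """Share of the given people who like the artist."""
    return sum(likes for _, likes in people) / len(people)


city = build_city()
rng = random.Random(11)

# Fair approach: 1,000 residents drawn at random from the whole city.
random_sample = rng.sample(city, 1000)
# Biased approach: 1,000 people polled at a country concert (all country fans).
concert_crowd = rng.sample([p for p in city if p[0]], 1000)

print(f"whole city:    {approval(city):.2f}")
print(f"random sample: {approval(random_sample):.2f}")
print(f"concert crowd: {approval(concert_crowd):.2f}")
```

The random sample lands close to the citywide figure, while the concert crowd overstates it badly. Same question, same city, and the only thing that changed was where the sample came from.
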
In that sense, selection biases can sometimes take the shape of a convenience factor, wherein those conducting the study run it on whichever population is easiest to reach, or on one likely to produce the output they desire. Either way, their results will not be meaningful. A selection bias can surface in numerous forms, but you should walk away from this post understanding that it involves a tainting of the results due to an incorrect or biased assembly of the sample data.
In summation, control groups help us determine whether the change we see in the sample being tested is directly linked to the thing being tested, or whether the results are no different from those of any other group in a similar population. A selection bias keeps those results from being meaningful because the data being analyzed was not assembled in a way that properly represents the population we are examining. It would be like comparing those in the home run derby to a control group of players with the lowest HR/AB rates in the league instead of a comparable group. These are two facets of statistics that need to be considered before starting a study that seeks to pinpoint the reasons for a change in the data. If they are not taken into account, you may spend numerous hours assembling the wrong data, or finish the study only to realize you’re halfway there and now need a control.