Fixing a biased sample

In previous posts, we have discussed how to determine the size of the sample you need, what you should consider that might complicate your sample size decision, and how to take a simple random sample in a clinical setting. But even the best plans in data-collection-land can go wrong. What do you do then? What do you do when your sample turns out not to be representative of the population you want to study?

Extend and focus data collection

The first option is to go back into the field and collect data that will balance your sample. For example, we once conducted a study of parents of hospital inpatients. It turned out that parents of children under age nine were far more likely to participate in the study than parents of adolescents (they spend more hours at the hospital each day than parents of older children and were easier to track down). So we extended the data collection period for parents of adolescents to make sure that the parents in our sample represented our full parent population. This is the best option, but it is not always possible because of budget or the implementation phase of your quality improvement project or study.

Weight the data

You cannot turn a biased sample into an unbiased one, but you can weight your sample to make sure you are comparing apples to apples. Weighting a sample is a common procedure performed by sampling experts to refine random samples and minimize bias.

For example, if 40% of your sample is insured by Medicaid and 50% of your population is insured by Medicaid, you can mathematically weight the Medicaid-insured patients more heavily in your statistical analysis, so that when you present findings, the sample acts as a representative one (that is, 50% of the sample will appear to be insured by Medicaid). Each group's weight is its share of the population divided by its share of the sample. In this case, each Medicaid patient would appear to the statistical software to be 1.25 of a person (50% ÷ 40%) and the commercially insured participants would each be treated as 0.833 of a person (50% ÷ 60%) (see box).
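The arithmetic behind those two weights can be sketched in a few lines of Python. This is only an illustration: the function name and the dictionary layout are my own, and it assumes, as the example does, that everyone who is not on Medicaid is commercially insured.

```python
def poststratification_weights(sample_share, population_share):
    """Return one weight per group: population share divided by sample share."""
    weights = {}
    for group, pop_share in population_share.items():
        samp_share = sample_share[group]
        if samp_share <= 0:
            # A group with no sample cases cannot be weighted up at all.
            raise ValueError(f"no sample cases in group {group!r}")
        weights[group] = pop_share / samp_share
    return weights


# Shares from the example: 40% of the sample but 50% of the population on Medicaid.
sample_share = {"medicaid": 0.40, "commercial": 0.60}
population_share = {"medicaid": 0.50, "commercial": 0.50}

weights = poststratification_weights(sample_share, population_share)
print({group: round(w, 3) for group, w in weights.items()})
# {'medicaid': 1.25, 'commercial': 0.833}
```

Multiplying each group's sample share by its weight gives back the population share (0.40 × 1.25 = 0.50), which is a quick way to check that the weights were built correctly.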

If you need to correct for more than one factor (like insurance type and race), the calculation becomes more complicated and you should consult a statistician.

SPSS and SAS both have commands that apply the weights you have created when statistical procedures are conducted (Excel does not). After you have created the weights, you simply turn the weight ‘on’ in the program to activate the weighting process.
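To give a feel for what happens once the weight is turned on, here is a rough Python sketch of a weighted average. The ten patients and their satisfaction scores are invented purely for illustration, and the weights are the population share divided by the sample share for a sample that is 40% Medicaid drawn from a population that is 50% Medicaid. Real statistical packages also adjust variances and significance tests, which this sketch ignores.

```python
def weighted_mean(values, weights):
    """Average of values in which each case counts as `weight` people."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical sample: 4 Medicaid patients scoring 6, and 6 commercially
# insured patients scoring 8 (made-up numbers, not study data).
records = [("medicaid", 6)] * 4 + [("commercial", 8)] * 6

# Weight = population share / sample share for each insurance group.
group_weight = {"medicaid": 0.50 / 0.40, "commercial": 0.50 / 0.60}

scores = [score for _, score in records]
case_weights = [group_weight[group] for group, _ in records]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, case_weights)
print(round(unweighted, 2), round(weighted, 2))
# 7.2 7.0
```

In this made-up case the unweighted average leans toward the over-represented commercial group, and the weighted average pulls it back to what a 50/50 sample would have shown.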

This method of refining a random sample is only acceptable if the differences between the sample and the population are not very great and if there are enough cases in the under-represented group to truly represent that group. For example, if only 5% of the sample is Medicaid-insured and 50% of the population has Medicaid insurance, it is very unlikely that you can responsibly reweight the data. In a sample of 100, 5% would only be five cases. Five cases cannot possibly represent all the complexity of 50 cases.
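Running the numbers for that 5% case shows how much work each patient would be asked to do. A quick sketch, with variable names of my own choosing:

```python
sample_size = 100
sample_share = 0.05      # Medicaid share of the sample
population_share = 0.50  # Medicaid share of the population

# Number of Medicaid cases actually in hand.
medicaid_cases = round(sample_size * sample_share)

# Weight each of those cases would need: population share / sample share.
medicaid_weight = population_share / sample_share

print(medicaid_cases, round(medicaid_weight, 1))
# 5 10.0
```

Each of the five Medicaid patients would have to stand in for ten people, so one unusual patient could swing the results. Where exactly the line falls between a tolerable weight and an unacceptable one is a judgment call for a statistician, not something this sketch decides.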

If you are not a statistician or a sampling expert, I recommend consulting one whenever you decide to weight data.  Misusing weighting procedures can lead to very misleading results, and not using them when they are appropriate can also lead to misleading results.

Use nonrepresentative elements of your sample as controls

‘Control for’ everything in your sample that is not representative of the population.  This is a far more straightforward process with multivariate analysis than univariate or bivariate analyses (where I would recommend weighting the data if possible).

In a multivariate procedure, you have one outcome (or dependent variable) you are trying to understand. If you know you have unbalanced groups (in terms of the proportions you would see in the population), you can add these as independent variables and ensure that the statistical procedure ‘corrects’ for them (or ‘controls’ for them).

For example, in the study I mentioned of hospital parents, it turned out that when we collected data at the second time point, we had far fewer Hispanic families than at the first time point. If we had done nothing, and we showed a change from time point 1 to time point 2, we might have wondered if the change we saw was related to the different ethnic compositions of the two groups. We handled that by adding Hispanic (yes/no) as a control variable in our regression model. That forced the model to compare like to like between the two time points, essentially removing any ethnic group-related bias that was present.
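Here is a small Python sketch of that idea, using ordinary least squares written out by hand so that it runs without any statistical package. Everything in it is hypothetical: the data are made up and noise-free (not our study data), the outcome is an invented score, and the helper names are mine. It is meant only to show how a yes/no control variable changes the estimated difference between time points.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot fit the model")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, outcome):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data: score = 70 + 2 * time + 10 * hispanic, with no noise.
# Time point 1 (time=0) has 5 Hispanic and 5 non-Hispanic families;
# time point 2 (time=1) has only 1 Hispanic and 9 non-Hispanic families.
data = []
for time, n_hispanic, n_other in [(0, 5, 5), (1, 1, 9)]:
    for hispanic, count in [(1, n_hispanic), (0, n_other)]:
        for _ in range(count):
            data.append((time, hispanic, 70 + 2 * time + 10 * hispanic))

# Naive comparison: difference in raw means between the two time points.
mean_t1 = sum(y for t, _, y in data if t == 0) / 10
mean_t2 = sum(y for t, _, y in data if t == 1) / 10
naive_change = mean_t2 - mean_t1

# Adjusted comparison: regress the score on time AND the Hispanic indicator.
design = [(1, t, h) for t, h, _ in data]  # leading 1 is the intercept
intercept, time_effect, hispanic_effect = ols(design, [y for _, _, y in data])

print(round(naive_change, 2), round(time_effect, 2))
# -2.0 2.0
```

In this toy example the raw means suggest scores fell by 2 points, but only because the second group contained fewer Hispanic families. Once the indicator is in the model, the estimated change is the 2-point improvement that was built into the data.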

Exclude the non-representative group from your study findings

This is a radical and depressing solution, but often the only responsible one. If a portion of your sample does not represent the portion of the population you need it to, and you cannot responsibly apply any of the fixes offered here, you need to drop that portion of the sample from your study results. You would then need to clarify that the sample does not represent the full population when you report the results.

This happens a lot with clinical data, especially as health care providers attempt to tackle more and more social determinants of health. Let’s take a highly socially complex population, like low-income children, in a study of medication compliance over the period of one year. Homeless families are a subset of this population and are very challenging to enroll in studies and especially to follow up with after enrollment. If you were only successful at enrolling a handful of homeless families, you would likely need to limit your reported results to ‘families below the poverty level with reliable housing.’ You may be able to report the results on the homeless sample in a qualitative study, but if you do not feel you can represent ‘all’ of the homeless population you care about, you need to keep them out of the quantitative analysis.

Experienced data collectors put a lot of thought (and budget) into hard-to-reach subpopulations. There is a huge payoff because the hardest-to-enroll patients often stand to benefit the most from health care improvements.

If push comes to shove, it may be advantageous to seek just a handful of members of the hard-to-reach population and design a substudy using qualitative methods to represent the group. It really depends on what you want to learn about them, but using two sets of methods – one for the easy to enroll and another for the hard to enroll – is certainly a tried-and-true approach.

In sum

Please always check that the random sample you pulled is truly representative of the population. If it is not, the findings you report cannot be generalized to the population that is the target of your study. Once you check whether the sample is representative, if you find it is not, there are some fixes that can save the day.
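One common way to run that check on a single characteristic is a chi-square goodness-of-fit test, which compares the counts in your sample with the counts you would expect from the population shares. The sketch below is my own illustration rather than a prescribed procedure. It assumes a hypothetical sample of 100 patients, 40 of them on Medicaid, drawn from a population that is 50% Medicaid, and it compares the result with 3.841, the standard critical value for one degree of freedom at the 5% level.

```python
def chi_square_goodness_of_fit(observed_counts, population_share):
    """Sum of (observed - expected)^2 / expected over all groups."""
    n = sum(observed_counts.values())
    statistic = 0.0
    for group, share in population_share.items():
        expected = n * share
        statistic += (observed_counts.get(group, 0) - expected) ** 2 / expected
    return statistic


# Hypothetical sample of 100 patients (made-up counts).
observed = {"medicaid": 40, "commercial": 60}
population_share = {"medicaid": 0.50, "commercial": 0.50}

statistic = chi_square_goodness_of_fit(observed, population_share)

# Critical value for 1 degree of freedom (two groups) at alpha = 0.05.
CRITICAL_VALUE_DF1 = 3.841
differs_from_population = statistic > CRITICAL_VALUE_DF1

print(round(statistic, 3), differs_from_population)
# 4.0 True
```

A result above the critical value suggests the sample's mix differs from the population by more than chance alone would explain, which is the signal to consider one of the fixes above.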

