
Making sense of your data

Quite rarely, we are given a data set to analyze that tells its own story pretty readily. More typically, we need to figure out how to make the data tell the story hidden inside. If someone hands you a dataset and asks you to figure out what it has to say, where do you start?

My beginning point is always with the sense-checking that a previous post outlined. When you sense-check the data, your job is just to make sure the data ARE what you think they are. When you make sense of the data, your job is to figure out what the data have to tell you.

Shaking the story out of the data

Sense making with data always begins with a question. Whoever has asked you to analyze the data should have at least one question they need the answer to. If they do not, then you should wait until they do. Subject matter experts can take a lot for granted when they ask for data to be analyzed – mostly that an analyst understands their subject area as well as they do and knows the important questions. So analysts need to push back and get clear, quantifiable questions.

But subject matter experts are often wrong about how the data will answer their question. For example, if you are asked to look for a cohort of patients with a certain condition, you may end up with far fewer patients than the subject matter expert expects. This can be because the patients are miscoded, but it can also be that each patient needs many services and visits, so the subject matter expert perceives more patients than there really are. Local knowledge is very useful in pointing to important questions to ask, but it can often be misleading in pointing to answers. The analyst may be a little on their own after the questions have been drawn out of the requester.
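To make the visits-versus-patients point concrete, here is a minimal sketch in Python. The encounter records and the tuple layout are made up for illustration; a real extract would have its own fields.

```python
# Hypothetical encounter records as (patient_id, visit_date) pairs.
# All values are invented for illustration only.
encounters = [
    ("A", "2023-01-05"),
    ("A", "2023-01-19"),
    ("A", "2023-02-02"),
    ("B", "2023-01-07"),
    ("B", "2023-03-11"),
    ("C", "2023-02-14"),
]

# What a busy clinic "feels": the number of visits.
visit_count = len(encounters)

# What the cohort actually contains: the number of distinct patients.
patient_count = len({patient_id for patient_id, _ in encounters})

visits_per_patient = visit_count / patient_count

print("visits:", visit_count)
print("distinct patients:", patient_count)
print("visits per patient:", visits_per_patient)
```

Reporting both counts side by side is one way to explain to a requester why the cohort looks smaller than expected.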

Where to start?

My starting place usually focuses on identifying the key fields that can be used to answer the question. In addition to fields that might confirm the hypotheses from the requester, I am also interested in fields that might challenge (or disconfirm) their point of view. Good analysts always play a little bit of a devil’s advocate role to make sure they are not simply confirming whatever biases the subject matter experts may bring to their questions. Thinking out the fields to pull for the analysis is the first step, and I am usually pretty generous in my data pull (in a focused way).

Also, before touching any data, I take the time to work out how the data could answer the question. Sometimes, this means I need to think through proxies and composite metrics.  We rarely have exactly the measures requesters want because data collected in the medical record are not collected just to answer our question.  We need to be thoughtful about how we impute the measures we want from the ones we have.

Second, distributions in healthcare are often not ‘normal,’ which means that many assumptions we may want to make about associations and value changes will be fraught. Indeed, it is quite common to have what is called a skewed distribution: one large spike with a very long tail on one side and a short or no tail on the other. Often, your only option is to code these as binary (two-category) variables.

Consider for example a distribution that shows the length of stay for appendectomies. You will undoubtedly have a huge spike of admissions with one or two days in the hospital and then a small group with highly variable longer stays. Any kind of averaging of length of stay, of cost, of pharmacy orders will make very little sense. Solve this problem by recoding the variable into two groups: 1-2 days and more than 2 days. This allows you to begin the comparison process without the handicap of the outliers - but you also keep them in the analysis.
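A minimal sketch of that recoding in Python, using only the standard library. The lengths of stay are invented for illustration, and the helper name recode_los is hypothetical; the 2-day cut-off is the one used in the example above.

```python
from statistics import mean, median

# Hypothetical lengths of stay in days (illustrative values, not real data).
length_of_stay = [1, 1, 2, 1, 2, 1, 2, 1, 14, 30]

def recode_los(days):
    """Recode a length of stay into the two groups described above."""
    return "1-2 days" if days <= 2 else "more than 2 days"

# Collect the stays under each group label; no admission is dropped.
groups = {}
for days in length_of_stay:
    groups.setdefault(recode_los(days), []).append(days)

# The mean is pulled upward by the two long stays; the median is not.
print("mean:", mean(length_of_stay))
print("median:", median(length_of_stay))
for label, stays in groups.items():
    print(label, "n =", len(stays))
```

With these made-up values the mean is 5.5 days even though eight of the ten admissions lasted two days or less, which is why the two-group coding is easier to interpret than the average.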

There are many ways that grouping data like this can solve analytical problems. Coding continuous fields in up to five groups (depending on the data frequency distribution) can add structure and clarity to an analysis. And as you work with the data and learn more about what the data have to say, you may choose to change this structure midstream. For example, if you have broken the data into five groups based on the distribution, but two of the groups behave the same in your comparison over and over again, combine them. Simple is always better as long as the story is accurate.
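One way to sketch this in Python is to cut a continuous field at its quintiles and then fold two groups together. Quintile cut points are only one possible choice and are an assumption here, as are the cost values, the group labels, and the decision to merge groups 2 and 3.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical cost per admission (illustrative values only).
costs = [120, 150, 180, 200, 240, 300, 350, 420, 500,
         650, 900, 1500, 2500, 4000, 9000]

# Four cut points split the data into five roughly equal-sized groups.
cut_points = quantiles(costs, n=5)

def assign_group(value, cuts):
    """Return a group number from 1 to len(cuts) + 1."""
    return bisect_left(cuts, value) + 1

five_groups = [assign_group(cost, cut_points) for cost in costs]

# Suppose groups 2 and 3 keep behaving the same in comparisons:
# fold them into one group and carry on with four.
merge_map = {1: "low", 2: "middle", 3: "middle", 4: "high", 5: "very high"}
merged = [merge_map[group] for group in five_groups]

print("cut points:", cut_points)
print("five groups:", five_groups)
print("merged:", merged)
```

Keeping the merge in a small lookup table makes it easy to change the structure midstream without touching the original five-group coding.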

Before investing too much in grouping, make sure you know how the subject matter expert thinks outliers should be handled. For example, if you have been asked to describe the ‘typical’ care process for a condition, outliers can often be ignored. But if you are examining the impact of a certain group of conditions on the care team, outliers would need to be included.

Dealing with missing data

Assuming that your data are missing for a legitimate reason and not because there is a problem with the data pull, you have a number of options. Each option for working with missing data rests on a different set of assumptions.

Let’s take an example. You have a data set in which 35% of patients received a flu shot, 40% did not receive a flu shot because they told the doctor they had one already, and the remaining 25% of cases have no data in the flu shot field. Your question needs to be: should this last 25% of cases be included in the analysis at all?

Using the data you have, you might go about trying to find evidence that each of the missing values can be assigned to one of the following groups:

  • No service because it was not needed
  • No service because it was not ordered
  • No service because flu vaccine was not available
  • No service because the wrong provider got the reminder to give the shot

Some of these possibilities can certainly be checked with the data in any electronic health record. Each time you confirm one of them, you can replace the missing value for another group of cases with a code that records why the data are missing. This shrinks the size of your missing data problem, sometimes considerably, because you know why the data are missing and you can consider the case to be ‘legitimately’ missing data.
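Here is a rough sketch of that process in Python. Every field name (flu_shot, ordered, in_stock), every record, and the order of the checks is an assumption made for illustration; a real record system will expose different evidence.

```python
from collections import Counter

# Hypothetical patient records. None means the flu shot field is empty.
patients = [
    {"id": 1, "flu_shot": "given", "ordered": True, "in_stock": True},
    {"id": 2, "flu_shot": "had_elsewhere", "ordered": False, "in_stock": True},
    {"id": 3, "flu_shot": None, "ordered": False, "in_stock": True},
    {"id": 4, "flu_shot": None, "ordered": True, "in_stock": False},
    {"id": 5, "flu_shot": None, "ordered": True, "in_stock": True},
]

def explain_missing(record):
    """Keep recorded values; give missing ones a reason when evidence exists."""
    if record["flu_shot"] is not None:
        return record["flu_shot"]
    if not record["ordered"]:
        return "missing: not ordered"
    if not record["in_stock"]:
        return "missing: vaccine unavailable"
    # No evidence found: this case stays in the true missing-data problem.
    return "missing: unexplained"

status = Counter(explain_missing(patient) for patient in patients)
unexplained = status["missing: unexplained"]

for label, count in status.items():
    print(label, count)
```

In this made-up example, three records start with an empty flu shot field and only one is still unexplained after the two checks.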

Another example is whether or not follow up care occurred after a surgery. The dataset may have many missing cases. Here are some questions to ask –

  • Would the patients have received the follow up care in your healthcare system (perhaps not if they live a long distance away)?
  • Would the patients all need the same follow up care, on the same schedule?
  • Does a phone call or emergency visit (rather than an office visit) count as follow up care and under what circumstances?

Sometimes, if you look at missing data from the patients’ perspective, a lot of options open up and you can assign a lot of missing cases to the ‘legitimate’ missing category.

Now start having fun

Once you have grouped your data properly and dealt with missing data, the real fun begins. You can look for associations and patterns in the data that begin to tell the story. If you have been cautious with cleaning and coding your data, you can have much more confidence in your findings than if you rush through it.  And if you have a data analysis plan, you now can start making progress on it.
