Taming the beast, part 2
Picking up on our beginning point from Taming the Beast, part 1, this post will focus on how to turn beastly data into polite, tidy youngsters who can eat in public on their own.
How is that done? For me, it is a complicated and ever-evolving process. First off, it helps to not fight the data. Like all good parents, accept your data like they are and focus on 'developing' them into well-mannered young adults.
1. Outing the problems
The first step is to survey the data set and identify all the problems. Start by running a frequency on the fields you care about. Here are the things you will see:
- Missing values
- Incoherent values (such as numbers with letters in them)
Looking over these frequencies, you can start developing your to-do list.
If the frequencies look clean, you will want to proceed to logical checks. For example, all NICU patients should be newborns. Especially long stays for a diagnoses that are usually short stays should be double checked.
All fields that relate to a single service should agree. For example, if the charges indicate that the child was in the NICU, other services charged should be reasonably typical of NICU patients.
You need to put some elbow grease into thinking up ways to check the data for soundness. This is where good analysts shine.
You will end this part of the project with a list of problems. You may need to check in with clinical or other experts before moving to the next stage.
2. Fixing and testing
As you clean the data, you will want to document your decisions and assumptions as those will become part of the data dictionary for your analytic dataset.
Some of the cleaning is pretty routine:
- You will need to decide whether outliers are true outliers or errant values (e.g., a 43 year old in the NICU is obvious an errant value, but a 100 day old infant may not be).
- If your distribution is very skewed, you may want to perform a transformation that creates a new field categorizing the data into quartiles or a similar break down. Your decision will affect the statistical procedures you eventually want to run.
- If you have data with unexpected values in it, you will want to consult clinical experts to try to figure out how to interpret them, set them to missing, or impute a usable value.
Working with missing or null data: Missing or 'null' values (sometimes called 'nonobserved' values) show up in your data set as empty fields and are a thing to contend with. Sometimes they are empty because the clinician did not provide a service or enter some data that they SHOULD have, and sometimes they are missing because the clinician correctly did not provide a service or enter a value, and sometimes they are missing for a host of other reasons. If you are going to have accurate denominators when you run your analysis, you will have to sort this out.
For example, suppose you are looking at vaccines given to patients. Your data set is a list of all patients with all of their vaccine detail. You could imagine that a newborn would have a lot of empty fields. And you could imagine that only adolescents would have mostly complete data. You might want to distinguish between which fields have missing data because the child should not have gotten the vaccine yet (they were too young – what we term a 'legitimate missing' null field) from those in which the child should have gotten the vaccine but did not (what we would simply term 'missing' null field).
In other words, if you can identify that something that is null should be null in fact, you can assign that a value and use that information later (such as taking that case out of your denominator).
If there are multiple reasons for data to be legitimately missing and it is important to distinguish among them, you can use different missing value codes for each reason. In our example, you might want to distinguish between case with a legitimate missing value because the child is too young for a certain vaccine (you could assign the value '77' in that field) from a case with a legitimate missing value because the child just entered the country and we have no historical data for that vaccine (you could assign the value '88').
Sometimes I assign a number of these values as I learn more about the data just to help me figure out the underlying logic of the field. It helps me track what I know about the missing data and what I don't know. In the end, I know exactly which cases should be excluded from which analyses and why, and that knowledge is well documented for others.
If a lot of the data are simply missing for no identifiable reason, there may be a need to examine the workflow that goes into producing the data. Unreliable workflows result in unreliable data. In such a case, you may not be able to do any more with the data.
Replacing null values with imputed values: Under special circumstances, missing values can be replaced with values imputed from other data. For example, if you are missing age at discharge and you have the birthdate and discharge date, you can impute age at discharge. Sometimes services are bundled, and you can assume if they received the bundle, they received each of the services in the bundle.
Imputing can become quite complex and can stretch the bounds of reason. It is a quite controversial practice, but it is often perfectly reasonable, when exercised with intelligence and restraint. This is another area where good analysts really prove their mettle.
Document, document, document: As you work through these choices and test your solutions, it is often helpful to talk them over with others. For sure, document all of your decisions and all of the things you tried. Here is an example of coding I did for a project, with explanations embedded so that others could follow my logic. You will never understand the logic as well as you do when you are writing the code, so pay yourself forward and document both in code and in English.
3. Creating analytic fields
Once the data are clean, you are ready to create whatever fields or variables you need for analysis. If you want the total cost of the hospital stay, now is the time to compute it.
As it turns out, this offers yet one more chance to find problems with the data. For example, if the total cost of an outpatient visit that you compute is highly variable, you might ask if you have the right fields going into the computation.
It is always a good idea to check that computed fields result in values that make sense, just like you did with the raw data.
When I write code, I always distinguish between the section of code for cleaning and fixing (which is always placed at the beginning of my code block) and the section for analysis (which is always at the end). This helps me avoid running analyses later on only partially cleaned data.
4. Conduct analysis
Now it is time to conduct the actual analysis. It should be no surprise that during this stage, new problems with the data may surface. So fixing and testing will continue.
How clean is clean?
This is a vexing question. One of my earliest mentors taught me to keep cleaning until I could actually hear the data squeaking as it churned through my hard drive. But that is often not practical, especially with the extremely complex clinical data.
Here are some of the things I think about before I say I have done enough:
- Will more cleaning change the results of the data in a non-trivial way? This is absolutely the most important question to consider.
- Will more cleaning make the data more sensitive to quality improvement or process improvement efforts it will be used to evaluate?
- Will more cleaning result in findings that are more representative or more robust (meaning they can be used over and over in different contexts)?
- Do the number of problem cases make up more than 3% of all the cases and are those cases different from the others in any meaningful way?
If you answer is yes to any of these, you should probably keep cleaning.
BTW, the 3% cutoff is totally of my own making. After decades of cleaning data, it just seems about right for the complicated analyses we do. But there is nothing sacred about it. Use your best judgment after considering these questions. Talk it over with a colleague. In Finance, for instance, 3% is an absurdly high missing value rate. Context really matters.
So there you have it – the insider's view of establishing a valid and reliable analytic data set. We did not talk at all about normalization and data integration, which build on these processes, or about special kinds of data, like streaming or unstructured data. We will get to that in time.