Analysts talk a lot about data validation, but before data are validated and before they are used for any analysis, they need to be verified. Many of us think of this as ‘simple’ sense-checking – do the data we have pulled make sense considering what we have asked for? This step comes before validation and analysis.
There are three basic steps I take with every new data set I receive. The steps themselves are not complicated, but they can lead to pretty complicated fixes.
1. Do you have what you think you should have?
The easiest way to find this out is to run a frequency on each key variable – how many cases are in the data set? Is it about what you would expect? Running frequencies of age of the patient population, dates of services, service type and other key variables used in cohort selection will quickly tell you if you have the right data set.
For example, if you were wanting to pull the NICU discharges for your hospital, you would want to know that the number of cases you pull matches pretty well other sources of discharge data for your hospital, that the age range of the patients would be mostly under three months of age, and that the discharge dates fit within the year you wanted to pull.
With healthcare data, it is common to have duplicate cases for one reason or another. Running a script to identity duplicate cases is easy. It is key to remember, though, that a duplicate case is the same patient getting the same service on the same day from the same provider. Cases that are similar for the same patient on the same day are not duplicates. So set these rules with care.
2. Are the data usable?
Running a frequency distribution on your data fields, which is the first strategy mentioned above, also tells you a lot about the quality of the data. Here are some things you might find:
Data type mismatches often occur. For example, date fields may be read by your software as text or numbers, making them impossible to transform when you get to the analysis. You may need to go back to the original data source and add some code to your data pull to fix this.
Outliers occur when one or a few cases have values that are far different from other cases. If the outliers are the tip of a long distribution tail, they typically make sense. But if they make up the tail all on their own, you might want to check them. Subject area experts are often helpful at interpreting outliers, but sometimes it will be clear to you that the outlier is an error (for example, NICU patient with an age of 67 months probably had their age recorded incorrectly).
Empty fields (‘missing' or 'null' data) sometimes indicate an error in the source data, an error in the data pull request, or may indicate no error at all. When you see missing data, the first question to ask is what proportion of cases are missing. If all cases are missing a certain data field, there is likely a problem with the data pull or the source data. If some of the cases are missing, then you have some work to do to figure out if the missing values are logical or not. If they are, the data are usable. If not, you may need to repull the data. Subject area experts can be extremely helpful on this score.
3. What is the data file’s unit of analysis?
Many times, an analyst needs to pull all visits or all interactions for a patient in order to identify the right patient cohort. If the analysis plan uses the patient as the unit of analysis, there are many ways to aggregate the services or transactions up to a single case. What is important as this stage is to make sure the data file you have can be aggregated in the way you want to down the line — all grouping variables are present and clean, all fields that will be summed, counted or flagged have values that are summable, countable and flag-able.
Even after these three steps are taken, you may find the data are not what you expect in important ways, sometimes not until you are deep into analysis. But if you follow these steps, you will have far fewer surprises down the line and your analysis will get completed on time most often.
Comments will be moderated.