January 3, 2017

In providing some guidance on statistics, this blog has run a little ahead of itself. There are some basic things we need to cover about the nature of data and counting things that will make all the statistical work run much more smoothly. First things first, this post will talk about what data look like when they are pulled from Epic, what assumptions we can make about them, and what we can do with them analytically. Let's start with where the data come from.

There are many ways people who work with data refer to columns of data – the most ubiquitous being the word 'field.' 'Field' refers to a place in a data file or application where certain data always go. For example, age is always placed in the 'age' field and gender is always put in a field named 'sex' or 'gender.' When you pull those data from a data system like Epic, they show up as columns in a spreadsheet and are often then referred to as fields or columns.

Embedded in Epic (and any data system) is a lot of information about each field. End users don't see this embedded information, but much of it affects the data pull. Each field has a definition that includes not only the name of the field, but also information like: how many digits the field has (for age, typically three digits), what values are allowed in the field (for age in years, typically limited to 120), how the data are formatted (number of decimal places allowed and other special formatting) and what the data mean (in the case of age, days, months or years).
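To make the idea of a field definition concrete, here is a minimal sketch in Python. It is not how Epic stores its metadata; the `FieldDefinition` class, its attribute names, and the `accepts` check are all hypothetical, chosen only to mirror the pieces of a definition listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """Hypothetical metadata for one field; not Epic's actual schema."""
    name: str
    max_digits: int      # how many digits the field can hold
    min_value: float     # smallest value allowed
    max_value: float     # largest value allowed
    decimal_places: int  # number of decimal places allowed
    unit: str            # what the number means (days, months, years)

    def accepts(self, value: float) -> bool:
        """Return True if the value satisfies every rule in the definition."""
        in_range = self.min_value <= value <= self.max_value
        right_precision = round(value, self.decimal_places) == value
        few_enough_digits = len(str(int(abs(value)))) <= self.max_digits
        return in_range and right_precision and few_enough_digits


# Age in years, using the limits mentioned in the text.
AGE_IN_YEARS = FieldDefinition(
    name="age", max_digits=3, min_value=0, max_value=120,
    decimal_places=0, unit="years",
)

print(AGE_IN_YEARS.accepts(7))    # a plausible age
print(AGE_IN_YEARS.accepts(121))  # above the allowed maximum
```

The point of the sketch is that the same number can be valid or invalid depending on the definition: 7.5 would be rejected here, but accepted by a field defined with one decimal place.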

Once data are pulled for analysis, analysts often no longer use the term 'field.' At that stage, the fields are often referred to as 'variables.' The term variable comes from mathematics. It is in contrast to the concept of 'constant' – which most of us have heard of. A constant is a field that has a value that never changes (like pi). A variable is a field whose value can differ from one record to the next. In our EMR, pretty much everything collected on our patients is variable. The only thing that approaches a constant is something like billing provider (which is Lurie Children's, most of the time).

Once data are pulled and sitting in a spreadsheet, there are six types of fields or variables that analysts face. Each has different implications for the process of analysis. See Table 1 for a look at which types of statistical procedures can be run on each type.

NONCONTINUOUS (DISCRETE) VARIABLES

Noncontinuous variables ('discrete' fields) are used to denote certain qualities things have, not the amount of something they have. A noncontinuous variable might indicate which patients have been admitted or not. A continuous variable would tell us how MANY times each patient was admitted.
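The admission example can be sketched in a few lines of Python. The patient IDs below are made up for illustration; the idea is only that the same admission records can produce either kind of variable.

```python
from collections import Counter

# Hypothetical data: one patient ID per admission record.
admissions = ["P01", "P03", "P01", "P01", "P03"]
all_patients = ["P01", "P02", "P03"]

admission_counts = Counter(admissions)

# Continuous: HOW MANY times each patient was admitted.
times_admitted = {p: admission_counts[p] for p in all_patients}

# Noncontinuous: WHETHER each patient was admitted at all.
ever_admitted = {p: admission_counts[p] > 0 for p in all_patients}

print(times_admitted)
print(ever_admitted)
```

Note that the noncontinuous version throws information away: once a patient is flagged as admitted, the number of admissions cannot be recovered from the flag.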

**Dichotomous variables:** These are variables that have only two values and the values do not represent an amount of something. They signal a qualitative difference.

Pregnancy is a dichotomous variable – you are either pregnant or not, there is no middle ground. Sometimes these data are collected in 'Yes/No' fields. An analyst quickly transforms them into 1/0 or 1/2 values that can be more easily manipulated in the analytic process.
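A minimal sketch of that Yes/No to 1/0 transformation, in Python. The raw values are invented, and treating blanks and unrecognized entries as missing is an assumption of this example, not a rule from any particular data system.

```python
YES_NO_CODES = {"yes": 1, "no": 0}


def recode_yes_no(raw):
    """Map a Yes/No entry to 1/0; return None for blanks or anything else."""
    if raw is None:
        return None
    return YES_NO_CODES.get(raw.strip().lower())


# Hypothetical raw entries, including messy capitalization and blanks.
pregnant_raw = ["Yes", "No", "no ", "YES", "", None]
pregnant_coded = [recode_yes_no(value) for value in pregnant_raw]

# With 1/0 coding, the percent 'Yes' is just the sum over the count
# of records that have a usable answer.
answered = [value for value in pregnant_coded if value is not None]
percent_yes = 100 * sum(answered) / len(answered)

print(pregnant_coded)
print(percent_yes)
```

This is one reason 1/0 is often more convenient than 1/2: the sum of the column is the number of 'Yes' records.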

**Categorical variables:** Categorical variables are just like dichotomous variables, but there are more than two values.

Race is the classic categorical variable. There are many different races and combinations of races that we all understand to be qualitatively different. We don't say that one race is more of some underlying 'racialness' than another. We accept them as simply different. The same is true of colors or car types or hats. Like dichotomous variables, an analyst routinely transforms these values into numbers because it is easier to manage them during analysis. So for the variable 'color,' 1 might equal red, 2 might equal green, etc. But just because they exist in the data set as numbers does not mean they are amounts.
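The sketch below, in Python with made-up data, shows why the numeric codes are labels rather than amounts. The code assignments are arbitrary, following the red = 1, green = 2 example above, with blue = 3 added for illustration.

```python
from collections import Counter

# Arbitrary numeric labels for the categories.
COLOR_CODES = {"red": 1, "green": 2, "blue": 3}
CODE_TO_COLOR = {code: color for color, code in COLOR_CODES.items()}

# Hypothetical observations.
colors = ["red", "blue", "blue", "green", "blue"]
coded = [COLOR_CODES[color] for color in colors]

# Appropriate: the percent of observations in each category.
counts = Counter(coded)
percent_by_color = {
    CODE_TO_COLOR[code]: 100 * n / len(coded) for code, n in counts.items()
}

# Computable, but meaningless: the "average color".
misleading_mean = sum(coded) / len(coded)

print(percent_by_color)
print(misleading_mean)
```

The software will happily report a mean of 2.4, but that number does not describe a color somewhere between green and blue. Renumbering the categories would change the mean without changing anything about the data.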

**Text fields:** These can be open-ended fields (like notes in the medical record), or structured fields that require words for data entry (like name, street and city). Text fields can be analyzed a number of ways.

Traditionally, special qualitative software has been used to code themes that reoccur in the text and explore those themes in depth. In true qualitative analysis, it is essential to preserve the specific words used during analysis, and to study the words, habits of speech, and/or the possible meanings behind them.

Technology has made other ways of approaching text data available. For example, address data are now readable with specialized geographic information system (GIS) software. This software reads a street address and codes it with its longitude and latitude so that the precise location of the address can be used for other analyses. Natural language processing is a method of pulling patterns of words into fields that can later serve as variables in analytic procedures (e.g., in what percent of encounters is the term 'DCFS' mentioned in the notes?). The vast majority of the time, these new technologies transform data into categorical variables.
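As a rough illustration of the 'DCFS' question, the Python sketch below turns free-text notes into a 1/0 variable with a simple keyword match. This is far cruder than real natural language processing (it would not catch misspellings or negations such as "no DCFS involvement"), and the notes are invented for the example.

```python
import re

# Match 'DCFS' as a whole word, in any capitalization.
DCFS_PATTERN = re.compile(r"\bDCFS\b", re.IGNORECASE)


def mentions_dcfs(note: str) -> int:
    """Return 1 if the note mentions DCFS, otherwise 0."""
    return 1 if DCFS_PATTERN.search(note) else 0


# Hypothetical encounter notes, one per encounter.
notes = [
    "Family reports prior DCFS involvement.",
    "Well-child visit, no concerns.",
    "Referred to dcfs per protocol.",
    "Follow-up scheduled in two weeks.",
]

flags = [mentions_dcfs(note) for note in notes]
percent_with_mention = 100 * sum(flags) / len(flags)

print(flags)
print(percent_with_mention)
```

The result is a dichotomous variable: each encounter either mentions the term or does not.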

Table 1. General rules for using different types of data in analysis

(some of which can be broken under certain circumstances)

| Type of variable | Is it continuous? | Can you compute percentages? | Can you compute means? | Can you compute 'nonparametric' statistical procedures? | Can you compute 'linear' statistical procedures? |
| --- | --- | --- | --- | --- | --- |
| Dichotomous | No | Yes | No | Yes | No |
| Categorical | No | Yes | No | Yes | No |
| Text fields (if transformed for quantitative analysis) | Mostly no | Yes | No | Yes | Mostly no |
| Ordinal | Yes | Yes | No | Yes | No |
| Interval | Yes | If transformed into a discrete variable | Yes | If transformed into a discrete variable | Some |
| Ratio | Yes | If transformed into a discrete variable | Yes | If transformed into a discrete variable | Yes |

Note: 'Nonparametric' statistical procedures include chi-square and logistic regression; 'linear' statistical procedures include t-tests, correlation coefficients, and ordinary least squares regression (the regular ol' regression you learn first).

CONTINUOUS VARIABLES

Continuous variables are used when the amount of something matters, such as the number of admissions, the amount of money in a bank account, or the number of red blood cells. These can get a little tricky.

**Ordinal variables:** When you complete a questionnaire, you are often asked how much you agree or disagree with different statements. You typically answer 'Strongly Agree, Agree, Neither Agree Nor Disagree, Disagree, or Strongly Disagree.' In this case, you are being asked the 'amount' to which you agree with the statement. That makes it a continuous variable.

But the difference between 'Strongly Agree' and 'Agree' is not the same as the difference between 'Agree' and 'Neither Agree Nor Disagree.' And that creates big problems. Actually, we do not KNOW if it is the same difference or not, and even if we knew it for you, we would not know it for others. Super big problems.

Like other types of variables, ordinal variables are often transformed into numbers for ease of analysis, with 1 meaning 'Strongly Disagree' and 5 meaning 'Strongly Agree.' But beware. That does not mean they can be used like other continuous variables in statistical procedures. They should not be. It is true they represent an amount, but since we don't understand much about the amounts, they need special treatment. Taking the average of such an ordinal variable (which SurveyMonkey does as a matter of course) is quite bad practice. These variables tend not to be normally distributed, and since we cannot interpret the difference between 3 and 4, responsible analysts typically treat ordinal variables like dichotomous or categorical variables: the percent of all the agreers is compared to the percent of all the neutral/disagreers (or vice-versa).
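Here is a minimal Python sketch of that approach: collapse a five-point scale into agreers versus everyone else, then compare percentages. The two groups and their responses are hypothetical, and counting codes 4 and 5 as 'agree' is the cut point assumed for this example.

```python
LIKERT_CODES = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither Agree Nor Disagree": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def percent_agree(responses):
    """Percent of responses that are 'Agree' or 'Strongly Agree'."""
    coded = [LIKERT_CODES[response] for response in responses]
    agreers = sum(1 for code in coded if code >= 4)
    return 100 * agreers / len(coded)


# Hypothetical survey responses from two groups.
group_a = ["Strongly Agree", "Agree", "Agree",
           "Neither Agree Nor Disagree", "Disagree"]
group_b = ["Agree", "Disagree", "Strongly Disagree",
           "Neither Agree Nor Disagree", "Disagree"]

print(percent_agree(group_a))
print(percent_agree(group_b))
```

The comparison is between two percentages, which makes no assumption about how far apart 'Agree' and 'Strongly Agree' really are.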

The rule I was given in my psychometric training was that there should be at least NINE possible response options before an average could be used (assuming you ended up with something like a normal distribution among all those values). That means a survey would offer the respondent some sort of 9-point scale to choose from. There are a number of ways to do this, but it is not done often. The most frequently used approach is the five-point scale, and that should be treated in almost all cases like a noncontinuous variable.

**Interval variables:** Unlike ordinal variables, interval variables have the same distance between each pair of adjacent values – like temperature in Celsius or Fahrenheit. The difference between 30 and 40 degrees is physically the same as the difference between 70 and 80 degrees. Because of that, assuming the data are somewhat normally distributed, the analyst can use means and other linear statistical procedures for interval variables.

However, there is no true zero point for these temperature systems. The zero points for Celsius and Fahrenheit were chosen for reasons other than a lack of molecular movement (Kelvin has a true zero point). Thus, you could never say 80 degrees Fahrenheit is twice as hot as 40 degrees Fahrenheit. Because of the absence of a true zero, even though you have equal intervals, you cannot compute ratios on these variables.
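A quick Python check makes the point. Converting both temperatures to Kelvin, which does have a true zero, shows what the ratio actually is. The conversion formula is the standard one; the function name is just for this example.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (deg_f - 32) * 5 / 9 + 273.15


# Naive ratio on the Fahrenheit numbers: looks like "twice as hot".
naive_ratio = 80 / 40

# Ratio on a scale with a true zero: only about 8 percent hotter.
true_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

# Differences, by contrast, survive the change of scale: the 30-to-40
# and 70-to-80 gaps are the same size in kelvins too.
gap_low = fahrenheit_to_kelvin(40) - fahrenheit_to_kelvin(30)
gap_high = fahrenheit_to_kelvin(80) - fahrenheit_to_kelvin(70)

print(naive_ratio, round(true_ratio, 2))
print(round(gap_low, 3), round(gap_high, 3))
```

Intervals are meaningful on either scale, but the "twice as hot" ratio is an artifact of where Fahrenheit happens to put its zero.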

**Ratio variables:** If you have stuck with me this far, ratio variables are the easiest to work with. They include a zero point and have equal intervals. Money, age, and number of days are all good examples. They can be used with pretty much all of the statistical procedures (again, presuming appropriate assumptions are met, like a reasonably normal distribution).

Each piece of data pulled from Epic has its own set of rules about how to interpret it and about how it can be analyzed. Confusing these very basic principles can derail any analytic plan very quickly.

Note: This topic was suggested by Susanna McColley, MD, Director, Clinical and Translational Research Program, Stanley Manne Children's Research Institute.
