
What is the unit of analysis and why should I care?

Analysts have words for things that no one else even thinks need words.  A couple of my favorite terms in analyst-speak are unit of analysis and unit of observation.  I use them a lot and people stare at me when I do.

Not just jargon

The unit of analysis is the entity being studied; the unit of observation is the entity you are collecting data from.  They can be the same thing, but often are not.  Here is an example:

Study question:  What does it take to increase provider compliance with a new clinical care guideline?

An example of when the unit of analysis is the same as the unit of observation: Providers report their views about the care guideline and their willingness to follow it in a survey or focus group.

An example of when the unit of analysis is NOT the same as the unit of observation: Data are pulled from Epic based on clinic visits, patients, or procedures to assess whether the clinical care guideline was or was not followed in each case.  The observations are aggregated up to the provider so that each provider has his/her own compliance rate.  The provider is the unit of analysis, since it is his/her behavior that we want to learn about. 
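As a rough sketch of that aggregation step, the Python below rolls visit-level records up to one compliance rate per provider. The record layout, the provider IDs and the function name are all invented for illustration; a real Epic extract would look different.

```python
from collections import defaultdict

# Hypothetical visit-level records (the unit of observation):
# each tuple is (provider_id, guideline_followed).
visits = [
    ("dr_a", True), ("dr_a", True), ("dr_a", False), ("dr_a", True),
    ("dr_b", False), ("dr_b", True),
    ("dr_c", True), ("dr_c", True), ("dr_c", True),
]

def compliance_by_provider(records):
    """Aggregate visit-level records into one compliance rate per provider."""
    followed = defaultdict(int)
    total = defaultdict(int)
    for provider_id, guideline_followed in records:
        total[provider_id] += 1
        if guideline_followed:
            followed[provider_id] += 1
    return {provider: followed[provider] / total[provider] for provider in total}

rates = compliance_by_provider(visits)

n_observations = len(visits)   # number of visits (units of observation)
n_analysis_units = len(rates)  # number of providers (units of analysis)
```

Nine visits go in, but only three rows come out. Those three provider-level rows are what the analysis of provider behavior would actually run on.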

The unit of observation is the source of data that describes your unit of analysis. In the diagram below, there are four levels of possible observation and analysis – the individual, the provider, the clinic and the hospital.  We could add more levels, for sure.  But in research and quality improvement work, we are often moving between these layers of activity.

Fig. 1: Possible units of analysis and observation in many of our research or quality improvement projects


Why it helps to know the difference

First, as you conduct statistical tests, the sample size you need is based on the unit of analysis, not the unit of observation.  In our example about provider compliance with clinical care guidelines, we would need to collect data on, say, fifty providers in order to judge whether we are changing provider behavior, even though the data for those fifty providers might be the aggregation of thousands of patient visits.  If, on the other hand, our goal is to know whether our patients experience clinical care guideline compliance, then the unit of analysis shifts to the patient and an overall rate is appropriate (we do not need to know what individual providers do).
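To make the sample-size point concrete, here is a small Python sketch that computes a standard error two ways from the same invented data: once treating every visit as an independent observation, and once treating each provider's rate as one observation. The fifty providers, the hundred visits each and the compliance counts are all assumptions made up for the example, and the visit-level formula is the naive one that ignores clustering of visits within providers.

```python
import math
import statistics

# Hypothetical data: 50 providers with 100 visits each.
# Provider i followed the guideline on 50 + i of those visits.
visits_per_provider = 100
compliant_counts = [50 + i for i in range(50)]

provider_rates = [count / visits_per_provider for count in compliant_counts]
n_providers = len(provider_rates)
n_visits = n_providers * visits_per_provider

# Visit level: one overall rate, with a naive standard error that
# treats all visits as independent observations.
overall_rate = sum(compliant_counts) / n_visits
se_visit_level = math.sqrt(overall_rate * (1 - overall_rate) / n_visits)

# Provider level: the mean of the provider rates, with a standard
# error based on the number of providers.
mean_provider_rate = statistics.mean(provider_rates)
se_provider_level = statistics.stdev(provider_rates) / math.sqrt(n_providers)
```

With these made-up numbers both approaches give the same average rate, but the provider-level standard error is more than three times larger, because it rests on 50 units rather than 5,000.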

Second, when the unit of analysis and unit of observation are different and we are confused about them, certain mistakes may get made:

(a) We put together a dataset that cannot be analyzed at all because it mixes different units of analysis in an unsystematic way. There are statistical procedures that work with 'mixed' data, but the data sets have to be set up properly from the start.

(b) We draw conclusions based on the unit of observation only.  Because the sample size at that level tends to be very large, our statistical conclusions are misleading (that is, the findings are more likely to be statistically significant than they would be if they were based on the unit of analysis).

(c) We commit what is called 'the ecological fallacy,' in which we draw conclusions about the units of observation by studying the unit of analysis.  For example, if a provider is 70% compliant with the clinical care guidelines and 30% of her patient visits were at geographically distant clinics, we might want to conclude that it was mostly the visits at the distant clinics for which the guidelines were not followed.  But we would have no evidence of that.  We would need to analyze the individual observations to determine whether that is true. The ecological fallacy occurs when we assume things about individuals based on group-level data.
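The ecological fallacy in the provider example can be shown with a short Python sketch. It builds two invented sets of 100 visits for the same provider. In both, she is 70% compliant overall and 30% of her visits are at distant clinics, yet the visit-level pattern is completely different. The clinic labels, counts and helper names are hypothetical.

```python
def make_visits(near_followed, near_total, distant_followed, distant_total):
    """Build a list of (clinic, guideline_followed) visit records."""
    visits = []
    visits += [("near", True)] * near_followed
    visits += [("near", False)] * (near_total - near_followed)
    visits += [("distant", True)] * distant_followed
    visits += [("distant", False)] * (distant_total - distant_followed)
    return visits

def compliance_rate(visits, clinic=None):
    """Share of visits where the guideline was followed, optionally for one clinic type."""
    selected = [followed for c, followed in visits if clinic is None or c == clinic]
    return sum(selected) / len(selected)

# Scenario A: every missed guideline happened at a distant clinic.
scenario_a = make_visits(near_followed=70, near_total=70,
                         distant_followed=0, distant_total=30)

# Scenario B: misses are spread evenly across near and distant clinics.
scenario_b = make_visits(near_followed=49, near_total=70,
                         distant_followed=21, distant_total=30)

overall_a = compliance_rate(scenario_a)
overall_b = compliance_rate(scenario_b)
distant_a = compliance_rate(scenario_a, "distant")
distant_b = compliance_rate(scenario_b, "distant")
```

The provider-level figures are identical in the two scenarios, so they cannot tell us which one we are in. Only the visit-level breakdown (here, the compliance rate at distant clinics) separates them.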

Oh, and then it gets messy

Figure 1 suggests a neat distinction between different possible units of analysis and units of observation.  But that assumes that different levels of units of analysis are truly independent of each other. That assumption usually does not hold. 

Figure 2 demonstrates one perspective on the social hierarchy around children as they develop (this is used by Developmental Systems Theory).  To study the child, you might be interested in using the family or peer group as the unit of analysis.  And it is not hard to imagine interplay among all of these levels. 

Fig. 2: Developmental Systems Theory model of child development


For example, in one study I was lucky to work on, my wonderful colleague, David Henry, used peer groups as the unit of analysis to study aggressive behavior in children. We gathered data from third grade children in many classrooms.  The children were surveyed about their own behavior and the norms of their classroom (or peer group). David was able to show that the level of aggressive behavior by children was driven significantly by peer group norms and that if you could change the norms for the peer group, you could change the behavior of the child to some degree.

He called it 'the return potential of aggression,' that is, some groups reward aggressive behavior and in doing so motivate children to act more aggressively towards one another than they would otherwise.  There was a clear interplay between the peer group and the individual child.  If the study had treated the child as the unit of analysis, we would have missed this really important finding.

Thus, part of the challenge in choosing a unit of analysis is working out where the action is: whose behavior you are trying to change, which triggers your QI project is trying to pull, and who would be responsible for reacting to them.  In reality there is action at more than one 'level' and probably some interplay between them.  David had a very educated hunch that something going on in the peer group had not yet been explored fully and was driving individual behavior, so he focused on that.

Some studies attempt to measure each level and determine after the fact which place of action is the most effective or interesting.  A study I oversaw years ago was exactly like this: it took place in schools and we collected data from students, teachers, and schools.  Our original unit of analysis was the student, but that changed once we had our data.  Because the sampling plan was very robust, we were able to switch to using the school as the unit of analysis, and indeed that is where we found the most interesting differences.  We then were able to 'control' for certain characteristics of the teachers and students and explore how different types of schools enable a better teaching process for teachers and a better learning process for students.  It was very cool.

Obviously, these nested models can become complex quickly.  And that is why it is so important to be clear in your mind, and to specify in your analytic plan, what your unit of analysis is and whether it is the same as or different from your unit of observation.  It is one of those tough, complicated decisions that you and your analyst will struggle with.  Getting through that struggle pays big dividends when you are in the throes of trying to figure out what the data have to say to you.
