Developing a scale to measure a latent trait (or how to ruin a really good party)

I should begin this post with the warning 'don't try this at home,' because so much knowledge, judgment and technique goes into scale development that I don't want anyone to think they are qualified to measure anxiety and depression in preteens based on my quick overview of how to develop a survey scale. On the other hand, the process used today is actually pretty much the same as the one that I was taught a LONG time ago as a psychology undergraduate. And my fellow undergraduates and I learned all those years ago that, if you are clever, you can make a party game out of it (actually, kind of out of any homework assignment, if you are hungry enough).

So at the risk of being scolded by many researchers I respect a great deal, I am going to emphasize the party game side of scale development to make this post a little fun and to help those who will never do this for a living get a flavor of what is involved. If you decide you want to do this for a living, I have some reference materials to share with you, a couple graduate programs to recommend, and some really great mentors to send you to visit (if they are still talking to me).

Prep time

First, invite a really large group of people over for dinner, in shifts. An open house arrangement is ideal because you actually want different people correcting/commenting on the work of the other groups.

Second, decide what latent trait you want to measure. You can pick anything from anxiety to authoritarianism. For the sake of keeping this post polite, let's say you decide to develop a scale for helpfulness – something you cannot measure accurately by direct questioning (that is, a latent trait).

Third, decide what specific approach you want to take to that latent trait. For example, if you are measuring something in the general population, your approach will be different than if you are measuring something for a small clinical population. This is the equivalent of planning a dinner party and deciding you absolutely have to make the coq au vin Julia Child's way (the analogy is measuring a pathological form of helpfulness, no offense to Julia Child-lovers). If so, then you might want to invite only Julia Child experts (that is, experts in treating pathological helpfulness). But if you really only want to generate ways of cooking really good chicken (that is, a general disposition of helpfulness among many different types of people), then you want to hear from a whole lot of chicken-cookers (invite a lot of people who have some interest in the idea of helpfulness).

The Party

Appetizers (or Shift 1: Question generation). Ask each guest to write ten questions that would tell us something about the helpfulness of the theoretical person answering the survey. Each question goes on its own index card. The questions might be something like 'How often do you offer to help with the dishes at the end of the party?' or 'How often do you pick up after yourself?' Let's say you have 100 index cards at the end of Shift 1. The more the better.

Dinner (or Shift 2: Content adequacy and face validity). While you are waiting for Shift 2 to arrive, copy the questions onto additional index cards so that every guest in Shift 2 has a full set (you will need a LOT of index cards, and super-human copying speed). Each new guest will then sort the cards into different piles based on how well they think each question taps into the issue of helpfulness. For example, some questions might seem to touch on friendliness or hospitality rather than helpfulness per se.

Your guests will tell you which questions really narrow in on helpfulness; it is not your job to tell them. But there will probably be a number of long debates about the difference between being helpful and being obsequious or a tattletale. That is the fun part. As the cards get sorted, people start getting clearer about what the essence of helpfulness is, as well as what is missing from the index cards. If your guests feel that there are elements missing, they can add their own index cards (this ensures 'content adequacy' – you are covering the full meaning of helpfulness).

You may also wonder if some of the piles tap into different aspects of helpfulness (e.g., is hospitality really just a certain manifestation of helpfulness, or is it something independent?). If you all agree some concepts are linked together, you might want to develop subscales. For example, if you decide hospitality and willingness to do chores for friends are both forms of helpfulness but different in key ways, you could include the index card stack for each of those traits as subscales to the overarching helpfulness scale.

In the end, you will have a pile of cards that 'on the face of it' look to tap into helpfulness (what is called 'face validity'), or you will have a couple of groups of cards that have face validity for whatever subscales your friends have decided are important.
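If you would rather not eyeball the piles, the sorting can be tallied. Here is a minimal sketch, assuming every guest sorts every card into exactly one named pile. The card names, the sort results, and the 60% agreement cutoff are all made up for illustration; nothing in the process dictates that particular threshold.

```python
from collections import Counter

# Hypothetical sort results: for each Shift 2 guest, the pile each card landed in.
sorts = [
    {"dishes": "helpfulness", "pick_up": "helpfulness", "greets_guests": "friendliness"},
    {"dishes": "helpfulness", "pick_up": "other", "greets_guests": "friendliness"},
    {"dishes": "helpfulness", "pick_up": "helpfulness", "greets_guests": "helpfulness"},
]

def agreement(sorts, pile="helpfulness"):
    """Share of guests who put each card in the given pile."""
    counts = Counter()
    for guest in sorts:
        for card, chosen in guest.items():
            counts[card] += (chosen == pile)  # adds 1 if it matches, else 0
    return {card: counts[card] / len(sorts) for card in counts}

def keep_cards(sorts, pile="helpfulness", threshold=0.6):
    """Cards that at least `threshold` of guests sorted into the pile (cutoff is an assumption)."""
    shares = agreement(sorts, pile)
    return sorted(card for card, share in shares.items() if share >= threshold)

print(agreement(sorts))
print(keep_cards(sorts))
```

The cards that survive the cutoff are the ones with face validity in your guests' eyes; the arguments over the borderline cards are still worth having in person.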

Dessert (or Shift 3: Develop your initial questionnaire). While you are waiting for Shift 3 to come, take all the helpfulness cards or subscale cards and create a questionnaire. If you have one or more subscales, note on one draft which questionnaire items belong to which scale (you will need this later; do not share it with the questionnaire respondents).

I am skipping a whole section on response types and theories about questionnaire responses. Let's assume for now that all of your questionnaire items can be answered on the same 5-term scale (Never, Rarely, Sometimes, Often, Always). Also, I am skipping issues like item ordering, reading level, good question design, etc.
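To make the later number-crunching possible, each answer has to become a number. A minimal sketch, assuming the five response labels are coded 1 through 5 and that each subscale is scored as the mean of its items. The item names (q1 to q4) and the subscale key are hypothetical; the key plays the role of the private draft that notes which items belong to which subscale.

```python
# Response labels from the questionnaire, coded 1..5 (the numeric coding is an assumption).
SCORES = {"Never": 1, "Rarely": 2, "Sometimes": 3, "Often": 4, "Always": 5}

# Hypothetical private key: which questionnaire item belongs to which subscale.
SUBSCALES = {
    "hospitality": ["q1", "q2"],
    "chores": ["q3", "q4"],
}

def score_respondent(answers):
    """Return the mean numeric score per subscale for one respondent."""
    numeric = {item: SCORES[label] for item, label in answers.items()}
    return {
        name: sum(numeric[item] for item in items) / len(items)
        for name, items in SUBSCALES.items()
    }

answers = {"q1": "Often", "q2": "Always", "q3": "Rarely", "q4": "Sometimes"}
print(score_respondent(answers))
```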

Administer the questionnaire to your Shift 3 guests. It is easiest if this is done online so that you don't have to waste time doing data entry. Tell your guests to leave.

OK, they can stay, but you have work to do at your computer.

It is time to factor analyze your scale. Factor analysis is a statistical procedure that looks for patterns in how individuals answer questions. If we are to measure an underlying trait like helpfulness successfully, respondents who are similarly 'helpful' should respond to a given set of questions in a similar way. Respondents with different levels and different kinds of helpfulness will respond differently. Factor analysis detects these patterns, as well as some you may not have expected. It is during factor analysis that subscales will begin to appear. You can check to see if these make sense against the question groupings that your guests made (the questionnaire draft you noted the subscales on).
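A real factor analysis needs a statistics package, but you can get a feel for the patterns it looks for from the table of correlations between items. The sketch below is not factor analysis itself, only a look at its raw material. The four items and eight respondents are made up, with answers coded 1 to 5.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up responses: one list per item, one position per respondent.
items = {
    "host_1":  [5, 4, 2, 1, 5, 4, 1, 2],
    "host_2":  [4, 5, 1, 2, 5, 5, 2, 1],
    "chore_1": [5, 2, 4, 1, 4, 1, 5, 2],
    "chore_2": [4, 1, 5, 2, 5, 2, 4, 1],
}

names = list(items)
corr = {a: {b: pearson(items[a], items[b]) for b in names} for a in names}

# Print the inter-item correlation matrix, one row per item.
for a in names:
    print(a.ljust(8), "  ".join(f"{corr[a][b]:+.2f}" for b in names))
```

In this toy data the two hospitality items move together, the two chore items move together, and the two pairs have little to do with each other. That block structure is the kind of pattern that would show up as two factors, and therefore two candidate subscales.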

You will probably find that a certain number of your questions do not add anything to the scale (they will have very low 'factor loadings') and you can drop these items from the scale. Once you have the items that seem to fit together as a scale, you can run a reliability statistic that tests the internal consistency of the underlying patterns. Factor analysis has already told us whether there are patterns in the ways respondents answer questions. Reliability analysis tells you how consistent the patterns are. If they are very consistent, you have something to go with.
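I have not named the reliability statistic, so take this as one common choice rather than the only one: Cronbach's alpha, which compares the summed variance of the individual items to the variance of respondents' total scores. A minimal sketch with made-up answers from five respondents to three items:

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])  # number of items in the scale
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Made-up responses coded 1-5: five respondents, three items.
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 3],
]

print(round(cronbach_alpha(responses), 2))
```

Values closer to 1 mean the items rise and fall together across respondents. A common rule of thumb treats something around 0.7 or higher as acceptable, though conventions vary by field and purpose.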

After dinner drinks (or Shift 4: Construct validity). So now you have a survey scale, perhaps 10 or 12 questions, that you believe tap into the underlying trait of helpfulness. It is time to test whether that is true, because they may actually be measuring another trait that just happens to be very closely associated with it.

You will administer another survey to Shift 4. This survey will include your new scale, plus several other validated scales that measure traits that are similar to or opposite of helpfulness. For example, although friendliness is different from helpfulness, you would expect friendly people to be more helpful than unfriendly people. So you might include a friendliness scale. On the other end of the spectrum, you might expect people who score high on frugality to not be helpful. So you might include a frugality scale that has been validated.

You will also include behavior questions about the last time the respondent offered help when they were not asked, when they last helped a stranger, etc.

What is important here is that the data you collect on behavioral items and other scales will help you understand what your scale is tapping into that is unique. If your scale is measuring the latent trait of helpfulness, there should be a strong association between respondents' scores on the helpfulness scale, their behavior, and their scores on scales measuring related traits. The other items in the survey essentially 'validate' that your scale is measuring a construct the other scales are not – the latent trait of helpfulness.

This type of validity is called 'construct validity' – it means that indeed you are actually measuring the thing you think you are measuring. If the correlations and associations you hypothesize are there (and if there is a good theoretical and empirical foundation for you to pose those hypotheses in the first place), you have a scale!
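Checking the hypothesized associations can be as simple as writing down the direction you expect for each one and then looking at the correlations. A minimal sketch, with made-up scores for eight Shift 4 respondents; the variable names, the behavior item, and the expected directions are illustrative assumptions, and a real study would also look at the size and statistical reliability of each correlation, not just its sign.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Made-up total scores, one position per respondent.
helpfulness = [12, 30, 25, 18, 40, 35, 22, 28]
others = {
    "friendliness":    [15, 28, 27, 20, 38, 30, 25, 26],  # validated scale (hypothetical data)
    "frugality":       [33, 20, 22, 30, 12, 18, 27, 21],  # validated scale (hypothetical data)
    "helped_stranger": [0, 3, 2, 1, 5, 4, 1, 2],          # behavior item: times in the last month
}

# Direction written down BEFORE looking at the data: +1 positive, -1 negative.
expected = {"friendliness": +1, "frugality": -1, "helped_stranger": +1}

def check_hypotheses(scale, others, expected):
    """Return {name: (r, supported)} where supported means r has the expected sign."""
    results = {}
    for name, scores in others.items():
        r = pearson(scale, scores)
        results[name] = (r, r * expected[name] > 0)
    return results

for name, (r, ok) in check_hypotheses(helpfulness, others, expected).items():
    print(f"{name:16s} r = {r:+.2f}  {'as expected' if ok else 'NOT as expected'}")
```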

Latecomers who did not read the invitation (or Shift 5: Replication). After you have analyzed the scale and determined it has construct validity, it is time to replicate it.

Replication needs to happen with an independent group (one you have not surveyed before) and the more different groups you field your survey to, the better you will understand your measure. For instance, you might in time learn that for some cultural groups, your notion of helpfulness does not appear to have the same validity. This can signal a problem with the measure, or give you insight into its covariates (traits related to it that might manifest differently for different subgroups of the population – like people who don't read invitations).

Replication is key to interpreting your scale, and makes it more widely accepted by others in your field who need to measure the latent trait you are measuring. If your scale has gotten a lot of airplay with many different groups and continues to show reliability and validity, researchers and clinicians will develop deeper confidence in what it tells them.

To summarize -

The key steps to good scale development are:

1. Prepare ahead (know what you are doing, why and how)
2. Generate candidate questions
3. Establish content adequacy and face validity for a subset of questions
4. Field a questionnaire to gather initial data
5. Identify items that hold together as a true scale and identify subscales (factor analyze the items)
6. Establish internal consistency of the scale items (reliability analysis)
7. Field a revised questionnaire with the refined scale, related behavioral items and other scales for well-established, related latent traits
8. Establish construct validity
9. Replicate and continue to learn about your scale

So there you go. That is the very basic, no frills version of survey scale development. There are other types of validity and reliability related to latent trait measurement that scientists test for, and the process can be quite protracted because the scale may require adjustment multiple times. Every time a single item changes in a scale, validity and reliability need to be re-established.

As I said at the outset, don't take this post to the bank. But if you ever witness my silent, frustrated twitch when someone tells me (or a real psychometrician), 'Let's just change this scale a little,' you'll understand what it means – that all of the scale building, validation, reliability testing, and replication work is being tossed out the window on the whim of someone who was not invited to the party in the first place.

Thanks to John Lavigne, PhD, for his review and feedback on this post. 

