Yet another reason why theory should guide evaluation

The “replication crisis” has been a hot-button issue in science for a while now. Simply put, many experiments are difficult or impossible to replicate. I’m a social psychologist, so that is the field where I have been following the discourse most closely. For example, many of the “classic” social psychology experiments you may have learned about in Psychology 101 have failed to replicate. This study suggests that perhaps we should discount two-thirds of published findings in social psychology! That is especially disheartening when I think about how many studies I read over the course of nine years studying social psychology in university.

Roger Peng (who teaches great courses over on Coursera, by the way, which is how I found his blog) recently wrote a super interesting post about this topic. Peng points out that fields with a strong background theory (as well as fields that do not rely on experimental design) aren’t experiencing a crisis.

This led me to think about evaluation and the importance of having a solid theory of change guide your work. If we evaluate a program without a theory of change, we call this a “black box evaluation.” Our results can tell us whether or not a program had an effect…but we have no idea why. Was it due to a particular component of the program? Effective staff? Something about the participants? And if we can’t answer why a program did or did not have an effect, we certainly can’t replicate the program in other places.

Before today I had mostly thought of the replication crisis as a research problem (one I think about when I wear my “researcher hat”), but I found it super interesting to see how it can also be an evaluation problem (and I will certainly incorporate it into my “evaluator hat” thinking!).

Developing valid self-report measures

Self-report measures (i.e., respondents read the questions and select their responses themselves) are pretty common in evaluation. They are relatively cheap and easy to administer to a large group of people. It’s a lot easier to email a survey link than it is to hire and train a team of research assistants to follow and observe your participants and record what they see.

Some purists are quick to dismiss self-reported data. Studies have shown that people are not very honest when it comes to self-reporting their college grades, height and weight, or seat belt usage, among other things. Some problems with self-report data include:

Social desirability bias: Self-report measures rely on the honesty of your participants. “Social desirability bias” is a fancy way of saying that, generally speaking, people want to present themselves in the best possible light. If your survey asks about a sensitive topic, such as exercise frequency, eating habits, or alcohol consumption, participants might not be truthful in their responses. One way to combat this is to make questionnaires anonymous.

Understanding and interpretation: Self-report measures also rely on participants understanding your questions and the available response options. If your survey item is being misunderstood, your resulting data isn’t going to tell you much.

Memory: Even if participants are being honest and they perfectly understand your survey questions, the quality of your data also depends on participants accurately remembering the pertinent details. Human memory is a lot worse than people generally realize.

Response bias: Several other factors can influence how a participant responds to a question. If you are in a good mood, you may be more likely to answer the question positively. The reverse is true as well – a bad mood can predispose you to answer a question negatively. Even your personality can influence how you answer a question!

Yikes, those are some serious problems. So what does this mean for evaluators? Given their many advantages, self-report measures are not going anywhere anytime soon. Thankfully there are some steps we can take to increase the validity of our surveys:

1. Pilot test your measures: Before you “go live” with your survey, you should pilot test your questionnaire with a small number of people (in a perfect world, this small group would be similar to your actual participants; if your survey is designed for youth, for example, you should be pilot testing it with youth, and so on). As part of your pilot test, you should conduct interviews to ensure your items and response options are being interpreted the way you intended.

2. Make your survey anonymous: Anonymity can encourage participants to be honest. It can also help if the evaluator leaves the room and participants are given privacy while completing the survey. Of course, sometimes we need a way to track surveys, as is the case in a traditional “pre-post” design (you will be matching a participant’s survey from before the program with a survey completed after the program). In this case, a random ID number can be used, although this adds a layer of complexity to your data management (the sketch after this list shows one way to generate such IDs).

3. Counterbalance your measures: Counterbalancing means randomizing the order in which survey questions appear. It could mean that the order of every section is randomized (in which case you would have a lot of different versions of the survey), or it could be as simple as splitting the survey in half and reversing the order, with some participants randomly receiving the first version and the others receiving the second. You might use the two-version method with a paper survey, but if you are surveying online, many of the main online survey providers offer ways to randomize question order, making it easy to have many different versions. The sketch below shows a simple take on both the random IDs and the two-version assignment.
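For anyone who wants to roll their own, here is a minimal sketch in Python (standard library only) of the two mechanical pieces mentioned above: generating random IDs for pre-post matching, and randomly assigning each participant to one of two question orders. The function names and survey sections are made up for illustration; most online survey platforms will handle this kind of thing for you.

```python
import random
import secrets

def make_participant_ids(n):
    """Generate n unique, hard-to-guess IDs for matching pre- and post-program surveys."""
    ids = set()
    while len(ids) < n:
        ids.add(secrets.token_hex(4))  # e.g. 'a3f91c0b'
    return sorted(ids)

# Illustrative section names only, not from any real survey.
SURVEY_SECTIONS = ["eating habits", "exercise", "alcohol use", "demographics"]

def assign_version():
    """Counterbalance by splitting the survey in half and flipping the halves.

    Version A: first half then second half; Version B: second half then first half.
    """
    half = len(SURVEY_SECTIONS) // 2
    first, second = SURVEY_SECTIONS[:half], SURVEY_SECTIONS[half:]
    if random.random() < 0.5:
        return "A", first + second
    return "B", second + first

if __name__ == "__main__":
    for pid in make_participant_ids(5):
        version, order = assign_version()
        print(pid, version, order)
```

One design note, under the same assumptions: if you ever keep a list linking IDs to names (for follow-up, say), store it separately from the survey responses so the anonymity from step 2 isn’t undone by the tracking needed for the pre-post design.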

How about you – do you ever worry about the validity of the self-reported data? If so, what are some techniques you use to increase the quality of your measures?

As a side note, I wonder if self-report measures will be less common in the future, particularly in the realm of health and exercise. ‘Wearable tech’ devices are becoming quite common (check out how these devices are used by Disney) and keep coming down in price. These devices can capture a tremendous amount of data, and it will be fascinating to see how exactly they can be used in evaluation. If anyone has an example of an evaluation that used wearable tech data, I’d love to see it!

Measuring attitudes that predict behaviours

It’s pretty typical to come across surveys asking about attitudes in evaluations. These survey results are often (though not always) used to make inferences about participants’ behaviours. How valid is this approach, and are there ways to structure attitudinal questions so they are more likely to predict behaviour?

In a lot of circles, it is accepted wisdom that attitudes don’t predict behaviours. The classic study in this regard is LaPiere (1934). LaPiere, a sociology professor at Stanford University, spent two years traveling in the U.S. with a Chinese couple. Over those two years, they visited 251 hotels and restaurants and were treated hospitably at all but one. LaPiere found this surprising, and when he returned home he mailed a survey to all of the businesses they had visited asking: “Will you accept members of the Chinese race in your establishment?” Of the 128 businesses that responded, 92% answered no. This study was seminal in establishing that attitudes don’t match behaviours, and it is still discussed in undergraduate social psychology and sociology classes.

Over the years it has been debated whether LaPiere’s study truly shows a discrepancy between attitudes and behaviours, or whether it simply shows that surveys often measure general attitudes (e.g., in general, would you allow members of the Chinese race in your business?) rather than specific attitudes (e.g., would you allow this specific Chinese couple in your business?), with specific attitudes being more likely to predict actual behaviour.

This notion is related to what is known as the Theory of Compatibility (Ajzen & Fishbein, 2005). Simply put, this theory states that attitudes are more likely to predict behaviour when the two are measured at the same level of specificity. For example, general attitudes toward organ donation are quite positive, but the actual number of people who register as donors is low – a discrepancy that has frustrated and confused researchers. But when Siegel et al. (2014) asked about attitudes specific to registering as a donor, they found that they could explain over 70% more of the variance in actual registration rates. A meta-analysis of over 88 studies provides further evidence: when the theory of compatibility was adhered to, the average correlation between attitudes and behaviours was r = 0.50; when it wasn’t, the correlation was only r = 0.14.
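To put those two correlations in perspective (this is just my own back-of-the-envelope arithmetic, not a figure from the meta-analysis), squaring them gives the share of variance in behaviour that the attitude measures account for:

$$r = 0.50 \;\Rightarrow\; r^2 = 0.25, \qquad r = 0.14 \;\Rightarrow\; r^2 \approx 0.02$$

In other words, compatible attitude measures accounted for more than ten times as much of the variation in behaviour as incompatible ones.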

So what does this mean for measurement in evaluation? First, as with most measurement questions, I would suggest looking at the theory of change. What is the program actually trying to accomplish – a change in attitudes, a change in behaviour, or both? Often, there is an assumption that providing participants with knowledge on a topic (e.g., what healthy eating habits are) will result in attitude change (e.g., “I should eat more healthy foods”), which will then result in behaviour change (e.g., participants increase their intake of healthy food) – this is known as a results chain.

Keeping with the above example, let’s say you are measuring the impact of a healthy eating workshop and you will be delivering a survey immediately following the workshop. This means that you can’t assess the impact on behaviours – your only options are knowledge and attitudes. How can we use the theory of compatibility to increase the chance that our attitude questions will actually predict behaviour? Rather than asking about general attitudes toward healthy eating (e.g., “How important do you think it is to eat healthy foods?”), we should be asking about specific attitudes (e.g., “How important do you think it is for you to eat 7-8 servings* of fruits and/or vegetables per day?”).

I’m curious about how others approach this in their evaluations. Do you generally measure attitudes, behaviours, or both?

*For the sake of this example, I used the guidelines from Canada’s Food Guide for an adult female, although that resource is certainly not without its controversy.