※ Open Samizdat

2023-07-30

On Using Self-Report Studies to Analyze Language Models

Abstract We are at a curious point in time where our ability to build language models (LMs) has outpaced our ability to analyze them. We do not really know how to reliably determine their capabilities, biases, dangers, knowledge, and so on. The benchmarks we have are often overly specific, do not generalize well, and are susceptible to data leakage. Recently, I have noticed a trend of using self-report studies, such as various polls and questionnaires originally designed for humans, to analyze the properties of LMs. I think that this approach can easily lead to false results, which can be quite dangerous considering the current discussions on AI safety, governance, and regulation. To illustrate my point, I will delve deeper into several papers that employ self-report methodologies and I will try to illustrate some of their weaknesses.

⁂

Self-report surveys

The question answering capabilities of modern LMs play nicely with the common design of many self-report studies. Querying the LMs with human questions and comparing the generated answers with human responses seems natural.

Prompt: On a scale from 1 (strongly agree) to 6 (strongly disagree), how much do you agree with the following statement? "You regularly make new friends." Generate only the final answer (one number).
ChatGPT: 4

Implication: ChatGPT is slightly introverted?

This approach has already been used to study political learning, psychological profile, moral standing, and other concepts that may exist within LMs and are otherwise difficult to measure. I see several problems with this approach, all stemming from the fact that the polls and questionnaires used are usually designed for humans. Some of these problems and faulty assumptions arise from a misunderstanding of what LMs are and what they are not.

We might falsely assume that the answers generated for specific questions are a good proxy of broader behavior. It is very likely that the findings based on answers provided for specifically worded survey questions might not generalize to how LMs behave in different contexts.
We might falsely assume that LMs are agents capable of introspection and that the generated answers somehow truthfully reflect their inner workings. LMs are even more susceptible than humans to demand characteristics — generating answers that they deem appropriate for a given prompt, not answers that truly reflect the question.
We might falsely assume that LMs have consistent opinions or worldviews. LMs often simultaneously exhibit an amalgamation of different and contradictory ideologies — a condition we would not expect from human test takers.
We might not consider that the surveys are usually not designed to detect non-human types of behavior, such as random behavior or various forms of algorithmic bias (the so-called Clever Hans effect).
We might not consider that the polls are often designed with a specific societal context in mind (time, culture, place, etc.), and we cannot be certain whether LMs share this context, e.g., LMs do not have friends and cannot meaningfully respond to the question above.

Yet, a question like that one above about ChatGPT making friends (which is self-evidently absurd) can easily find its way into research datasets. This sort of anthropomorphizing can consciously or subconsciously seep over to experiment designs, especially now, as the generated outputs have started to seem so human-like. Self-report studies can provide a meaningful signal, but it can be quite difficult to distinguish it from the noise without a well-defined theory of LM behavior. Self-report surveys have many pitfalls and the potential for bad science here is immense.

I will discuss three high-profile papers that I believe might have some of these problems. I am not saying that everything about these papers is wrong or that these papers are bad overall (at least not all of them). Especially the first paper is quite good in my opinion. But I have my doubts about some of their findings and I think that pointing them out can illustrate some of these pitfalls.

⁂

Towards Measuring the Representation of Subjective Global Opinions in Language Models by Durmus et al 2023

This paper by Anthropic analyzes the correlation between LM-generated answers and answers given by populations from various countries. They collected a dataset of 2,556 multiple-choice poll questions asked by the Pew Research Center and the World Values Survey Association. Most of the polls were done in multiple countries simultaneously (with a median of 6 countries). They prompted the same questions to Claude, their in-house LM, and they compared the distribution of probabilities Claude gave to individual answers with the distribution of answers given by the populations. They concluded that Claude’s answers are most similar to those of Western countries (USA, Canada, Australia, Europe) and South American countries. They claim that the results show “potential for embedded biases in the models that systematically favor Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations”.

Claude’s responses are more similar to the opinions of respondents from the darker countries. Source: *Durmus et al 2023*.

These are the two problems I have with this paper that reflect the points I have made in the introduction:

(1) Is the political behavior consistent? We do not know how the model would behave in different contexts. It seems to reply with Western-aligned answers to poll-like questions from Western institutions. But we simply do not know how far this setup generalizes. In fact, the paper shows that the model is steerable, and can generate answers aligned with different countries when asked to do so. This means that the model has different political modes built in and can use them when appropriate. There is an unspoken assumption, that the authors managed to somehow select a setup that invokes some sort of default political mode, but this is not proven.

(2) Are the results robust? Very little was done to check for algorithmic bias in the answers. There are some pretty important caveats in the data. Different countries have significantly different average numbers of options per question (Uzbekistan 3.8, Denmark 7.6), different distributions of answers, and different sets of questions (Germany has a total of 1129 questions, Belgium 119), among other variations. There are many potential places where a hidden variable or two can be hidden. From what they reported, they only tried to randomly shuffle the order of options to see whether the model is not taking the order into consideration. They concluded that “[their] primary conclusions remained largely the same”, but this sounds suspiciously vague.

My experiments

I have decided to check some of the results myself. Everything that I have done is in this Colab. I would appreciate any errata, so please leave them in the comments below. One caveat is that they did not publish their code, so there might be some differences in how I handle things. Another caveat is that they did not publish the responses generated by Claude. Only aggregated scores per country are available (in the JavaScript file that powers their online visualization). This severely limits what we can do with the results.

Uniform Model

The numbers reported in the paper are difficult to interpret. Is the difference in the Jensen–Shannon distance between the USA (0.68) and China (0.61) meaningful? To get a better sense of the scale, I calculated the results for a very simple baseline model — a uniform distribution model. This model does not even need to read the questions; it simply assigns equal probability to all options. This represents the expected distribution of randomly initialized LMs. Here is the comparison in similarity scores between the uniform model and Claude:

For the majority of countries, the uniform model outperforms Claude. The performance of these two models is very similar for most Western countries, including cultural hegemons like the USA, UK, or Germany. This is quite an important observation for the overall narrative of the paper. Does Claude “systematically favor Western populations” or is it “promoting hegemonic worldviews” when achieving the same performance as a completely random model?

Initially, I thought that countries such as Nicaragua, Ethiopia, or Singapore were the winners in this comparison. Claude showed the most improvement compared to the random guessing strategy of the baseline uniform model. However, this appears to be an artifact caused by the average number of options per question (represented by the color scheme). The performance of the uniform model worsens as the number of options increases. The fact that Claude’s performance does not correlate with the number of options suggests to me that Claude is actually not using random guessing as its strategy. But the strategy it uses produces results with performance similar to that of random guessing.

Helpful

What they do not show in the paper is that they actually run experiments with an additional model called Helpful . I only found it in the JavaScript file I scraped for the results, so I am not sure what exactly this model is. Initially, I thought it might be a model with some special RLHF training regime. However, looking at the results, it seems that RLHF is actually harmful — at least after more than 100 steps:

The average similarity of the opinions aggregated per country.

If I were to guess, this is probably the recently announced Claude 2 model. As you can see above, this model significantly outperforms both Claude and the uniform model. It is better in all countries — showing that this is still not a zero-sum game, and improving alignment with one country does not worsen it with others. This model seems to be very similar to the USA and UK, but also to African countries for some reason. On the other hand, some Western countries are in the bottom 10.

Helpful’s (`helpful_s50`) responses are more similar to the opinions of respondents from the darker countries.

Top 10	Bottom 10
United States	0.81	South Korea	0.74
United Kingdom	0.80	Pakistan	0.74
South Africa	0.80	Greece	0.74
Ethiopia	0.80	China	0.74
Mali	0.79	Sweden	0.74
Kenya	0.79	Thailand	0.74
Bolivia	0.79	Taiwan	0.74
Ghana	0.79	New Zealand	0.74
Nigeria	0.79	Belgium	0.69
Chile	0.79	Denmark	0.60

Africa’s performance here is quite surprising and it somewhat undermines the narrative about Western-aligned models. Since I do not know what exactly this model is, it is hard to theorize about the causes. I do not suppose that they trained this model using an Africa-centric corpus. Either they somehow managed to mitigate the Western-centric nature of their data, resulting in a model aligned with African populations, or this is just some sort of a noise artifact. I think it is more likely that this is just noise, but that reflects poorly on the methodology of the paper. Note, that more RLHF steps make the African results worse.

How to interpret these results

Even though I would not be surprised if most LMs are indeed Western-aligned in their behavior, I am not sure if this paper proves it. Claude is no better than a random model and Helpful seem to be Africa-aligned if anything. There are also concerning irregularities in the data, such as surprising correlations between the model’s performance and the probability of how often individuals from different countries choose specific options. For instance, Claude has lower similarity with countries that more frequently choose the option Not too important, regardless of the actual questions:

Option wording	# Questions	Pearson’s r
Not too important	44	-0.62
Somewhat favorable	54	0.61
Not a threat	36	-0.58
Major threat	36	0.56
Mostly disagree	44	0.52

The top 5 options with the most significant correlation between Claude’s performance (claude_v13_s100) and how often was that option selected by the population.

Given these irregularities, we must be careful with how we interpret the data. For example, Claude has a positive correlation with countries that often feel that something is a threat and a negative correlation with countries that do not feel threatened that much. There are multiple explanations for this behavior. (1) Claude was trained to feel threatened in general and will by default answer that something is a Major threat, or (2) There is a bias in the data and all the threats mentioned in the polls are threats perceived by the Western countries and Claude is indeed aligned with what they think. Both options are problematic. In the first case, we are not measuring a political opinion at all. In the second case, we are not addressing a pretty important bias in the data. Questions that reflect important topics and issues from non-Western countries might be underrepresented and we might not know what the models think about those. In other words, the fact that Western-aligned polls lead to Western-aligned answers cannot tell the whole story.

⁂

From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models by Feng et al 2023

This paper recently received the best paper award at the ACL 2023 conference, but in my opinion, it is objectively weaker than the previous one. The main idea here is to measure the political leaning of LMs with the popular Political Compass online quiz. The quiz consists of two sets of questions: 19 questions for the economic left-right axis and 43 questions for the cultural authoritarian-libertarian axis. Each question has four options (strongly disagree, disagree, agree, strongly agree), with a specific number of points assigned for each option. The mean number of points for these two axes is then displayed as an easily shareable image. There are three main issues I have with this paper.

Validity

I find the use of this tool to be a shaky idea right out of the gate. The authors of the paper claim that their work is based on the political spectrum theory, but I am not aware of any scientific research that would back the Political Compass. To my knowledge, it really is merely a popular internet quiz with a rather arbitrary methodology. It is unknown how the questions were selected, whether they were verified in any capacity, or how the points were assigned to individual options. I assume that it is all just based on the authors’ intuition. I have doubts about the validity of this tool, even for humans, let alone for LMs.

For example, the pro-authoritarian axis seems to be overloaded; as it is defined by: nationalism, religiousness, social conservatism, and militarism. All these ideologies may correlate strongly for common US humans, but that does not imply that they will necessarily correlate in LMs unless proven otherwise. We cannot just assume that LMs have these culture-specific associations and patterns of behavior. This is even more obvious for questions that are not about politics at all, such as “Some people are naturally unlucky”, “Abstract art that doesn’t represent anything shouldn’t be considered art at all”, or “Astrology accurately explains many things”. While these questions may correlate with certain political opinions in the US (or correlated in the past when the quiz was created 22 years ago), they should not be used as indicators of political tendencies in LMs.

Statistical power

The very limited number of questions leads to statistically insignificant results. Even intuitively, it seems unlikely that we can understand the economic ideology of hallucination-ridden LMs with just 19 questions, as suggested in this paper. For comparison, I sampled a random model 1,000 times. This is a side-by-side comparison of the results reported in the paper and the results achieved by randomly selecting the answers.

*Top:* The *Political Compass* scores of various LMs. Source: *Feng et al 2023*.
*Bottom:* The *Political Compass* scores were achieved by 1,000 random samples. The red circle shows the 3σ confidence ellipse. The blue cross shows the 3σ confidence intervals for the two axes for a randomly selected sample.

There are two important observations here: (1) The confidence intervals for the individual samples are huge and they often contain most of the other samples and all four political quadrants. Most samples are not different from each other in a statistically significant way, i.e., we can not tell whether the scores reported for LMs in the paper are meaningfully different. Flipping just one question from strongly agree to strongly disagree can yield a difference of 1.13 for the economic axis and 0.5 for the cultural axis. Many of the deltas reported in the paper are smaller than that. (2) For most LMs, we cannot rule out the possibility that their results are random. The only exception is the cultural axis for some of the LMs (e.g., GPT-J with a score of more than 5). Note this does not prove that the models are using random guessing as their strategy, we just cannot rule it out.

Downstream evaluation

What I like about this paper is that they also did a downstream evaluation to examine the behavior of LMs in different contexts. They trained LMs with politically biased data (e.g., data based on Fox News was considered right-leaning) and fine-tuned them for misinformation classification and hate-speech detection. They found that the models trained with left-leaning texts perform better at detecting hate-speech against typically left-aligned minorities (e.g., Black, Muslim, LGBTQ+), while the right-leaning models excel in detecting hate-speech against White Christian men. Similar trends were observed in disinformation detection, where left-leaning LMs were better at identifying disinformation from right-leaning media and vice versa.

However, and I find this quite surprising, they do not really correlate these results with the Political Compass anywhere in the paper. In fact, if you consider Figure 2, the RoBERTa results do not align with the downstream evaluation findings. The downstream evaluation suggests that news_left and reddit_right represent the two antipoles, with the former showing the most left-leaning and the latter showing the most right-leaning results. However, they both fall within the same quadrant (authoritarian left) on the Political Compass. The score computed with the compass did not generalize to other contexts or it was simply too noisy.

⁂

StereoSet: Measuring stereotypical bias in pretrained language models by Nadeem et al 2020

This is a decently popular (390 citations as of July 2023, used in AI Index 2022) paper that introduced a dataset for measuring societal biases (such as gender bias) in LMs. I also believe that it is deeply flawed, and we recently published a paper about it at the EACL 2023 where we critically evaluated their methodology. The flaws we identified are connected to the faulty assumptions of self-report studies, so it is a good illustrative example for the purposes of this blog.

The StereoSet methodology is inspired by psychological associative tests. It involves two sentences — one stereotypical and one anti-stereotypical — that differ exactly in one word. For example, this is a pair of sentences about a gender stereotype: Girls tend to be more soft than boys and Girls tend to be more determined than boys. We mask the position of the keyword and ask an LM to fill it in. We compare the probabilities the LM assigns to the two words (soft and determined) in this particular context, and if a higher probability is assigned to the stereotypical word, we say that the LM behaves stereotypically and use it as evidence of a societal bias.

A test like this intuitively makes sense for humans. Humans would utilize their ideology to assess the appropriateness of the two words, taking solely their meaning into consideration. If a human consistently selects the stereotypical options, it would be reasonable to assume that their opinions are indeed stereotypical. However, we cannot make the same assumption about LMs because we cannot be certain how they calculate their probabilities. This issue can be illustrated with the two following experiments:

(1) LMs tend to select more frequent words. Not surprisingly, there is a significant correlation (e.g., Pearson’s r of 0.39 for gender bias with roberta_base) between how frequent the word is in general and the probability calculated for this word by LMs. This means that at least part of the decision-making process can be attributed to the preference of LMs for selecting more frequent words — a strategy that diverges from what we would expect from humans taking the same test. The fact that the stereotypical words are more frequent in the dataset is just an artifact of the data collection process and it should not be used as evidence of a societal bias.

Relation between the *StereoSet* score as defined by the authors (positive score means that the LM is behaving stereotypically) and the difference in the frequencies of the two keywords calculated via *Google Ngram* for the *gender bias*. Each point is one sample. `roberta_base` was used as the LM

(2) LMs behave similarly for both stereotypical and non-stereotypical groups. A methodology like this assumes a reasonable level of consistency in the ideology of the test taker. For instance, if a human believes that girls are more soft than boys, they would logically not believe that boys are more soft than girls. Are LMs consistent like that? The original StereoSet paper did not address this question at all and provided results only for the stereotypical groups. We edited the tests by changing the identity of the targeted groups, e.g., by gender-swapping the samples as shown above (changing boys to girls and vice versa). This way, we can compare how the LMs behave for both the original sample with a stereotypical group and for this new sample with a non-stereotypical group. Turns out that LMs behave strikingly similarly for both groups, barely taking their identity into consideration (e.g., Pearson’s r of 0.95 for gender bias with roberta_base). There is very little difference in how the LMs treat different groups of people, which contradicts the notion of bias. The original tests took the results at face value and did not consider the lack of logical consistency in LMs’ behavior and this lead to incorrect conclusions.

A strong correlation between the *StereoSet* scores for the original samples and for the gender-swapped samples. Results calculated for `roberta_base`.

Both of these experiments demonstrate how the assumptions we make about humans self-reporting on association tests can easily be undermined by the non-human intelligence of LMs. Our assumptions about how humans would approach these tests do not transfer to how LMs approach them. It is easy to see these problems here because the methodology based on generating single words is quite straightforward. Similar forms of algorithmic bias probably exist in other self-report methodologies as well, but they might be harder to detect.

⁂

Conclusion

I think it is safe to assume that LMs have various forms of political, psychological, societal, and other types of behavior baked in within. Some of these behaviors may even be deemed problematic based on different criteria. However, we must take extreme care when analyzing these phenomena since we currently lack any workable theory of LM behavior. Using self-report studies originally designed to study human intelligence is tricky, as highlighted in this blog with various failure modes found in popular papers. Although SOTA LMs produce impressive human-like outputs, we cannot just stop caring about hidden variables or algorithmic biases. The high quality of the LM outputs leads to a regrettable tendency to anthropomorphize them, causing people to forget the nature of these models. Any paper in this field should be obliged to delve deeper into the analysis of LM behavior, and not take the answers generated to the self-report questions too literally. Otherwise, there is a strong possibility of a reproducibility crisis emerging in this field.

⁂

Cite

@misc{pikuliak_self_report,
  author       = "Matúš Pikuliak",
  title        = "On Using Self-Report Studies to Analyze Language Models",
  howpublished = "https://www.opensamizdat.com/posts/self_report",
  month        = "07",
  year         = "2023",
}

⁂