※ Open Samizdat

2023-10-01

Multilingual Reading Skills of Language Models

Abstract: The Belebele dataset, designed for multilingual evaluation of language models’ reading skills, was recently released with a respectable 115 languages. I noticed that the genealogical linguistic analysis in the paper is somewhat lacking, with the authors providing almost no insight into the behavior of the models across language families. This oversight makes it hard to understand how the various models stack up against each other. To address this, I did a simple analysis focused on individual languages and language families, which led to some interesting discoveries.
Although we have had massively multilingual models for several years now, I have a sense that we still mostly do not know how well they work on the languages they claim to support. Instead, practitioners often resort to trial and error, trying to develop an intuition about which models are best suited for their particular language. It is simply much easier to gather training data for many languages from various text dumps than to rigorously evaluate the resulting models on all of those languages.
The number of multilingual benchmarks that address this issue remains limited as well. The benchmarks often cover only a subset of languages supported by existing models, and unless the data are fully parallel, it is almost always hard to meaningfully compare results from different languages. Furthermore, these benchmarks are frequently created by scraping data from the Internet, resulting in significant noise and other issues.
The recently released Belebele dataset has none of these issues. It is highly multilingual and fully parallel across 115 languages, or 122 if we include script variants. The authors seem to have taken great care to ensure high data quality. I also like how the task is defined: it is a reading skill task, where the model’s goal is to select the correct answer to a question based on a paragraph of text. They managed to collect 900 samples, such as this one (a small code sketch for working with such samples follows the example):
Paragraph: Insects were the first animals to take to the air. Their ability to fly helped them evade enemies more easily and find food and mates more efficiently. Most insects have the advantage of being able to fold their wings back along the body. This gives them a wider range of small places to hide from predators. Today, the only insects that cannot fold back their wings are dragon flies and mayflies.

Question: An insect’s ability to fold their wings back increases which of the following?

Options: 1. Food supply 2. Hiding spaces 3. Finding mates 4. Flight speed
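To make the task format concrete, here is a minimal sketch of how such a sample could be represented and turned into a prompt for a generative model. The class, the field names, and the prompt template are my own illustration, not the dataset’s official schema.

from dataclasses import dataclass

@dataclass
class BelebeleSample:
    # Illustrative structure of one item; the released data uses its own field names.
    passage: str
    question: str
    options: list[str]  # exactly four candidate answers
    correct: int        # index of the correct answer, 0-3

def as_multiple_choice_prompt(sample: BelebeleSample) -> str:
    """Format one sample as a plain multiple-choice prompt."""
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(sample.options))
    return (
        f"Paragraph: {sample.passage}\n\n"
        f"Question: {sample.question}\n\n"
        f"Options:\n{numbered}\n\n"
        "Answer with the number of the correct option."
    )

# The insect example from above, as a model would see it:
example = BelebeleSample(
    passage="Insects were the first animals to take to the air. ...",
    question="An insect's ability to fold their wings back increases which of the following?",
    options=["Food supply", "Hiding spaces", "Finding mates", "Flight speed"],
    correct=1,
)
print(as_multiple_choice_prompt(example))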
Overall, I think that the design and execution of this dataset are pretty good. As for evaluation, they compared the performance of smaller fine-tuned masked language models (MLMs) with the zero-shot or few-shot performance of large generative language models (LLMs). They also experimented with different ways of incorporating machine translation into the mix, such as translating the English training data into all the other languages or translating the prompts.
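For the generative models, one common way to run such a multiple-choice task (not necessarily the authors’ exact protocol) is to score each candidate answer by the log-likelihood the model assigns to it after the prompt and pick the highest-scoring one. A rough sketch with Hugging Face transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_option_by_loglikelihood(model, tokenizer, prompt: str, options: list[str]) -> int:
    """Return the index of the candidate answer with the highest log-likelihood
    when appended to the prompt. Ignores subtle tokenization-boundary effects."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    scores = []
    for option in options:
        input_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(input_ids).logits                   # [1, seq_len, vocab]
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
        option_ids = input_ids[0, prompt_len:]                 # tokens of the answer only
        option_log_probs = log_probs[prompt_len - 1:].gather(1, option_ids.unsqueeze(1))
        scores.append(option_log_probs.sum().item())
    return max(range(len(options)), key=scores.__getitem__)

# Usage (any causal checkpoint works, e.g. a small GPT-2 for a smoke test):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# best = pick_option_by_loglikelihood(lm, tok, prompt_text, option_texts)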
As for the negatives of the paper, I find the linguistic analysis of the results inadequate. The main results in Table 2 only show numbers for (1) English and (2) an aggregate of all the other languages. To have this many languages and not check how language families behave is really unfortunate. I have recently published a paper about how important it is to perform even a simple qualitative analysis based on linguistic typology or comparative linguistics during multilingual evaluation, and how misleading the average can be if such analysis is not performed. I thought that Belebele was a prime candidate for this type of analysis, so I decided to visualize the results from the paper to see what I could find.

XLM-V vs. InfoXLM

The paper compares three MLMs (XLM-V, InfoXLM, and XLM-R). The authors conclude that XLM-V is the best and hypothesize that this is caused by its larger vocabulary (902k vs. 250k tokens). But if we break the results down by family (see the table and the aggregation sketch below), XLM-V is better than InfoXLM only for the Atlantic-Congo language family, while InfoXLM is slightly but consistently better for all the other families.
                  Win rate                Performance
                  XLM-V      InfoXLM      XLM-V      InfoXLM
Atlantic-Congo    88.9%      11.1%        46.4       37.2
Other             21.4%      78.6%        63.4       64.9
InfoXLM dominates in most language families apart from the Atlantic-Congo family.
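For completeness, the per-family numbers in tables like this one can be produced with a simple aggregation: group per-language accuracies by family, count how often one model beats the other, and average the scores. The sketch below uses made-up accuracies and my own family assignment; the real numbers come from the per-language results reported in the paper.

import pandas as pd

# Made-up per-language accuracies; the real analysis has one row per Belebele language.
results = pd.DataFrame(
    [
        ("swh_Latn", "Atlantic-Congo", 52.1, 47.3),
        ("yor_Latn", "Atlantic-Congo", 38.9, 33.0),
        ("deu_Latn", "Indo-European", 71.2, 73.5),
        ("ces_Latn", "Indo-European", 69.8, 71.0),
    ],
    columns=["language", "family", "xlm_v", "infoxlm"],
)

def family_summary(df: pd.DataFrame, a: str, b: str) -> pd.DataFrame:
    """Per language family: how often model `a` beats model `b`, plus mean accuracies."""
    df = df.assign(a_wins=df[a] > df[b])
    summary = df.groupby("family").agg(
        winrate_a=("a_wins", "mean"),
        mean_a=(a, "mean"),
        mean_b=(b, "mean"),
    )
    summary["winrate_a"] = (100 * summary["winrate_a"]).round(1)  # as a percentage
    return summary

print(family_summary(results, "xlm_v", "infoxlm"))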

LLMs vs. MLMs

The comparison between MLMs and LLMs is also interesting. In the paper, they simply concluded that LLMs are worse for non-English languages. However, this is not entirely true either. Both zero-shot GPT3.5-TURBO and 5-shot LLAMA-2 outperform the MLMs in four major European language families (Germanic, Italic, Slavic, and Uralic; from now on I will call this group GISU), as well as in several higher-resource Asian languages (such as Chinese or Malay). The paper calls LLMs English-centric, but I think it would be more appropriate to call them Euro-centric.
          Win rate                       Performance
          GPT3.5-TURBO    InfoXLM        GPT3.5-TURBO    InfoXLM
GISU      82.1%           17.9%          75.5            73.5
Other     10.3%           88.5%          43.2            56.7
InfoXLM dominates in most language families apart from some GISU families.

Zero-shot GPT3.5-TURBO vs. 5-shot LLAMA-2

A similar pattern can be seen when we compare zero-shot GPT3.5-TURBO and 5-shot LLAMA-2. In this case, LLAMA-2 wins in Europe and GPT3.5-TURBO wins everywhere else. However, I am not sure what to make of it. The authors made an odd choice in their experiment design: there is no overlap between the models prompted in a zero-shot and in a few-shot manner. It is not clear to me whether LLAMA-2 is better in GISU languages on its own, or whether it is the few-shot prompting that makes it better. My assumption is that it is the latter, and that GPT3.5-TURBO would win if it were few-shot prompted.
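To spell out what this confound means in practice, the only difference between the two settings is whether solved examples are prepended to the prompt. A small sketch, reusing the hypothetical BelebeleSample and as_multiple_choice_prompt helpers from the earlier snippet:

from typing import Sequence

def build_prompt(test_sample: BelebeleSample,
                 demonstrations: Sequence[BelebeleSample] = ()) -> str:
    """Zero-shot prompt if `demonstrations` is empty, k-shot prompt otherwise."""
    parts = []
    for demo in demonstrations:
        parts.append(as_multiple_choice_prompt(demo))
        parts.append(f"Answer: {demo.correct + 1}")  # reveal the gold answer
    parts.append(as_multiple_choice_prompt(test_sample))
    parts.append("Answer:")                          # the model continues from here
    return "\n\n".join(parts)

# 5-shot prompt in the spirit of the LLAMA-2 setup (exact formatting is my guess):
# prompt = build_prompt(test_sample, demonstrations=training_samples[:5])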
Overall, if we are to believe the results of the Belebele benchmark, my intuition is that these are the recommended models for different language families (although with multiple exceptions):
GISU              few-shot GPT3.5-TURBO
Atlantic-Congo    XLM-V
Otherwise         InfoXLM
I think it is quite encouraging to see such strong patterns in the data. It suggests to me that the results are not noisy, and that the benchmark actually managed to measure something meaningful in the models. Looking at the results, Europe often behaves as an outlier. I think this is fairly typical in multilingual settings. There are two reasons for this: (1) Europe is home to many high-resource and middle-resource languages, and (2) genealogically, many of its languages are closely related to each other and to the dominant English language as well. If a method is responsive to changes in data size or cross-lingual similarity, Europe may end up behaving unusually on a benchmark.

Cite

@misc{pikuliak_belebele,
  author       = "Matúš Pikuliak",
  title        = "Multilingual Reading Skills of Language Models",
  howpublished = "https://www.opensamizdat.com/posts/belebele",
  month        = "10",
  year         = "2023",
}
