Paper Summary: The paper introduces GEST, a dataset designed to examine gender biases in language models (LMs) and machine translation (MT) systems across English and nine Slavic languages. It covers 16 categories of gender stereotypes, curated with input from sociologists, and is built from gender-neutral English sentences. The authors evaluate a range of models using newly proposed metrics that measure stereotypical reasoning and the tendency to generate exclusively male-gendered outputs. Weighing the strengths and weaknesses highlighted by the reviewers, the paper makes a solid contribution to the study of gender bias in LMs and MT systems, although there is room for improvement in the methodology and the interpretation of results.

Summary of Strengths: The GEST dataset, created in consultation with gender experts, provides a valuable resource for studying stereotypical behavior in language models. Its coverage of nine Slavic languages and 16 stereotype categories enables a comprehensive analysis of biases and fills a gap left by previous benchmarks.

Summary of Weaknesses: The proposed evaluation metrics would benefit from refinement, as reviewers question their interpretability and robustness; in particular, the differing ranges of the metrics and the reliance on automatic processing steps may introduce biases and inaccuracies. Reviewers also note that the evaluation itself may be skewed, since gender-neutral translations are discarded and the LM evaluation considers only female- and male-coded tokens, leaving no room for non-stereotypical outputs. Finally, some reviewers question the novelty of the findings, suggesting that the identified biases are already well documented in the field.