Paper Summary:
This paper explores gender stereotypes in a range of language models and machine translation systems for English and several Slavic languages. The authors introduce a new evaluation benchmark, GEST, covering 16 categories of gender stereotypes, and propose dedicated metrics to assess gender bias in these systems. The paper offers valuable insights into the study of gender bias in AI, but it requires specific revisions to improve its clarity and robustness. I rated this paper a 3, though it straddles the line between a 3 and a 4: it is fundamentally solid and could be greatly improved by incorporating the suggestions from the reviewers and this meta-review.

Summary Of Strengths:
The paper is particularly strong in its novelty and relevance. It introduces the GEST benchmark and an accompanying dataset spanning English and nine Slavic languages, addressing a gap in existing benchmarks, which focus predominantly on major languages; this enables a broader investigation of gender bias in AI systems. The methodological rigor is another significant strength: the specific metrics developed to quantify gender stereotypes in language models and translation systems support a detailed analysis of bias. These contributions are important for understanding and mitigating gender bias in AI. In addition, the resources provided, including the dataset and metrics, are a substantial asset for researchers in the field, offering tools for further study of how stereotypes propagate through AI systems.

Summary Of Weaknesses:
The paper falls short in several areas. First, the evaluation metrics introduced for measuring gender bias need clearer definitions and more robust validation; it should be evident what each metric actually measures and how readers should interpret its values. Second, the paper does not explore the implications of the detected biases for downstream applications, which limits its practical relevance. Showing how gender bias in language models affects real-world tasks such as translation accuracy or coreference resolution would significantly strengthen the paper's applicability. Third, the comparative analysis with related work is insufficient; a more thorough comparison and a clearer exposition of how this study differs from and improves upon previous efforts would solidify the paper's place in the literature.

The authors should revise their metrics so that they accurately reflect the presence of gender bias and are interpretable to practitioners and researchers. They should also expand the discussion of how the detected biases affect practical applications, for example by linking the bias metrics to performance on specific tasks through case studies or additional experiments, which would make the findings relevant to a wider audience. Strengthening the comparison with existing benchmarks and clarifying the study's unique contributions would further help position the paper within the existing research landscape. Finally, addressing minor presentation issues such as overlapping figures and ambiguous statements, and providing clearer definitions for terms like "gender experts", would improve the paper's clarity and professionalism.