※ Open Samizdat

2024-04-25

Did we just receive an AI-generated meta-review?

Abstract

I recently received what I believe is an LLM-generated meta-review for one of my papers. This is the first time for me, but I am afraid that it is not the last. I will talk about why I think this review was generated automatically, but more importantly, I would like to start a debate about the ethics of using LLMs in the reviewing process. I believe that the entire trust-based concept of peer review might be threatened by the proliferation of LLMs.
Last February, we submitted a paper about the GEST dataset to the ACL Rolling Review system. In April, we received three reviews and then a meta-review. As soon as I read the meta-review, I smelled a text generated with an LLM. While there are valid ways of using LLMs for reviewing, for example as writing assistants, they still require attentive human oversight. I came to the conclusion that the level of human oversight in this case might have been lower than I would prefer. The meta-review is not completely unreasonable and mostly summarizes points raised by the reviewers, but there are still some issues that might have been caused by how it was created. The fact that I cannot tell how much human effort actually went into it bothers me as well.
This blog post has two sections. The Evidence section describes why I believe that the meta-review is LLM-generated. Other than going by vibes, it is surprisingly difficult to describe what exactly smells like an LLM in a text. I will also go through various issues with how the ideas are formulated in the meta-review, such as repetitiveness or hallucinations, and how these issues might have been caused by the generation process. I think this is a useful exercise, mainly because it provides insights into what you can expect when you receive an automatically generated meta-review.
The Implications section discusses the impacts on authors when they receive LLM-generated reviews, as well as the impacts on the entire peer review process and the scientific community at large. LLMs have the capability to damage peer review, one of the cornerstones of today’s science, and there are signs that this is already happening. My conclusion is that they should be completely banned from the review process.
My goal here is to describe how it feels to be on the receiving end of such a review, but also to open the discussion about using LLMs in reviewing. As far as I am aware, the top AI conferences are only now formulating their positions on this issue. My guess is that other fields lag behind, as they do not have AI experts readily available. The risk that LLMs will be misused and will harm authors and their careers is imminent and immense.

Evidence

This analysis is speculative, but I consider the evidence for the hypothesis that the meta-review was indeed LLM-generated to be really strong. Still, I might be proven wrong. The meta-review can be found here.

Style

My original suspicion came from the very specific style of the text. I work with LLMs as a researcher almost daily, so I would like to think that I have a pretty fine-tuned sense for detecting their outputs. In general, the style is very verbose and almost tiresome in how little meat and how much noise there is. Consider the following snippet:
Additionally, the utility of the resources provided, including the datasets and metrics, is a substantial boon for researchers in the field, furnishing tools for further exploration of how stereotypes are propagated through AI systems.
This is such a convoluted way of basically saying that the provided datasets and metrics are useful. It would be highly unusual for a human reviewer to write like this. The reviews are usually much more concise and to the point. Based on this smell, I decided to check what AI detection tools have to say.

AI detection tools

I have some experience with training and benchmarking tools for detecting AI-generated texts, so I also know that they are usually far from perfect. To be sure, I decided to use an ensemble of such tools. I googled "ai text detection" and used the top 5 hits. I ran the tools on (1) the meta-review we received, (2) one of the human-written reviews we received, and (3) a meta-review I generated with ChatGPT and edited to match the format.
Tool         (1) Our meta-review   (2) Human review   (3) ChatGPT meta-review
Copyleaks    100%                  0%                 100%
QuillBot     100%                  0%                 80%
Scribbr      86%                   14%                100%
ZeroGPT      0%                    0%                 0%
GPTZero      100%                  0%                 100%
Apart from ZeroGPT, which does not seem to work at all, all the other tools gave positive results for both the meta-review we received and the one I generated with ChatGPT. The tools that highlight which specific parts are LLM-generated all highlighted the entire meta-review, including the discussion about the final verdict. This little experiment further confirmed my suspicions, so I decided to analyze the content of the review more thoroughly.

Repetitive text

The first thing I noticed after reading the meta-review more carefully is the repetitiveness in basically all the paragraphs. The most egregious example is the two paragraphs in the weaknesses section. I have color-coded the sentences that are about the same objections to make it more obvious:
The paper falls short in several areas that need attention. First, the evaluation metrics introduced for measuring gender bias require clearer definitions and more robust validation. These metrics need to accurately reflect what they measure and be interpretable to the readers. The paper’s lack of exploration into the implications of detected biases on downstream applications also limits its practical relevance. Understanding how gender bias in language models impacts real-world tasks such as translation accuracy or co-reference resolution could significantly enhance the paper’s applicability. Moreover, the comparative analysis with related works is insufficient. A more thorough comparison and a clearer exposition of how this study differs from and improves upon previous efforts would solidify the paper’s place in the literature.

The authors should consider revising their metrics to ensure they accurately reflect the presence of gender biases and are interpretable to practitioners and researchers. Additionally, expanding the discussion on how detected biases affect practical applications will make the findings more relevant to a wider audience. This could involve linking bias metrics to performance in specific tasks through case studies or additional experiments. Enhancing the comparative analysis with existing benchmarks and clarifying the unique contributions of this study would also help position the paper more clearly within the existing research landscape. Addressing minor presentation issues such as overlapping figures, ambiguous statements, and providing clearer definitions for terms like “gender experts” would also improve the paper’s clarity and professionalism.
This is basically the same paragraph written twice! Both paragraphs raise the same three objections in the same order, and sometimes they even use identical wording (e.g., “accurately reflect”, “comparative analysis”). I do not believe that a human could write like this by mistake.

Hallucinated objections

Next, I looked for statements that could be hallucinations, i.e., claims that the model made up. I found several such cases, although they are all inspired by what the reviewers actually said. This is not an entirely LLM-specific risk; a human meta-reviewer might misunderstand or make wrong assumptions about the reviews as well. Fortunately, none of these hallucinations are particularly important for the overall message of the meta-review.

Contradictions

This is a bit subtle, but the strengths and weaknesses sections contradict each other. When these two sections describe the metrics, they use significantly different language. On the one hand, our “methodological rigor is a significant strength”, our metrics are “critical for understanding and mitigating gender bias in AI”, and they are even “a substantial boon for researchers”. Based on this language, our metrics seem pretty great! But at the same time, we “should consider revising our metrics to ensure they accurately reflect the presence of gender biases” and they also “require clearer definitions and more robust validation.” These two viewpoints rule each other out, and I have a hard time imagining a human writing such contradictory remarks.

Implications

Considering the evidence above, I am convinced that our meta-review is LLM-generated. The question remains: how much human oversight went into the generation process? Some of the issues in our meta-review make me think that the amount of oversight in this case might have been less than I would have liked. An attentive meta-reviewer should have noticed that two generated paragraphs are almost identical in meaning; should have ensured that the reviewers’ objections are reproduced carefully; and should have ensured that there are no contradictions in the text. And if the meta-reviewer was not that attentive, I find it hard to trust their judgment when they decided on the final verdict for our paper.
The unfairness of not getting the paper we spent months on properly reviewed is not the only problem here. We now also have to deal with the consequences of the meta-review — writing rebuttals, revisions, and cover letters, and conducting experiments — all based on what an LLM generated. The reviewers in the next round will also have to read the generated text and reflect upon it. An LLM-generated review for one paper can easily lead to tens of researcher-hours spent pondering what a stochastic parrot said.
Despite my speculations, it is impossible to tell how much human input actually went into this meta-review. The meta-reviewer might have simply copy-pasted the entire reviews and asked an LLM to summarize them; it is completely feasible for today’s LLMs to produce a meta-review such as the one we received this way. Or, the meta-reviewer could have actually done their job properly, engaging with the reviews and the discussion and only using an LLM to summarize their own notes. With LLMs as available as they are now, this uncertainty will always be part of the peer review process.
This uncertainty is an extremely destructive concept that can easily endanger trust within scientific communities. Subjectively, trust in peer review in the AI community was pretty low even before LLMs. If we start having a significant percentage of reviews generated automatically, trust will plummet even more. The publish-or-perish culture is the underlying root cause here, creating all the wrong incentives, but the misuse of LLMs can definitely synergize with it and reinforce the negative feedback loop.
Worryingly, our paper is not an exception. Up to 16.9% of reviews in high-impact conferences such as NeurIPS, ICLR, or EMNLP are already generated with LLMs. This is frankly a shocking number. If this number is true, you have only a 48% chance of receiving a fully human set of three reviews and one meta-review (0.831⁴ ≈ 0.48). The authors also show that these LLM-generated reviews are systematically less confident and submitted closer to the deadline, indicating that this cannot be attributed solely to using LLMs as writing assistants. Even a single LLM-generated review can lead to tens of person-hours being lost on unnecessary work. If we scale this to the jumbo conferences with thousands of papers, we could seriously be talking about several human-decades’ worth of researcher time being lost due to LLMs every time a major conference happens!
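To make the arithmetic explicit, here is a minimal back-of-envelope sketch in Python. Only the 16.9% figure comes from the study cited above; the independence assumption, the 10,000 papers, the 20 wasted hours per generated review, and the 2,000 working hours per year are purely illustrative guesses of mine.

p_llm = 0.169                       # share of LLM-generated reviews reported in the cited study
p_all_human = (1 - p_llm) ** 4      # three reviews + one meta-review, assumed independent
print(f"chance of a fully human-written set: {p_all_human:.0%}")   # ~48%

papers = 10_000                     # illustrative guess for a "jumbo" conference
docs_per_paper = 4                  # three reviews + one meta-review
hours_per_llm_review = 20           # illustrative guess for "tens of person-hours" wasted
wasted_hours = papers * docs_per_paper * p_llm * hours_per_llm_review
print(f"researcher time lost: ~{wasted_hours / 2000:.0f} person-years")   # ~2,000 working hours per year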
If we continue to allow LLMs to influence the peer review process, we are firmly entering cargo cult science territory — we are pretending that scientific discourse is happening and emulating all the usual steps, but the core idea of why this process was originally created eludes us. My conclusion is that the use of LLMs for reviewing should be strictly forbidden. What they bring to the table as writing assistants is not worth the harm they can cause. Most researchers would rather receive a review with bad English they can trust than an eloquent hallucination. The entire premise of peer review is that an expert in the field carefully considers your paper. Banning LLMs would unfortunately not guarantee this bare minimum, but at least it would take away a readily available tool that tempts some reviewers to make their work easier.

Cite

@misc{pikuliak_llm_meta_review,
  author       = "Matúš Pikuliak",
  title        = "Did we just receive an AI-generated meta-review?",
  howpublished = "https://www.opensamizdat.com/posts/llm_meta_review",
  month        = "04",
  year         = "2024",
}
