7–11 Apr 2026
Machynlleth, Wales, UK
Europe/Dublin timezone

Reducing Noise, Interrogating Bias: The Case for LLM Experimentation in Peer Review

Not scheduled
20m
Machynlleth, Wales, UK
Llwyngwern Quarry, Pantperthog, Machynlleth SY20 9AZ, UK
https://cat.org.uk/

Description

Peer review has a noise problem. The "reviewer lottery" means a manuscript's fate often depends on who is assigned, their topical familiarity, their individual priorities, and how much time they have that week. This idiosyncratic variance treats similar work inconsistently: a clear source of unfairness.

Large Language Models (LLMs) could substantially reduce this noise: even non-deterministic LLMs appear to deliver far more consistent assessments than human reviewers, whose evaluations vary wildly across individuals and occasions. On noise reduction alone, the case for LLM integration is strong.

Bias, however, is more complicated. Human peer review is not merely noisy, but biased against low-prestige institutions, specific geographies, and other status markers, even under blinding. Yet we worry about algorithmic bias intensely while treating human bias as an unfortunate but acceptable baseline. This asymmetry is unjustified.

Both systems are biased, but their biases differ in structure. Algorithmic bias is relatively consistent, thus detectable and potentially correctable. Human bias is distributed and harder to measure. Yet inconsistency cuts both ways: distributed bias at least allows some unconventional work to pass through sympathetic reviewers, whereas consistent bias offers no such escape. Whether these trade-offs favour humans or algorithms is an empirical question that needs addressing.

In principle, triangulating between human and algorithmic assessment could help detect and reduce bias in both. But realising this requires transparency. Blanket prohibitions, as currently practised, push LLM use underground, making genuine comparison and mutual learning impossible. We need open experiments and conversations, not bans, and I would like to discuss how to approach this challenge.

Potential discussion questions

What low-risk research could compare human-only vs. LLM-augmented review on both consistency (noise) and fairness (bias)? (See the sketch below.)
How much of this can be done with existing (open) data?
How can LLMs 'augment' rather than replace human review?
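
To make the first question concrete, here is a minimal sketch, in Python, of how "noise" and "bias" might be scored from a set of review records. Everything in it is illustrative and assumed rather than taken from the abstract: the record layout (manuscript ID, score, a status-marker flag), the toy scores, and the simple operationalisations (noise as average within-manuscript score variance, bias as the mean score gap for flagged manuscripts).

```python
# Hypothetical sketch: scoring review "noise" and "bias" from per-manuscript review records.
# Assumed record layout: (manuscript_id, score, low_prestige_flag).

from collections import defaultdict
from statistics import mean, pvariance

def noise(reviews):
    """Average within-manuscript score variance (reviewer or run disagreement)."""
    by_manuscript = defaultdict(list)
    for ms_id, score, _ in reviews:
        by_manuscript[ms_id].append(score)
    variances = [pvariance(scores) for scores in by_manuscript.values() if len(scores) > 1]
    return mean(variances) if variances else 0.0

def bias_gap(reviews):
    """Mean score gap between unflagged and flagged manuscripts.
    Positive values mean flagged (e.g. low-prestige) manuscripts score lower on average."""
    flagged = [score for _, score, flag in reviews if flag]
    other = [score for _, score, flag in reviews if not flag]
    return mean(other) - mean(flagged)

# Toy, made-up data: (manuscript_id, score on a 1-5 scale, low_prestige_flag)
human_reviews = [
    ("m1", 2, False), ("m1", 5, False),   # same manuscript, very different human scores
    ("m2", 3, True),  ("m2", 1, True),
    ("m3", 4, False), ("m3", 2, False),
]
llm_reviews = [
    ("m1", 4, False), ("m1", 4, False),   # repeated LLM runs: more consistent...
    ("m2", 2, True),  ("m2", 3, True),    # ...but possibly systematically biased
    ("m3", 4, False), ("m3", 4, False),
]

for label, reviews in [("human", human_reviews), ("LLM", llm_reviews)]:
    print(f"{label}: noise={noise(reviews):.2f}, bias gap={bias_gap(reviews):.2f}")
```

On real data one would of course control for manuscript quality and use proper multilevel models rather than raw gaps; the point is only that both quantities are, in principle, measurable from existing open review datasets.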

Author

Lukas Wallrich (Birkbeck, University of London)

Presentation materials

There are no materials yet.