Description
Peer review has a noise problem. The "reviewer lottery" means a manuscript's fate often depends on who is assigned: their topical familiarity, their individual priorities, and how much time they have that week. This idiosyncratic variance treats similar work inconsistently: a clear source of unfairness.
Large Language Models (LLMs) can substantially reduce this noise. Even non-deterministic LLMs appear to deliver far more consistent assessments than human reviewers, whose evaluations vary wildly across individuals and occasions. On noise reduction alone, the case for LLM integration is strong.
Bias, however, is more complicated. Human peer review is not merely noisy, but biased against low-prestige institutions, specific geographies, and other status markers, even under blinding. Yet we worry about algorithmic bias intensely while treating human bias as an unfortunate but acceptable baseline. This asymmetry is unjustified.
Both systems are biased, but their biases differ in structure. Algorithmic bias is relatively consistent, thus detectable and potentially correctable. Human bias is distributed and harder to measure. Yet inconsistency cuts both ways: distributed bias at least allows some unconventional work to pass through sympathetic reviewers, whereas consistent bias offers no such escape. Whether these trade-offs favour humans or algorithms is an empirical question that needs addressing.
In principle, triangulating between human and algorithmic assessment could help detect and reduce bias in both. But realising this requires transparency. Blanket prohibitions, as currently practised, push LLM use underground, making genuine comparison and mutual learning impossible. We need open experiments and conversations, not bans, and I would like to discuss how to approach this challenge.
Potential discussion questions
What low-risk research could compare human-only vs. LLM-augmented review on both consistency (noise) and fairness (bias)? How much of this can be done with existing (open) data? How can LLMs 'augment' rather than replace human review?
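As one deliberately simplified starting point for that discussion, the sketch below illustrates how "noise" and "bias" might be operationalised on a table of review scores. It is not a proposed study design: the column names (manuscript_id, reviewer_type, prestige_group, score) and the high/low prestige grouping are hypothetical placeholders, and a real comparison would need matched or otherwise quality-controlled manuscripts and more careful bias definitions.

```python
# Illustrative sketch only, assuming a hypothetical table of review scores.
import pandas as pd

def noise_and_bias(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise noise and bias per reviewer type (e.g. human vs. LLM-augmented)."""
    rows = []
    for rtype, g in df.groupby("reviewer_type"):
        # Noise: how much do assessments of the *same* manuscript disagree?
        # Here: mean within-manuscript standard deviation of scores.
        noise = g.groupby("manuscript_id")["score"].std().mean()
        # Bias: average score gap between prestige groups. Only meaningful if
        # manuscripts are matched or comparable on quality (a strong assumption).
        means = g.groupby("prestige_group")["score"].mean()
        bias_gap = means.get("high", float("nan")) - means.get("low", float("nan"))
        rows.append({"reviewer_type": rtype, "noise": noise, "bias_gap": bias_gap})
    return pd.DataFrame(rows)

# Toy usage with made-up numbers:
# df = pd.DataFrame({
#     "manuscript_id":  [1, 1, 2, 2, 1, 1, 2, 2],
#     "reviewer_type":  ["human"] * 4 + ["llm_augmented"] * 4,
#     "prestige_group": ["high", "high", "low", "low"] * 2,
#     "score":          [6, 3, 4, 2, 5, 5, 4, 4],
# })
# print(noise_and_bias(df))
```

Even this crude decomposition makes the trade-off in the description measurable: a lower noise figure with a larger bias gap would tell a different story than the reverse.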