As AI-assisted grant proposals outpace manual review capacity, creating a kind of ``Malthusian trap'' for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework that probes LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a ``Council of Personas'' ensemble emulating expert panels. The section-level approach significantly outperforms the alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than the single-pass baseline. Detection varies substantially by perturbation type: alignment issues are readily identified, but clarity flaws are largely missed by all systems. Human evaluation shows that LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and all non-protected data.
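To make the evaluation setup concrete, below is a minimal sketch of the perturbation loop described above, assuming a Python harness; every name here (AXES, perturb, ReviewResult, the reviewer callables) is a hypothetical placeholder rather than the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

# The six quality axes targeted by the perturbations.
AXES = ["funding", "timeline", "competency", "alignment", "clarity", "impact"]

@dataclass
class ReviewResult:
    score: float         # reviewer score for the proposal
    detected_flaw: bool  # did the review surface the injected issue?

def evaluate(
    proposal: str,
    perturb: Callable[[str, str], str],
    reviewers: dict[str, Callable[[str], ReviewResult]],
) -> dict:
    """Inject one flaw per quality axis, then ask each review architecture
    (single-pass, section-level, persona council) whether it catches the
    flaw and how much its score shifts relative to the clean proposal."""
    # Score the unperturbed proposal once per review architecture.
    baselines = {name: review(proposal) for name, review in reviewers.items()}
    results = {}
    for axis in AXES:
        flawed = perturb(proposal, axis)  # e.g. inflate the budget, blur objectives
        for name, review in reviewers.items():
            out = review(flawed)
            results[(axis, name)] = {
                "detected": out.detected_flaw,
                "score_delta": out.score - baselines[name].score,
            }
    return results
```

Under this sketch, the per-axis detection rate is the fraction of injected flaws each architecture flags, and score_delta captures the scoring sensitivity that the abstract's reliability comparison refers to.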