Research funding agencies are increasingly exploring automated tools to support early-stage proposal screening. Recent advances in large language models (LLMs) have generated optimism regarding their use for text-based evaluation, yet their institutional suitability for high-stakes screening decisions remains underexplored. In particular, there is limited empirical evidence on how automated screening systems perform when evaluated against institutional error costs. This study compares two automated approaches for proposal screening against the priorities of a national funding call: a transparent, rule-based method using term frequency-inverse document frequency (TF-IDF) with domain-specific keyword engineering, and a semantic classification approach based on a large language model. Using selection committee decisions as ground truth for 959 proposals, we evaluate performance with particular attention to error structure. The results show that the TF-IDF-based approach outperforms the LLM-based system across standard metrics, achieving substantially higher recall (78.95\% vs. 45.82\%) and producing far fewer false negatives (68 vs. 175). The LLM-based system excludes more than half of the proposals ultimately selected by the committee. While false positives can be corrected through subsequent peer review, false negatives represent an irrecoverable exclusion from expert evaluation. By foregrounding error asymmetry and institutional context, this study demonstrates that the suitability of automated screening systems depends not on model sophistication alone, but on how their error profiles, transparency, and auditability align with research evaluation practice. These findings suggest that evaluation design and error tolerance should guide the use of AI-assisted screening tools in research funding more broadly.
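The recall figures reported above follow directly from the false-negative counts. A minimal sketch, assuming the committee selected 323 proposals in total (a figure implied by, not stated in, the abstract: 68 false negatives at 78.95\% recall gives 255 true positives, and 255 + 68 = 323):

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN): the fraction of committee-selected
    proposals that an automated screen retains for peer review."""
    return tp / (tp + fn)

# False-negative counts are taken from the abstract (68 vs. 175).
# True-positive counts (255 and 148) are inferred under the assumed
# total of 323 committee-selected proposals.
tfidf_recall = recall(tp=255, fn=68)   # ≈ 0.7895
llm_recall = recall(tp=148, fn=175)    # ≈ 0.4582
```

Because a false negative removes a proposal from the pipeline before any expert sees it, recall is the metric that captures the irrecoverable error the study foregrounds; precision shortfalls, by contrast, only add reviewable workload downstream.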