Previous research has shown that journal article quality ratings from the cloud-based Large Language Model (LLM) families ChatGPT and Gemini, and from the medium-sized open-weights LLM Gemma3 27b, correlate moderately with expert research quality scores. This article assesses whether other medium-sized LLMs, smaller LLMs, and reasoning models have similar abilities. This is tested with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1 on a dataset of 2,780 medical, health and life science papers in six fields, with two different gold standards, one of them novel. Few-shot prompting and score averaging approaches are also evaluated. The results suggest that medium-sized LLMs perform similarly to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Reasoning models did not have a clear advantage. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and there is weak evidence that few-shot prompts (four examples) tend to help. Overall, the results show, for the first time, that smaller LLMs with more than 4b parameters have a substantial capability to rate journal articles for research quality, especially if score averaging is used. Reasoning, however, does not give an advantage for this task and is therefore not recommended, because it is slow. The use of LLMs to support research evaluation is now more credible since multiple variants have a similar ability, including many that can be deployed offline in a secure environment without substantial computing resources.