Assessing the quality of scientific research is essential for scholarly communication, yet widely used approaches are limited by poor scalability, subjectivity, and time delays. Recent advances in large language models (LLMs) offer new opportunities for automated research evaluation based on textual content. This study examines whether LLMs can support post-publication peer review tasks by benchmarking their outputs against expert judgments and citation-based indicators. Two evaluation tasks are constructed using articles from the H1 Connect platform: identifying high-quality articles, and performing finer-grained evaluation that includes article rating, merit classification, and expert-style commenting. Several model families, including BERT models, general-purpose LLMs, and reasoning-oriented LLMs, are evaluated under multiple learning strategies. Results show that LLMs perform well in coarse-grained evaluation tasks, achieving accuracy above 0.8 in identifying highly recommended articles. However, performance decreases substantially in fine-grained rating tasks. Few-shot prompting improves performance over zero-shot settings, while supervised fine-tuning produces the strongest and most balanced results. Retrieval-augmented prompting improves classification accuracy in some cases but does not consistently strengthen alignment with citation indicators. Overall, correlations between model outputs and citation indicators remain positive but moderate.