Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, because score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores tend to fit narrow, context-specific rules rather than develop robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design a \textbf{C}omparison-\textbf{N}ative framework for \textbf{P}aper \textbf{E}valuation (\textbf{CNPE}) that integrates comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm that samples more informative and discriminative paper pairs from a collection. We then strengthen relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates the resulting preference signals into a global relative quality ranking. Experimental results show that our framework achieves an average relative improvement of \textbf{21.8\%} over the strong baseline DeepReview-14B and generalizes robustly to five previously unseen datasets. Code is available at \href{https://github.com/ECNU-Text-Computing/ComparisonReview}{GitHub}.
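The inference step described in the abstract, aggregating pairwise preferences into a global ranking, can be illustrated with a minimal sketch. The abstract does not state which aggregation rule CNPE uses, so the win-rate rule below is an assumption chosen for simplicity (a Bradley-Terry fit would be another common option). The function name `rank_by_win_rate` and the paper identifiers are hypothetical.

```python
from collections import defaultdict


def rank_by_win_rate(comparisons):
    """Rank papers by the fraction of pairwise comparisons they win.

    comparisons: iterable of (winner_id, loser_id) tuples, one per
    pairwise judgment. This is an illustrative aggregation rule and
    is not claimed to be the one used by CNPE.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Win rate per paper; papers that never won get a score of 0.
    scores = {paper: wins[paper] / games[paper] for paper in games}
    # Sort by descending win rate, breaking ties by identifier.
    return sorted(scores, key=lambda paper: (-scores[paper], paper))


# Hypothetical pairwise outcomes over four papers (winner listed first).
comparisons = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("B", "D")]
ranking = rank_by_win_rate(comparisons)
print(ranking)
```

Because only sampled pairs are compared, win rate normalizes by the number of comparisons each paper took part in, so papers that were sampled more often are not favored merely for appearing in more pairs.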