Scientific writing is an expert-domain task that demands deep domain knowledge, adherence to task-specific requirements, and reasoning capabilities that leverage that knowledge to satisfy those requirements. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed to evaluate diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over the sparse knowledge of scientific domains when interpreting task-dependent, multi-faceted criteria. Moreover, fine-tuning a separate evaluator for each task is costly and impractical in low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that first optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime substantially improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.