Recent advances in agentic workflows have enabled the automation of tasks such as professional document generation. However, existing approaches focus primarily on textual quality, neglecting visual structure and style, which are crucial for readability and engagement. This gap stems mainly from the lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. To address this, we propose DocReward, a document reward model that evaluates documents based on their structure and style. The model is trained under a textual-quality-agnostic framework so that its professionalism assessments are not influenced by textual quality. To this end, we construct DocPair, a multi-domain dataset of 117K document pairs spanning 32 domains and 267 document types, where each pair comprises a high- and a low-professionalism document with identical content but different structure and style. This setup enables the model to evaluate professionalism comprehensively and independently of textual quality. DocReward is trained with the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in accuracy. Extrinsic RL experiments further validate its effectiveness in guiding professional document generation.
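The Bradley-Terry objective mentioned above can be sketched as follows. This is a minimal illustration of pairwise reward-model training, not the authors' implementation; function and variable names are assumptions:

```python
import math

def bradley_terry_loss(score_high: float, score_low: float) -> float:
    """Pairwise Bradley-Terry loss for a (high-, low-professionalism) pair.

    L = -log(sigmoid(s_high - s_low)): the loss shrinks as the reward
    model scores the high-professionalism document above the low one,
    and grows when the predicted ranking contradicts the annotation.
    """
    margin = score_high - score_low
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

For example, a correctly ordered pair (`score_high > score_low`) yields a smaller loss than the reversed pair, which is what drives the scorer toward the annotated ranking.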