Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization

Multi-role dialogue summarization requires modeling complex interactions among multiple speakers while preserving role-specific information and factual consistency. However, most existing methods optimize for automatic metrics such as ROUGE and BERTScore, which favor surface-level imitation of references rather than genuine gains in faithfulness or alignment with human preferences. We propose a novel framework that couples explicit cognitive-style reasoning with reward-based optimization for multi-role dialogue summarization. Our method first distills structured reasoning traces (e.g., step-by-step inferences and intermediate reflections) from a large teacher model and uses them as auxiliary supervision to initialize a reasoning-aware summarizer via staged supervised fine-tuning. It then applies GRPO with a dual-principle reward that blends metric-based signals with human-aligned criteria targeting key information coverage, implicit inference, factual faithfulness, and conciseness. Experiments on multilingual multi-role dialogue benchmarks show that our method matches strong baselines on ROUGE and BERTScore. Specifically, results on CSDS confirm the framework's stability in semantic consistency, while in-depth analysis on SAMSum demonstrates clear gains in factual faithfulness and model-based preference alignment. These findings underscore the value of reasoning-aware and preference-aware training for reliable dialogue summarization. Checkpoints and datasets are available at https://huggingface.co/collections/NebulaPixel/summorchestra-multirole-summary.

翻译：多角色对话摘要需要建模多个说话者之间的复杂交互，同时保留角色特定信息和事实一致性。然而，现有大多数方法优化的是ROUGE和BERTScore等自动度量，这些度量偏好对参考文本的表面级模仿，而非在忠实性或与人类偏好对齐方面的真正提升。我们提出了一种新颖框架，将显式认知风格推理与基于奖励的优化相结合，用于多角色对话摘要。该方法首先从大型教师模型中提炼结构化推理轨迹（如逐步推理和中间反思），并将其作为辅助监督信号，通过分阶段监督微调初始化一个推理感知摘要器。随后，该方法应用GRPO与双原则奖励，该奖励融合了基于度量的信号和以人类偏好为准则的目标，涵盖关键信息覆盖、隐式推理、事实忠实性和简洁性。在多语言多角色对话基准上的实验表明，我们的方法在ROUGE和BERTScore上可与强基线方法媲美。具体而言，CSDS上的结果证实了该框架在语义一致性上的稳定性，而SAMSum上的深入分析则展示了在事实忠实性和基于模型的偏好对齐方面的显著提升。这些发现强调了推理感知和偏好感知训练对于可靠对话摘要的价值。检查点和数据集可在https://huggingface.co/collections/NebulaPixel/summorchestra-multirole-summary获取。