Despite significant advances in LLM-based conversation summarization, evaluation remains limited by narrow scenario coverage, short input lengths, and small sample sizes. Furthermore, existing benchmarks often omit frontier reasoning systems and efficient small models, or lack fine-grained, multi-dimensional assessment. To bridge these gaps, we propose OmniCSEval, a unified benchmark comprising 1,800 diverse conversations across six real-world scenarios, with context lengths ranging from 128 to 32k tokens. For fine-grained evaluation, we employ a bidirectional fact-checking framework that integrates key fact matching to assess completeness and conciseness, alongside summary fact verification to evaluate faithfulness. To ensure reliable assessment, we establish a human-LLM collaborative pipeline for key fact extraction and a multi-LLM consensus verifier for summary fact decomposition. Using this framework, we evaluate 28 LLMs in four categories grouped by reasoning capability and model scale. Our empirical study yields insights into the cross-scenario challenges that current LLMs still face, the effects of reasoning and scale, and the efficiency and adaptability of reasoning models. We also provide guidance for system selection in real-world deployments.
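To make the bidirectional fact-checking idea concrete, the following is a minimal sketch of how the three scores could be aggregated once per-fact judgments are available. The abstract does not give the exact formulas, so the ratio definitions here are assumptions, and `FactCheckResult`, `ratio`, and `score_summary` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FactCheckResult:
    """Per-fact judgments for one (conversation, summary) pair. Hypothetical container."""
    # key_fact_covered[i]: key fact i (extracted from the conversation) is matched by the summary.
    key_fact_covered: List[bool]
    # summary_fact_matches_key[j]: summary fact j aligns with at least one key fact.
    summary_fact_matches_key: List[bool]
    # summary_fact_supported[j]: summary fact j is verified against the conversation.
    summary_fact_supported: List[bool]


def ratio(flags: List[bool]) -> float:
    """Fraction of True flags; returns 0.0 for an empty list to avoid division by zero."""
    return sum(flags) / len(flags) if flags else 0.0


def score_summary(result: FactCheckResult) -> Dict[str, float]:
    """Aggregate judgments into three scores (assumed definitions, not taken from the paper)."""
    return {
        # Key fact matching, conversation -> summary direction.
        "completeness": ratio(result.key_fact_covered),
        # Key fact matching, summary -> conversation direction.
        "conciseness": ratio(result.summary_fact_matches_key),
        # Summary fact verification.
        "faithfulness": ratio(result.summary_fact_supported),
    }


example = FactCheckResult(
    key_fact_covered=[True, True, False, True],
    summary_fact_matches_key=[True, False],
    summary_fact_supported=[True, True],
)
scores = score_summary(example)
print(scores)
```

In this sketch a summary that covers three of four key facts, with one of its two facts unrelated to any key fact but both facts supported by the conversation, scores 0.75 on completeness, 0.5 on conciseness, and 1.0 on faithfulness.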