Existing benchmarks for summarization quality evaluation often lack diverse input scenarios, focus on narrowly defined dimensions (e.g., faithfulness), and rely on subjective, coarse-grained annotation schemes. To address these shortcomings, we create the UniSumEval benchmark, which extends the range of input contexts (e.g., domain, length) and provides fine-grained, multi-dimensional annotations. We use AI assistance in data creation to identify potentially hallucinogenic input texts and to help human annotators reduce the difficulty of fine-grained annotation tasks. With UniSumEval, we benchmark nine of the latest language models as summarizers, offering insights into their performance across varying input contexts and evaluation dimensions. Furthermore, we conduct a thorough comparison of SOTA automated summary evaluators. Our benchmark data will be available at https://github.com/DISL-Lab/UniSumEval-v1.0.