Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce \textbf{LongWeave}, which balances real-world relevance and verifiability through Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating the corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous evaluation of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation of 23 LLMs shows that even state-of-the-art models face significant challenges in long-form generation as real-world complexity and output length increase.