Large Language Models (LLMs) have exhibited remarkable success in long-form context comprehension tasks. However, their capacity to generate long-form content, such as reports and articles, remains insufficiently explored. Current benchmarks do not adequately assess LLMs' ability to produce informative and comprehensive content, which calls for a more rigorous evaluation approach. In this study, we introduce \textsc{ProxyQA}, a framework for evaluating long-form text generation that comprises in-depth, human-curated \textit{meta-questions} spanning various domains. Each meta-question is paired with \textit{proxy-questions} and their annotated answers. LLMs are prompted to generate extensive content in response to these meta-questions. \textsc{ProxyQA} then supplies the generated content to an evaluator as background context and scores that content by how well the evaluator answers the \textit{proxy-questions}. We examine multiple LLMs, and the results underscore how demanding \textsc{ProxyQA} is as a high-quality assessment tool. Human evaluation demonstrates that evaluating through \textit{proxy-questions} is highly self-consistent and well correlated with human criteria. The dataset and leaderboard will be available at \url{https://github.com/Namco0816/ProxyQA}.
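The evaluation loop described above can be illustrated with a minimal sketch. This is not the released implementation: the names `MetaQuestion`, `ProxyQuestion`, `proxy_score`, `toy_generate`, and `toy_evaluate` are hypothetical, the string answer format and normalized exact-match comparison are assumptions, and the toy generator and evaluator merely stand in for the LLM under test and the evaluator model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProxyQuestion:
    question: str
    answer: str  # annotated gold answer (string format is an assumption)


@dataclass
class MetaQuestion:
    prompt: str
    proxies: List[ProxyQuestion]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def proxy_score(
    meta: MetaQuestion,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], str],
) -> float:
    """Fraction of proxy-questions the evaluator answers correctly
    when given the generated content as its only background context."""
    if not meta.proxies:
        raise ValueError("meta-question has no proxy-questions")
    content = generate(meta.prompt)  # long-form output of the model under test
    correct = sum(
        normalize(evaluate(content, pq.question)) == normalize(pq.answer)
        for pq in meta.proxies
    )
    return correct / len(meta.proxies)


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return "The transformer architecture relies on self-attention."


def toy_evaluate(context: str, question: str) -> str:
    # Answers "yes" only if the last word of the question appears in the context.
    keyword = question.rstrip("?").split()[-1].lower()
    return "yes" if keyword in context.lower() else "no"


meta = MetaQuestion(
    prompt="Write a report on the transformer architecture.",
    proxies=[
        ProxyQuestion("Does the text mention self-attention?", "yes"),
        ProxyQuestion("Does the text mention recurrence?", "yes"),
    ],
)
score = proxy_score(meta, toy_generate, toy_evaluate)
print(score)
```

In this sketch the generated text covers only one of the two proxy-questions, so the score reflects partial coverage; a real setup would replace both toy callables with model calls and would likely need a more tolerant answer comparison.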