Large Language Models (LLMs) have achieved remarkable success in understanding long-form content. However, their capability to generate long-form content, such as reports and articles, remains relatively underexplored and inadequately assessed by existing benchmarks. Prevalent evaluation methods rely predominantly on crowdsourcing, which is labor-intensive and inefficient, while automated metrics such as the ROUGE score show poor agreement with human judgment. In this paper, we propose ProxyQA, an innovative framework dedicated to assessing long-form text generation. ProxyQA comprises in-depth, human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. LLMs are tasked with generating extensive content in response to these meta-questions. By engaging an evaluator and providing the generated text as contextual background, ProxyQA assesses the quality of the generated content through the evaluator's accuracy in answering the proxy-questions. We evaluate multiple LLMs, highlighting ProxyQA's demanding nature as a high-quality assessment tool. Human evaluation demonstrates that the proxy-question method is notably self-consistent and aligns closely with human evaluative standards. The dataset and leaderboard are available at \url{https://proxy-qa.com}.
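To make the scoring protocol concrete, the following is a minimal sketch of the proxy-question evaluation loop, assuming a hypothetical \texttt{evaluator\_answer} callable (e.g., an LLM queried with the generated text as its only context) and simple exact-match grading; the actual evaluator and answer-matching criteria used by ProxyQA may differ.
\begin{verbatim}
# Minimal sketch (assumption): score a long-form generation by the
# evaluator's accuracy on pre-annotated proxy-questions.
def proxyqa_score(generated_text, proxy_questions, evaluator_answer):
    # proxy_questions: list of (question, gold_answer) pairs
    # evaluator_answer: callable(context, question) -> predicted answer
    correct = 0
    for question, gold_answer in proxy_questions:
        # The evaluator sees only the generated text as background context.
        prediction = evaluator_answer(context=generated_text,
                                      question=question)
        if prediction.strip().lower() == gold_answer.strip().lower():
            correct += 1
    return correct / len(proxy_questions) if proxy_questions else 0.0
\end{verbatim}
Under this scheme, a higher score indicates that the generated text covers more of the key information probed by the proxy-questions.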