Generating comprehensive and accurate Wikipedia articles for newly emerging events in real-world settings remains a significant challenge. Existing attempts fall short either because they focus only on short snippets or because their metrics are insufficient for evaluating real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark of 1,320 entries designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario in which structured, full-length Wikipedia articles with citations are generated for new events from input documents drawn from web sources. For evaluation, we combine systematic metrics and LLM-based metrics to assess verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments with various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical structure-based methods generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.