To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.
翻译:为促进协作式AI辅助报告生成新模型的发展,我们推出MegaWika数据集。该数据集包含50种不同语言的1300万篇维基百科文章及其7100万份引用源材料。我们对这一数据集进行了多维度处理,除初始的维基百科引用提取与网络内容抓取外,还包括:非英语文章的跨语言翻译、以及用于自动化语义分析的FrameNet解析。MegaWika是当前规模最大的句子级报告生成资源,也是唯一的多语言报告生成数据集。我们通过语义分层抽样对该资源质量进行了人工分析。最后,我们为自动化报告生成的关键环节——跨语言问答与引文检索——提供了基线实验结果及预训练模型。