MegaWika: Millions of reports and their sources across 50 diverse languages

Samuel Barham, Orion Weller, Michelle Yuan, Kenton Murray, Mahsa Yarmohammadi, Zhengping Jiang, Siddharth Vashishtha, Alexander Martin, Anqi Liu, Aaron Steven White, Jordan Boyd-Graber, Benjamin Van Durme

From arXiv; submitted to ACL, 2023

To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.

