This preprint describes work in progress on LR-Sum, a new permissively licensed dataset created to enable further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.