In this paper we describe the Portuguese-language podcast dataset we have released for academic research purposes. We give an overview of how the data was sampled, descriptive statistics over the collection, and information about the distribution over Brazilian and Portuguese dialects. We report results from experiments on multi-lingual summarization, showing that podcast transcripts can be summarized well by a system supporting both English and Portuguese. We also report experiments on Portuguese podcast genre classification using text metadata. Combining this collection with the previously released English-language collection opens up the potential for multi-modal, multi-lingual and multi-dialect podcast information access research.