This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with results from other ASR datasets for the languages concerned.