The Speech Wikimedia Dataset is a publicly available compilation of audio recordings and their transcriptions, extracted from Wikimedia Commons. It includes 1,780 hours (195 GB) of CC-BY-SA-licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making the dataset suitable for training speech recognition, speech translation, and machine translation models.
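To illustrate why one audio file with several transcriptions can serve all three tasks, the following is a minimal sketch of how a single record might be split into training pairs. The `Record` schema, its field names, and the assumption that the spoken language of each file is known are hypothetical; they are not taken from the dataset's actual format.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Record:
    # Hypothetical schema: field names are illustrative, not the dataset's own.
    audio_path: str
    spoken_lang: str                 # assumed to be known for each audio file
    transcriptions: dict[str, str]   # language code -> transcription text


def derive_examples(rec: Record):
    """Split one record into ASR, speech-translation, and MT training pairs."""
    asr, st, mt = [], [], []
    for lang, text in rec.transcriptions.items():
        if lang == rec.spoken_lang:
            # Same language as the audio: a speech recognition pair.
            asr.append((rec.audio_path, text))
        else:
            # Different language from the audio: a speech translation pair.
            st.append((rec.audio_path, lang, text))
    # Every ordered pair of transcriptions gives a text-to-text MT pair.
    for (src, src_text), (tgt, tgt_text) in permutations(rec.transcriptions.items(), 2):
        mt.append((src, src_text, tgt, tgt_text))
    return asr, st, mt


# Made-up record, for illustration only.
rec = Record("example.ogg", "en", {"en": "Hello world", "es": "Hola mundo"})
asr, st, mt = derive_examples(rec)
```

With the made-up record above, the file contributes one recognition pair, one speech translation pair, and two machine translation pairs (one per direction). A file with a single transcription would contribute to only one of the speech tasks.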