Aligned audio corpora are fundamental to NLP technologies such as ASR and speech translation, yet they remain scarce for underrepresented languages, hindering those languages' technological integration. This paper introduces a methodology for constructing LoReSpeech, a low-resource speech-to-speech translation corpus. Our approach begins with LoReASR, a sub-corpus of short audio clips aligned with their transcriptions, created through a collaborative platform. Building on LoReASR, long-form audio recordings, such as biblical texts, are aligned using tools like the Montreal Forced Aligner (MFA). LoReSpeech delivers both intra- and inter-language alignments, enabling advances in multilingual ASR systems, direct speech-to-speech translation models, and linguistic preservation efforts, while fostering digital inclusivity. This work is conducted within the Tutlayt AI project (https://tutlayt.fr).