We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations containing 600+ hours of Cantonese audio, the corresponding standard traditional Chinese transcripts, and English translations, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research in settings where the spoken and written forms of the source language differ significantly. Owing to the corpus's large size, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results offer insights into speech recognition and translation research for languages in which non-verbatim or "noisy" transcription is common due to various factors, including vernacular and dialectal speech.