Thousands of the world's languages are in danger of extinction, a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations into a majority language. We propose Wav2Gloss, a task to extract these four annotation components automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all of these annotations, covering 37 languages with standard formatting and train/dev/test splits. We compare end-to-end and cascaded Wav2Gloss methods. Our analysis suggests that pre-trained decoders assist with translation and glossing, that multi-task and multilingual approaches underperform, and that end-to-end systems perform better than cascaded systems, despite the text-only systems' advantages. We provide benchmarks to lay the groundwork for future research on IGT generation from speech.
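To make the four IGT components concrete, the following is a minimal Python sketch of how a single annotated utterance might be represented and how segmentation and gloss lines align morpheme by morpheme. The class name `IGTRecord`, its field names, and the hyphen-delimited morpheme convention are assumptions chosen for illustration, not the Fieldwork data format, and the Spanish example is a textbook-style illustration rather than an item from the corpus.

```python
from dataclasses import dataclass


@dataclass
class IGTRecord:
    """One utterance with the four IGT tiers (hypothetical layout)."""
    transcription: str  # (1) surface transcription
    segmentation: str   # (2) morphemes, hyphen-separated within each word
    gloss: str          # (3) one gloss label per morpheme, same delimiters
    translation: str    # (4) free translation into a majority language

    def aligned_morphemes(self):
        """Pair each morpheme with its gloss label, word by word."""
        seg_words = self.segmentation.split()
        gloss_words = self.gloss.split()
        if len(seg_words) != len(gloss_words):
            raise ValueError("segmentation and gloss differ in word count")
        aligned = []
        for seg_word, gloss_word in zip(seg_words, gloss_words):
            morphs = seg_word.split("-")
            labels = gloss_word.split("-")
            if len(morphs) != len(labels):
                raise ValueError(f"morpheme/gloss mismatch in {seg_word!r}")
            aligned.append(list(zip(morphs, labels)))
        return aligned


# Illustrative example only; not taken from the Fieldwork corpus.
example = IGTRecord(
    transcription="los gatos duermen",
    segmentation="los gato-s duerm-en",
    gloss="DET.PL cat-PL sleep-3PL",
    translation="The cats sleep.",
)

for word in example.aligned_morphemes():
    print(word)
```

Under this framing, a Wav2Gloss system would take audio as input and predict one or more of these four fields; the alignment check above is one simple way to sanity-check that predicted segmentation and gloss tiers agree in structure.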