This paper presents work on restoring punctuation in transcripts produced by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, three of the most widely spoken languages in Singapore. To the best of our knowledge, this is the first system that tackles punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as sequence labeling; this work instead adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary. The approach is similar to the masked language modeling objective used during BERT pre-training, but instead of predicting a masked word, our model predicts masked punctuation. Additionally, we find that using Jieba for word segmentation, rather than relying solely on the built-in SentencePiece tokenizer of XLM-R, significantly improves performance on Mandarin transcripts. Experimental results on the English and Mandarin IWSLT2022 datasets and a Malay news dataset show that the proposed approach achieves state-of-the-art results for Mandarin with a 73.8% F1-score, while maintaining reasonable F1-scores for English and Malay, i.e., 74.7% and 78%, respectively. Our source code, which allows reproducing the results and building a simple web-based demonstration application, is available on GitHub.
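The slot-filling formulation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a punctuation slot (mask token) is interleaved at every word boundary, a model (not shown here) would predict a label for each slot, and the labels are merged back into the transcript. The label set and helper names are assumptions for illustration only.

```python
# Hedged sketch of slot-filling punctuation restoration:
# a mask token is inserted after every word; a model would predict one
# punctuation label per slot, and the labels are merged back into text.

PUNCT_LABELS = ["O", ",", ".", "?"]  # "O" = no punctuation (assumed label set)

def build_masked_input(words, mask_token="<mask>"):
    """Interleave a mask token after every word boundary."""
    pieces = []
    for w in words:
        pieces.append(w)
        pieces.append(mask_token)
    return " ".join(pieces)

def merge_predictions(words, labels):
    """Attach each predicted punctuation label to the preceding word."""
    out = []
    for w, lab in zip(words, labels):
        out.append(w + ("" if lab == "O" else lab))
    return " ".join(out)

words = "how are you i am fine".split()
masked = build_masked_input(words)
print(masked)  # "how <mask> are <mask> you <mask> i <mask> am <mask> fine <mask>"

# Suppose a model predicted these labels for the six slots:
labels = ["O", "O", "?", "O", "O", "."]
print(merge_predictions(words, labels))  # -> "how are you? i am fine."
```

For Mandarin, the same scheme would apply after word segmentation (e.g., with Jieba), since character-level boundaries alone do not correspond to word boundaries.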