Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., the phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and can more accurately predict the offsets of tokens in addition to their onsets. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, the Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
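The label-prior idea described above can be illustrated with a minimal sketch. This is an assumption-laden toy example, not the authors' exact recipe: frame-level log-posteriors are penalized by a scaled log label prior before being fed to the standard CTC loss, so that paths dominated by the high-prior blank symbol score lower. The `prior_scale` weight and the uniform prior initialization are hypothetical; in practice the prior would be estimated from the model's own posteriors over the training data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, N, C = 50, 2, 10  # frames, batch size, classes (index 0 = blank)

# Dummy acoustic-model output: frame-level log-posteriors log p(y|x).
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = F.log_softmax(logits, dim=-1)

# Label prior log p(y), initialized uniformly here for illustration.
log_priors = torch.full((C,), -torch.log(torch.tensor(float(C))))

# Subtract the scaled log prior: log p(y|x) - alpha * log p(y).
# With a realistic (blank-heavy) prior, this boosts paths with fewer blanks.
prior_scale = 0.3  # hypothetical interpolation weight alpha
adjusted = log_probs - prior_scale * log_priors

# Standard CTC loss on the prior-adjusted scores.
targets = torch.randint(1, C, (N, 12))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)
loss = F.ctc_loss(adjusted, targets, input_lengths, target_lengths, blank=0)
loss.backward()
```

With a uniform prior the adjustment is a constant shift per frame; the effect on alignment peakiness only appears once the prior reflects the blank-dominated label distribution of a trained CTC model.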