In this work, we present CTCBERT, a simple but effective method for improving hidden-unit BERT (HuBERT). HuBERT is trained with a frame-level cross-entropy (CE) loss, as in most acoustic model training. CTCBERT instead trains the model with the Connectionist Temporal Classification (CTC) objective after removing duplicated IDs in each masked region. The idea stems from the observation that clustered or aligned IDs can contain significant alignment errors. Because CTC learns alignments implicitly, training with CTC can be more flexible when such misalignment exists. We examine CTCBERT on IDs from HuBERT Iter1, HuBERT Iter2, and PBERT, and CTC training brings consistent improvements over CE training. Furthermore, loading the blank-related parameters during finetuning yields slight improvements. Evaluated on the LibriSpeech 960-100h setting, CTCBERT achieves relative WER improvements of 2%-11% over HuBERT and PBERT on the test-other data.
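As a rough illustration of the two ingredients named in the abstract, the sketch below collapses repeated IDs inside each masked region and then scores a collapsed target with a plain CTC forward pass. It is not the authors' implementation. It assumes that "duplicated IDs" means consecutive repeats, that label 0 is reserved for the CTC blank, and that masked regions are given as (start, end) frame spans with an exclusive end. All function names here are hypothetical.

```python
import math
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank; cluster IDs start at 1


def collapse_region(ids):
    """Drop consecutive duplicate IDs, e.g. [5, 5, 7, 7, 5] -> [5, 7, 5]."""
    return [key for key, _ in groupby(ids)]


def masked_targets(frame_ids, spans):
    """Return one collapsed target per masked span (start, end), end exclusive."""
    return [collapse_region(frame_ids[start:end]) for start, end in spans]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def ctc_neg_log_likelihood(log_probs, target):
    """CTC forward algorithm; log_probs[t][k] is log p(label k at frame t)."""
    # Interleave blanks around the target labels: [b, y1, b, y2, b, ...].
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]
    size = len(ext)

    # At the first frame a path may start on the leading blank or first label.
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][BLANK]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new_alpha = [-math.inf] * size
        for s in range(size):
            candidates = [alpha[s]]  # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])  # advance by one symbol
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = logsumexp(candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # A valid path ends on the last label or on the trailing blank.
    ends = [alpha[-1]] + ([alpha[-2]] if size > 1 else [])
    return -logsumexp(ends)


if __name__ == "__main__":
    frame_ids = [3, 3, 3, 8, 8, 2, 2, 2, 2, 8, 4, 4]
    print(masked_targets(frame_ids, [(0, 5), (7, 12)]))
```

Because the forward pass sums over every frame-to-label alignment consistent with the collapsed target, the loss does not depend on exactly where the boundaries between IDs fall inside a masked region, which is the flexibility the abstract attributes to CTC. A frame-level CE loss would instead penalize each frame against its possibly misaligned ID.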