We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming procedure that combines the encoder outputs with segmental features computed over the MMS and UnSupSeg representations to infer the final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms the Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model performs consistently better than or on par with existing alignment approaches, indicating its potential to scale to the 1,100+ languages supported by MMS without further training.
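To make the decoding step concrete, the following is a minimal sketch of a generic segmental dynamic program that picks word boundaries from per-frame boundary scores plus a segment-level score. It is an illustration under stated assumptions, not the decoder described above: the function name `decode_boundaries`, the `segment_score` callback, and the toy scores are all hypothetical, and the learned weighting of encoder outputs and segmental features is replaced here by a plain sum.

```python
import math


def decode_boundaries(boundary_score, num_words, segment_score):
    """Split frames [0, T) into `num_words` contiguous segments with maximal total score.

    boundary_score[t] is a (hypothetical) log-score for placing a boundary at
    frame index t, for t = 0..T, so the list has length T + 1.
    segment_score(s, e) is a (hypothetical) score for a word spanning [s, e).
    Returns the num_words + 1 boundary indices, starting at 0 and ending at T.
    """
    T = len(boundary_score) - 1
    if num_words < 1 or T < num_words:
        raise ValueError("need at least one frame per word")

    neg_inf = -math.inf
    # best[k][t]: best score of splitting frames [0, t) into k segments.
    best = [[neg_inf] * (T + 1) for _ in range(num_words + 1)]
    # back[k][t]: start frame of the k-th segment in that best split.
    back = [[-1] * (T + 1) for _ in range(num_words + 1)]
    best[0][0] = boundary_score[0]

    for k in range(1, num_words + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                if best[k - 1][s] == neg_inf:
                    continue
                cand = best[k - 1][s] + segment_score(s, t) + boundary_score[t]
                if cand > best[k][t]:
                    best[k][t] = cand
                    back[k][t] = s

    # Backtrack from the final frame to recover the boundaries.
    bounds = [T]
    t = T
    for k in range(num_words, 0, -1):
        t = back[k][t]
        bounds.append(t)
    bounds.reverse()
    return bounds


# Toy example: boundaries are likely at frames 0, 3, and 6; segments are unscored.
scores = [0.0, -5.0, -5.0, 0.0, -5.0, -5.0, 0.0]
print(decode_boundaries(scores, 2, lambda s, e: 0.0))  # [0, 3, 6]
```

In this sketch the boundary term stands in for the encoder's word-boundary probabilities and `segment_score` stands in for the segmental features; the search costs O(K·T²) for K words and T frames, which a practical system would likely bound with a maximum segment length.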