Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (those that yield the ground truth), at the expense of imperfect alignments. This binary distinction between perfect and imperfect alignments fails to capture other alignment properties that matter in real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do so by complementing the CTC loss with an additional loss term that prioritizes alignments according to the desired property. Our method requires no intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and training dataset scale (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative WER improvement of 4.5% over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work at a data scale as large as ours. Notably, our method can be implemented in only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR.
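To make the idea of an additional loss term that prioritizes alignments more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the form of the extra term, so the hinge-style comparison in the log domain, the `shift_left` property function (a stand-in for "emit earlier"), and the `margin` and `weight` parameters are all illustrative assumptions. The CTC loss itself is treated as a number computed elsewhere and is left untouched.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def alignment_log_prob(log_probs, alignment):
    """Sum of per-frame log-probabilities along one frame-level alignment."""
    return sum(frame[tok] for frame, tok in zip(log_probs, alignment))


def shift_left(alignment, blank=BLANK):
    """Hypothetical property function: emit every token one frame earlier.

    The shift is applied only when the first frame is blank, so the
    collapsed output sequence stays the same.
    """
    if alignment[0] != blank:
        return list(alignment)
    return list(alignment[1:]) + [blank]


def property_loss(log_probs, alignments, improve, margin=0.0):
    """Assumed hinge-style term: penalize the model when a sampled alignment
    scores higher than its 'improved' counterpart by more than -margin."""
    losses = []
    for a in alignments:
        better = improve(a)
        gap = alignment_log_prob(log_probs, a) - alignment_log_prob(log_probs, better)
        losses.append(max(0.0, gap + margin))
    return sum(losses) / len(losses)


def total_loss(ctc_loss, aux_loss, weight=1.0):
    """CTC loss is left as-is; the property term is simply added to it."""
    return ctc_loss + weight * aux_loss


# Toy example: 3 frames, vocabulary {blank, token 1}.
probs = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
log_probs = [[math.log(p) for p in frame] for frame in probs]

sampled = [[0, 0, 1]]  # token emitted in the last frame
aux = property_loss(log_probs, sampled, shift_left)
loss = total_loss(ctc_loss=2.0, aux_loss=aux, weight=0.5)
print(aux, loss)
```

In this toy case the sampled alignment emits the token in the last frame, while the shifted one emits it a frame earlier. The extra term is positive because the model currently prefers the later emission, so minimizing it would push probability mass toward the earlier alignment. Swapping `shift_left` for a different function would target a different property without changing the CTC term.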
