Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (those that yield the ground truth) at the expense of imperfect alignments. This binary distinction between perfect and imperfect alignments fails to capture other alignment properties that matter in real-world applications. Here we propose $\textit{Align With Purpose}$, a $\textbf{general Plug-and-Play framework}$ for enhancing a desired property in models trained with the CTC criterion. We do so by complementing the CTC loss with an additional loss term that prioritizes alignments according to the desired property. Our method requires no intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation among both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and training dataset scale (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report a latency improvement of up to 570 ms with a minor reduction in WER, and for the latter, we report a relative WER improvement of 4.5% over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work at a data scale as large as ours. Notably, our method can be implemented in only a few lines of code and can be extended to other alignment-free loss functions and to domains other than ASR.
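To make the idea of an additional, property-driven loss term concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the form of the extra term, so everything here is an assumption made for illustration: alignments are sampled frame by frame from the model's output distribution, a hypothetical property function `shift_left` produces a lower-latency variant of each sampled alignment, and a hinge on log-probabilities penalizes cases where the sampled alignment is more likely than its improved variant. The names `awp_loss`, `total_loss`, `margin`, and `weight` are likewise hypothetical.

```python
import math
import random

BLANK = 0  # assumed index of the CTC blank token


def alignment_log_prob(log_probs, alignment):
    """Sum the per-frame log-probabilities along one frame-level alignment."""
    return sum(log_probs[t][tok] for t, tok in enumerate(alignment))


def sample_alignment(log_probs, rng):
    """Draw one token per frame from the per-frame output distribution."""
    alignment = []
    for frame in log_probs:
        weights = [math.exp(lp) for lp in frame]
        alignment.append(rng.choices(range(len(frame)), weights=weights)[0])
    return alignment


def shift_left(alignment):
    """Hypothetical property function for emission time: move every token one
    frame earlier. Only applied when the first frame is blank, so that no
    emitted token is dropped by the shift."""
    if alignment[0] != BLANK:
        return list(alignment)
    return alignment[1:] + [BLANK]


def hinge_term(log_probs, sampled, improved, margin=0.0):
    """Assumed hinge form: positive when the sampled alignment is more likely
    than its property-improved variant (plus a margin), zero otherwise."""
    gap = alignment_log_prob(log_probs, sampled) - alignment_log_prob(log_probs, improved)
    return max(0.0, gap + margin)


def awp_loss(log_probs, property_fn, num_samples=5, margin=0.0, seed=0):
    """Average the hinge term over alignments sampled from the model output."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        sampled = sample_alignment(log_probs, rng)
        total += hinge_term(log_probs, sampled, property_fn(sampled), margin)
    return total / num_samples


def total_loss(ctc_loss, log_probs, property_fn, weight=0.1, **kwargs):
    """Add the weighted extra term to a CTC loss computed elsewhere; the CTC
    loss itself is left untouched."""
    return ctc_loss + weight * awp_loss(log_probs, property_fn, **kwargs)


# Toy usage: 3 frames, 2 tokens (blank and one label), made-up probabilities.
toy_log_probs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.5), math.log(0.5)],
]
print(total_loss(2.0, toy_log_probs, shift_left, weight=0.1))
```

In a real training setup the same computation would run on the framework's differentiable tensors so that gradients flow through the per-frame log-probabilities. Swapping `shift_left` for a different property function is the only change needed to target another property, which is the sense in which such a term is plug-and-play.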
