Connectionist temporal classification (CTC) is commonly adopted for sequence modeling tasks such as speech recognition, where the order between the input and target sequences must be preserved. However, CTC has so far been applied only to deterministic sequence models, whose latent space is discontinuous and sparse, which makes them less capable of handling data variability than variational models. In this paper, we integrate CTC with a variational model and derive loss functions that can be used to train more generalizable, order-preserving sequence models. Specifically, we derive two versions of the novel variational CTC, each based on a reasonable assumption: the first is that the variational latent variables at each time step are conditionally independent, and the second is that these latent variables are Markovian. We show that both loss functions allow direct optimization of the variational lower bound on the model log-likelihood, and we present computationally tractable forms for implementing them.
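To make the idea of pairing CTC with a variational lower bound concrete, the following is a minimal NumPy sketch and not the loss derived in the paper. It computes the standard CTC log-likelihood with the forward algorithm, then forms a generic single-sample, VAE-style estimate of a negative lower bound. Everything beyond the CTC recursion is an assumption made for illustration: the diagonal Gaussian posterior that factorizes over time steps, the standard normal prior, the linear decoder `W`, and the function name `neg_elbo_estimate`.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def ctc_log_likelihood(log_probs, target, blank=0):
    """Standard CTC forward algorithm: log p(target | frames).

    log_probs: (T, V) array of per-frame log-probabilities.
    target: list of label ids (must not contain the blank id).
    """
    T = log_probs.shape[0]
    # Extended target: blanks before, between, and after the labels.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]

    for t in range(1, T):
        prev = alpha
        alpha = np.full(S, -np.inf)
        for s in range(S):
            terms = [prev[s]]
            if s >= 1:
                terms.append(prev[s - 1])
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(prev[s - 2])
            alpha[s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]

    # Valid paths end in the last label or the trailing blank.
    final = [alpha[S - 1]]
    if S > 1:
        final.append(alpha[S - 2])
    return np.logaddexp.reduce(final)


def neg_elbo_estimate(mu, log_var, W, target, rng):
    """Hypothetical single-sample negative lower bound (illustration only).

    Assumes q(z_t | x) = N(mu_t, diag(exp(log_var_t))), independent over t,
    a standard normal prior on each z_t, and a linear decoder W (assumed).
    """
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps           # reparameterized sample
    log_probs = log_softmax(z @ W)                 # (T, V) frame posteriors
    recon = ctc_log_likelihood(log_probs, target)  # log p(y | z) via CTC
    # Closed-form KL(q || N(0, I)), summed over time steps and dimensions.
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)
    return -recon + kl
```

With hypothetical shapes, calling `neg_elbo_estimate(mu, log_var, W, [1, 2], np.random.default_rng(0))` with `mu` and `log_var` of shape `(T, D)` and `W` of shape `(D, V)` returns a scalar to minimize. A Markovian assumption on the latent variables would change the prior and the KL term, which this sketch does not attempt to model.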