Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, SLT faces an inherent modality gap between sign language videos and spoken language text, which makes cross-modal alignment between the visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to alleviate the cross-modal problem, thereby neglecting direct alignment across modalities, which may compromise results. To address this issue, we propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences that regularize the outputs of the encoder and decoder, respectively. In the prior path, the model relies solely on visual information to predict the target text, whereas in the posterior path it jointly encodes visual information and textual knowledge to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second KL divergence performs self-distillation from the posterior path to the prior path, ensuring the consistency of the decoder outputs. We further enhance the integration of textual information into the posterior path by employing a shared Attention Residual Gaussian Distribution (ARGD), which treats the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments on public datasets (PHOENIX14T and CSL-Daily) demonstrate the effectiveness of our framework, which achieves new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy.
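
To make the two regularizers concrete, the following is a minimal sketch, not the authors' implementation. It assumes diagonal Gaussian latents for the encoder-side KL, a categorical token distribution for the decoder-side KL, and the directions KL(posterior || prior) and KL(teacher = posterior || student = prior). The names `gaussian_kl`, `residual_posterior`, and `categorical_kl`, as well as the offsets `delta_mu` and `delta_logvar`, are hypothetical; the attention mechanism of ARGD is not modeled here, only the idea that the posterior is expressed as a residual on top of the prior.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians (lists of means and log-variances)."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def residual_posterior(mu_prior, logvar_prior, delta_mu, delta_logvar):
    """Assumed residual form: posterior parameters = prior parameters + text-driven offsets."""
    mu_post = [m + d for m, d in zip(mu_prior, delta_mu)]
    logvar_post = [l + d for l, d in zip(logvar_prior, delta_logvar)]
    return mu_post, logvar_post


def categorical_kl(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student) between two token distributions over the same vocabulary."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )


# Prior path: latent parameters predicted from visual features only (toy values).
mu_prior, logvar_prior = [0.2, -0.1], [0.0, 0.0]

# Posterior path: the text contributes a residual on top of the prior (toy values).
mu_post, logvar_post = residual_posterior(mu_prior, logvar_prior, [0.5, 0.0], [0.0, -0.3])

# First KL: regularizes the encoder outputs (posterior against prior).
kl_encoder = gaussian_kl(mu_post, logvar_post, mu_prior, logvar_prior)

# Second KL: self-distillation on decoder outputs (posterior path as teacher).
kl_decoder = categorical_kl([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```

Under this residual form, zero offsets make the posterior coincide with the prior, so the encoder-side KL vanishes; the text can only move the latent distribution away from what the visual features already predict.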
