One of the fundamental representation learning tasks is unsupervised sequential disentanglement, where latent codes of inputs are decomposed into a single static factor and a sequence of dynamic factors. To extract this latent information, existing methods condition the static and dynamic codes on the entire input sequence. Unfortunately, these models often suffer from information leakage, i.e., the dynamic vectors encode both static and dynamic information, or vice versa, leading to a non-disentangled representation. Attempts to alleviate this problem by reducing the dynamic dimension or adding auxiliary loss terms achieve only partial success. Instead, we propose a novel and simple architecture that mitigates information leakage through an effective subtraction inductive bias while conditioning on a single sample. Remarkably, the resulting variational framework is simpler in terms of required loss terms, hyperparameters, and data augmentation. We evaluate our method on multiple data-modality benchmarks, including general time series, video, and audio, and we demonstrate state-of-the-art results on generation and prediction tasks, outperforming several strong baselines.
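The subtraction inductive bias described above can be sketched in a minimal toy form. This is an illustrative assumption, not the paper's actual model: all function names are hypothetical, and the learned encoder is replaced by fixed toy feature vectors. The sketch only shows the core idea that the static code is taken from a single sample and explicitly subtracted from each frame's features to yield the dynamic codes.

```python
# Hedged sketch of a subtraction inductive bias for sequential
# disentanglement. All names are illustrative, not the paper's API.
# Assumption: each frame has already been mapped to a feature vector
# by some encoder; here we fake that step with fixed toy vectors.

def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def disentangle(frame_features):
    """Split per-frame features into one static code and a sequence
    of dynamic codes.

    Static code: taken from a single sample (the first frame), i.e.
    conditioning on one frame rather than the whole sequence.
    Dynamic codes: each frame's feature minus the static code, so the
    shared (static) content is explicitly subtracted out.
    """
    static = frame_features[0]                     # single-sample conditioning
    dynamic = [subtract(f, static) for f in frame_features]
    return static, dynamic

# Toy sequence: a constant "identity" vector plus per-frame motion.
identity = [1.0, 2.0]
frames = [[i + m for i, m in zip(identity, motion)]
          for motion in ([0.0, 0.0], [0.5, -0.5], [1.0, -1.0])]

static, dynamic = disentangle(frames)
print(static)    # → [1.0, 2.0]
print(dynamic)   # → [[0.0, 0.0], [0.5, -0.5], [1.0, -1.0]]
```

In this toy setting the dynamic codes recover exactly the per-frame motion, with no trace of the static identity vector; in the actual variational framework, the analogous operation acts on learned latent codes rather than raw features.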