Dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extend the VAE to model a sequence of observed data and a corresponding sequence of latent vectors. In almost all DVAEs in the literature, the temporal dependencies within each sequence and across the two sequences are modeled with recurrent neural networks. In this paper, we propose to model speech signals with the Hierarchical Transformer DVAE (HiT-DVAE), a DVAE with two levels of latent variables (sequence-wise and frame-wise) in which the temporal dependencies are implemented with the Transformer architecture. We show that HiT-DVAE outperforms several other DVAEs for speech spectrogram modeling while enabling a simpler training procedure, which reveals its high potential for downstream low-level speech processing tasks such as speech enhancement.