We propose a novel approach to time-scale modification of audio signals. Unlike traditional methods, which rely on framing or the short-time Fourier transform to preserve frequency content during temporal stretching, our neural network model encodes raw audio into a high-level latent representation, dubbed Neuralgram, in which each vector represents 1024 audio samples. Because the compression ratio is sufficiently high, we can apply arbitrary spatial interpolation to the Neuralgram to perform temporal stretching. Finally, a learned neural decoder synthesizes the time-scaled audio from the stretched Neuralgram. Both the encoder and the decoder are trained with latent regression losses and adversarial losses to obtain high-fidelity audio. Despite its simplicity, our method performs comparably to existing baselines and opens new possibilities for research on time-scale modification. Audio samples can be found at https://tsmnet-mmasia23.github.io
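The stretching step can be illustrated with a minimal sketch that resamples a sequence of latent vectors along the time axis. This is not the authors' implementation: the abstract does not specify the interpolation kernel, so linear interpolation is assumed here, and the name `stretch_latents`, the `rate` convention, and the list-of-lists latent layout are all hypothetical. The encoder and decoder networks are omitted; only the figure of 1024 samples per vector comes from the abstract.

```python
from typing import List, Sequence

HOP = 1024  # audio samples represented by one latent vector (from the abstract)


def stretch_latents(latents: Sequence[Sequence[float]], rate: float) -> List[List[float]]:
    """Linearly resample a (frames x channels) latent sequence along time.

    Assumed convention: rate > 1 shortens the sequence (faster playback),
    rate < 1 lengthens it (slower playback).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_in = len(latents)
    if n_in == 0:
        return []
    n_out = max(1, round(n_in / rate))
    if n_out == 1 or n_in == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(latents[0]) for _ in range(n_out)]
    out = []
    for j in range(n_out):
        # Map output frame j onto a fractional position in the input sequence.
        pos = j * (n_in - 1) / (n_out - 1)
        lo = min(int(pos), n_in - 2)
        frac = pos - lo
        a, b = latents[lo], latents[lo + 1]
        out.append([(1.0 - frac) * x + frac * y for x, y in zip(a, b)])
    return out


# Toy input: 3 frames with 2 channels each, standing in for 3 * 1024 audio samples.
latents = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
slow = stretch_latents(latents, rate=0.5)  # half speed: twice as many frames
print(len(slow), len(slow) * HOP)
```

Because each latent vector stands for a fixed number of samples, changing the number of vectors changes the duration of the decoded audio, while what each vector encodes is left for the decoder to render. In the proposed method, a learned decoder would then synthesize audio from the resampled sequence.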