We propose a novel approach to time-scale modification of audio signals. Traditional methods rely on the framing technique, while the spectral approach uses the short-time Fourier transform to preserve frequency content during temporal stretching. TSM-Net, our neural-network model, encodes raw audio into a high-level latent representation that we call the Neuralgram, in which one vector represents 1024 audio samples. It is inspired by the framing technique but addresses the clipping artifacts. Because the Neuralgram is a real-valued two-dimensional matrix, we can apply existing image-resizing techniques to it and then decode the result with our neural decoder to obtain the time-scaled audio. Both the encoder and the decoder are trained with GANs, and they show fair generalization ability on scaled Neuralgrams. Our method yields few artifacts and opens a new possibility for research on modern time-scale modification. Audio samples are available at https://ernestchu.github.io/tsm-net-demo/
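The resizing step in the middle of this pipeline can be illustrated with a minimal sketch. It is not the TSM-Net implementation: the `(channels, frames)` layout, the latent width of 4, the function name `resize_time_axis`, and the choice of linear interpolation as the image-resizing technique are all assumptions made for illustration, and a random matrix stands in for the output of the trained encoder.

```python
import numpy as np

SAMPLES_PER_VECTOR = 1024  # from the text: one Neuralgram vector covers 1024 samples


def resize_time_axis(neuralgram: np.ndarray, speed: float) -> np.ndarray:
    """Linearly resample a (channels, frames) matrix along the frame axis.

    speed > 1 yields fewer frames (faster playback);
    speed < 1 yields more frames (slower playback).
    The (channels, frames) layout is an assumption of this sketch.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    _, frames = neuralgram.shape
    new_frames = max(1, round(frames / speed))
    old_pos = np.arange(frames)
    new_pos = np.linspace(0, frames - 1, new_frames)
    # Interpolate each latent channel independently along time.
    return np.stack([np.interp(new_pos, old_pos, row) for row in neuralgram])


# Random stand-in for an encoder output: 4 latent channels, 100 frames (hypothetical sizes).
rng = np.random.default_rng(0)
neuralgram = rng.standard_normal((4, 100))

stretched = resize_time_axis(neuralgram, speed=0.5)
print(stretched.shape)                         # (4, 200)
print(stretched.shape[1] * SAMPLES_PER_VECTOR) # nominal output length in samples
```

Under these assumptions, halving the speed doubles the number of frames, so a decoder that maps each vector back to 1024 samples would produce audio about twice as long. Any other two-dimensional resizing method could be substituted for the linear interpolation used here.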