Audio processing methods based on deep neural networks are typically trained at a single sampling frequency (SF). To handle untrained SFs, signal resampling is commonly employed, but it can degrade performance, particularly when the input SF is lower than the trained SF. This paper investigates the causes of this degradation through two hypotheses: (i) up-sampling fails to restore the missing high-frequency components, and (ii) the mere presence of such components matters more than their precise reconstruction. To examine these hypotheses, we compare conventional resampling with three alternatives: post-resampling noise addition, which adds Gaussian noise to the resampled signal; noisy-kernel resampling, which perturbs the interpolation kernel with Gaussian noise to enrich high-frequency components; and trainable-kernel resampling, which adapts the interpolation kernel through training. Experiments on music source separation show that noisy-kernel and trainable-kernel resampling alleviate the degradation observed with conventional resampling. We further demonstrate that noisy-kernel resampling is effective across diverse models, highlighting it as a simple yet practical option.
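The noisy-kernel idea can be illustrated with a minimal sketch: up-sample by zero-stuffing, then filter with a windowed-sinc interpolation kernel to which Gaussian noise has been added, so the output retains some energy above the original Nyquist frequency. The function name, tap count, window choice, and noise level below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def noisy_kernel_upsample(x, factor=2, taps=64, noise_std=0.01, seed=0):
    """Integer-factor up-sampling with a Gaussian-perturbed windowed-sinc
    kernel (a sketch of 'noisy-kernel resampling'; all hyperparameters
    here are illustrative, not the paper's settings)."""
    rng = np.random.default_rng(seed)
    # Windowed-sinc low-pass interpolation kernel (cutoff at pi/factor).
    n = np.arange(-taps, taps + 1)
    kernel = np.sinc(n / factor) * np.hamming(2 * taps + 1)
    # Normalize so the DC gain compensates for zero-stuffing.
    kernel *= factor / kernel.sum()
    # Perturb the kernel so the output is not strictly band-limited,
    # leaving some high-frequency energy above the original Nyquist.
    kernel = kernel + noise_std * rng.standard_normal(kernel.shape)
    # Zero-stuff, then filter with the noisy kernel.
    up = np.zeros(len(x) * factor)
    up[::factor] = x
    return np.convolve(up, kernel, mode="same")
```

With `noise_std=0` this reduces to conventional windowed-sinc resampling; the noise term is what distinguishes the noisy-kernel variant from post-resampling noise addition, which would instead add noise to the filtered output.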