What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. These baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal approximations. In this article, we approach this phenomenon from the perspective of initialization. We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights. We find that deviations worsen for large filters and locally periodic input signals, both of which are typical of audio signal processing applications. Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law relating the number of filters to their length, which is reminiscent of discrete wavelet bases.
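To make the quantity studied here concrete, the following is a minimal sketch of how one might estimate the condition number of a randomly initialized FIR filterbank. It is not the authors' simulation code. It assumes stride-1 circular convolution, for which the frame bounds are the minimum and maximum over frequency of the summed squared magnitude responses, and it assumes a weight variance of 1/(JT) for J filters of length T so that the expected energy gain is one. The function names and the parameter values in the usage loop are hypothetical.

```python
import numpy as np


def random_gaussian_filterbank(num_filters, filter_length, rng):
    """Draw i.i.d. Gaussian FIR taps, shape (num_filters, filter_length).

    Assumption: variance 1 / (num_filters * filter_length), so that the
    expected total energy of the filterbank equals one.
    """
    scale = 1.0 / np.sqrt(num_filters * filter_length)
    return rng.normal(loc=0.0, scale=scale, size=(num_filters, filter_length))


def littlewood_paley_sum(filters, signal_length):
    """Sum over filters of the squared magnitude response at each DFT bin."""
    filters = np.asarray(filters, dtype=float)
    if signal_length < filters.shape[1]:
        raise ValueError("signal_length must be at least the filter length")
    # Zero-pad each filter to the signal length before taking the DFT.
    responses = np.fft.fft(filters, n=signal_length, axis=1)
    return np.sum(np.abs(responses) ** 2, axis=0)


def condition_number(filters, signal_length):
    """Ratio of upper to lower frame bound under stride-1 circular convolution."""
    lp = littlewood_paley_sum(filters, signal_length)
    return lp.max() / lp.min()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal_length = 1024
    # Hypothetical sweep: vary the filter length for a fixed number of filters.
    for filter_length in (4, 16, 64, 256):
        trials = [
            condition_number(
                random_gaussian_filterbank(16, filter_length, rng), signal_length
            )
            for _ in range(50)
        ]
        print(filter_length, float(np.median(trials)))
```

A condition number of one corresponds to a tight frame, in which every input frequency is passed with the same total energy. Larger values indicate that some frequencies are strongly attenuated relative to others at initialization.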