What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, these networks often fail to outperform hand-crafted baselines. The baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal approximations. In this article, we approach this phenomenon from the perspective of initialization. We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights. We find that deviations worsen for large filters and locally periodic input signals, both of which are typical of audio signal processing applications. Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law between the number and length of the filters, which is reminiscent of discrete wavelet bases.
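The kind of numerical simulation mentioned above can be illustrated with a minimal sketch. It is not the authors' code, and it rests on several assumptions that the abstract does not state: circular convolution with stride 1, i.i.d. Gaussian weights of variance 1/(JT) for J filters of length T, and a condition number defined as the ratio between the largest and smallest values of the summed squared frequency responses (the eigenvalues of the frame operator under these assumptions). The function name and all parameter values are hypothetical.

```python
import numpy as np


def filterbank_condition_number(num_filters, filter_length, signal_length, rng):
    """Condition number of a random Gaussian FIR filterbank (illustrative sketch).

    Assumes circular convolution with stride 1, so that the frame operator is
    diagonal in the Fourier basis with entries sum_j |w_hat_j(omega)|^2.
    """
    # i.i.d. Gaussian weights; the variance 1/(J*T) is an assumed normalization
    # that makes the expected energy gain equal to one.
    scale = 1.0 / np.sqrt(num_filters * filter_length)
    w = rng.normal(scale=scale, size=(num_filters, filter_length))
    # Zero-pad each filter to the signal length and take its DFT.
    w_hat = np.fft.fft(w, n=signal_length, axis=1)
    # Sum of squared magnitude responses across filters, one value per frequency.
    lp_sum = np.sum(np.abs(w_hat) ** 2, axis=0)
    # Ratio of the upper to the lower frame bound.
    return lp_sum.max() / lp_sum.min(), lp_sum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for filter_length in (4, 16, 64, 256):
        kappa, _ = filterbank_condition_number(
            num_filters=16, filter_length=filter_length, signal_length=1024, rng=rng
        )
        print(f"T={filter_length:4d}  condition number={kappa:.2f}")
```

Averaging the result over many random draws, and sweeping both the number and the length of the filters, would be one way to probe the scaling behaviour described in the abstract; a single draw as printed here is only indicative.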