We consider the problem of learning an arbitrarily biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, the basic algorithmic question of whether a single arbitrary ReLU neuron is learnable in the non-realizable setting remains open. In particular, all existing polynomial-time algorithms provide approximation guarantees only for the better-behaved unbiased setting or the restricted-bias setting. Our main result is a polynomial-time statistical query (SQ) algorithm that gives the first constant-factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial-time CSQ algorithm can achieve a constant-factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
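To make the learning setup concrete, below is a minimal numerical sketch of the squared-loss objective for a biased ReLU over Gaussian marginals. This illustrates only the objective being minimized, not the paper's SQ algorithm; the data-generating choices (`w_star`, `b_star`, the additive noise) are hypothetical assumptions standing in for an arbitrary non-realizable label distribution.

```python
import numpy as np

def relu(z):
    """ReLU activation applied elementwise."""
    return np.maximum(z, 0.0)

def squared_loss(w, b, X, y):
    """Empirical squared loss of the neuron x -> ReLU(<w, x> + b)."""
    preds = relu(X @ w + b)
    return np.mean((preds - y) ** 2)

# Illustrative setup (assumed names): x ~ N(0, I_d), labels that need not
# come from any exact ReLU (non-realizable setting).
rng = np.random.default_rng(0)
d, n = 10, 100_000
X = rng.standard_normal((n, d))        # Gaussian marginals
w_star = rng.standard_normal(d)
b_star = -2.0                          # arbitrary (possibly large) bias
noise = 0.1 * rng.standard_normal(n)   # label noise; makes the problem
y = relu(X @ w_star + b_star) + noise  # non-realizable in general

# OPT denotes the minimum of squared_loss over all (w, b); the SQ
# algorithm guarantees a hypothesis with loss O(OPT) + eps.
print(squared_loss(w_star, b_star, X, y))
```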