The random feature model with a nonlinear activation function has been shown to be asymptotically equivalent to a Gaussian model in terms of training and generalization errors. Analysis of the equivalent model reveals an important, yet not fully understood, role played by the activation function. To address this issue, we study the "parameters" of the equivalent model in order to improve generalization performance for a given supervised learning problem. We show that the parameters obtained from the Gaussian model enable us to define a set of optimal nonlinearities. We provide two example classes from this set: second-order polynomials and piecewise linear functions. Any function in this set is optimized to improve generalization performance, regardless of its specific functional form. We experiment with regression and classification problems on both synthetic and real (e.g., CIFAR10) data. Our numerical results validate that the optimized nonlinearities achieve better generalization performance than widely used nonlinear functions such as ReLU. Furthermore, we illustrate that the proposed nonlinearities also mitigate the so-called double descent phenomenon, that is, the non-monotonic dependence of generalization performance on the sample size and the model size.
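As an illustration of how an activation function can be summarized by a few scalar parameters, the following is a minimal sketch. It rests on an assumption that the abstract does not state: that the "parameters" can be taken to be the low-order Gaussian moments of the activation, written here as `mu0`, `mu1`, and `mu_star`, a convention common in the Gaussian equivalence literature. The function names `gaussian_moments` and `quadratic_from_moments` are hypothetical, and the quadratic construction is only one way a second-order polynomial could be matched to given target values, not necessarily the procedure used in this work.

```python
import numpy as np


def gaussian_moments(sigma, order=80):
    """Estimate, for z ~ N(0, 1), the assumed summary parameters
    mu0 = E[sigma(z)], mu1 = E[z * sigma(z)], and
    mu_star = sqrt(E[sigma(z)^2] - mu0^2 - mu1^2)."""
    # Probabilists' Gauss-Hermite rule: weight function exp(-x^2 / 2).
    x, w = np.polynomial.hermite_e.hermegauss(order)
    w = w / np.sqrt(2.0 * np.pi)  # normalize so the weights sum to 1
    s = sigma(x)
    mu0 = np.sum(w * s)
    mu1 = np.sum(w * x * s)
    second_moment = np.sum(w * s**2)
    # Clip at zero to guard against small negative values from round-off.
    mu_star = np.sqrt(max(second_moment - mu0**2 - mu1**2, 0.0))
    return mu0, mu1, mu_star


def quadratic_from_moments(mu0, mu1, mu_star):
    """Hypothetical helper: build a second-order polynomial whose Gaussian
    moments equal the given targets.

    Uses sigma(z) = mu0 + mu1 * z + c * (z^2 - 1). Since z^2 - 1 is
    orthogonal to 1 and z under N(0, 1) and E[(z^2 - 1)^2] = 2,
    choosing c = mu_star / sqrt(2) matches all three targets."""
    c = mu_star / np.sqrt(2.0)
    return lambda z: mu0 + mu1 * z + c * (z**2 - 1.0)


if __name__ == "__main__":
    relu = lambda z: np.maximum(z, 0.0)
    print("ReLU moments:", gaussian_moments(relu))

    # Target values below are arbitrary placeholders, not results from the paper.
    quad = quadratic_from_moments(0.1, 0.7, 0.3)
    print("Quadratic moments:", gaussian_moments(quad))
```

Under this assumption, two activations with the same three moments would be indistinguishable to the equivalent Gaussian model, which is consistent with the claim that performance does not depend on the specific functional form. The quadrature is exact for polynomials but only approximate for non-smooth functions such as ReLU.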