Carefully designed activation functions can improve the performance of neural networks in many machine learning tasks. However, it is difficult for humans to construct optimal activation functions, and current activation function search algorithms are prohibitively expensive. This paper aims to improve the state of the art through three steps: First, the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions. Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization. More specifically, the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization and the activation function's output distribution were found to be highly predictive of performance. Third, the surrogate was used to discover improved activation functions in several real-world tasks, with a surprising finding: a sigmoidal design that outperformed all other activation functions was discovered, challenging the status quo of always using rectifier nonlinearities in deep learning. Each of these steps is a contribution in its own right; together they serve as a practical and theoretical foundation for further research on activation function optimization.
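To make the Fisher information feature above concrete, the following is a minimal sketch of how the eigenvalue spectrum of the Fisher information matrix (FIM) at initialization might be computed for a given activation function. It is not the paper's procedure, which the abstract does not specify. It assumes a one-hidden-layer MLP with a softmax output, standard Gaussian inputs, Gaussian weight initialization, and a finite-difference Jacobian. The names `mlp_logits` and `fim_eigenvalues` and all default sizes are hypothetical choices for illustration.

```python
import numpy as np


def mlp_logits(theta, x, act, d_in, d_hidden, n_classes):
    """Logits of a one-hidden-layer MLP; theta is a flat parameter vector."""
    split = d_hidden * d_in
    w1 = theta[:split].reshape(d_hidden, d_in)
    w2 = theta[split:].reshape(n_classes, d_hidden)
    return act(x @ w1.T) @ w2.T  # shape (n_samples, n_classes)


def fim_eigenvalues(act, d_in=4, d_hidden=8, n_classes=3,
                    n_samples=64, seed=0, eps=1e-5):
    """Eigenvalues (descending) of the FIM at a random initialization.

    Assumption: inputs are standard Gaussian and the predictive
    distribution is a softmax over the logits.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d_in))

    # Random initialization, scaled by fan-in (an illustrative choice).
    w1 = rng.standard_normal(d_hidden * d_in) / np.sqrt(d_in)
    w2 = rng.standard_normal(n_classes * d_hidden) / np.sqrt(d_hidden)
    theta = np.concatenate([w1, w2])
    n_params = theta.size

    def f(t):
        return mlp_logits(t, x, act, d_in, d_hidden, n_classes)

    # Jacobian of logits w.r.t. parameters via central differences,
    # so that any elementwise activation can be plugged in.
    jac = np.empty((n_samples, n_classes, n_params))
    for p in range(n_params):
        step = np.zeros(n_params)
        step[p] = eps
        jac[:, :, p] = (f(theta + step) - f(theta - step)) / (2 * eps)

    # Softmax probabilities (numerically stabilized).
    z = f(theta)
    z = z - z.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)

    # Fisher information of a categorical w.r.t. its logits: diag(p) - p p^T.
    fisher_logits = (probs[:, :, None] * np.eye(n_classes)
                     - probs[:, :, None] * probs[:, None, :])

    # FIM = mean over inputs of J^T (diag(p) - p p^T) J.
    fim = np.einsum("ncp,ncd,ndq->pq", jac, fisher_logits, jac) / n_samples
    return np.linalg.eigvalsh(fim)[::-1]
```

Calling `fim_eigenvalues(np.tanh)` and `fim_eigenvalues(lambda v: np.maximum(v, 0.0))` returns one spectrum per activation function. A surrogate of the kind described above could take such spectra as input features, although the feature construction used in the paper may differ.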