Is the lottery ticket phenomenon an idiosyncrasy of gradient-based training, or does it generalize to evolutionary optimization? In this paper, we establish the existence of highly sparse trainable initializations for evolution strategies (ES) and characterize qualitative differences from gradient descent (GD)-based sparse training. We introduce a novel signal-to-noise iterative pruning procedure, which incorporates loss curvature information into the network pruning step. With black-box evolution, this procedure can uncover even sparser trainable network initializations than GD-based optimization does. Furthermore, we find that these initializations encode an inductive bias that transfers across different ES, to related tasks, and even to GD-based training. Finally, we compare the local optima resulting from the different optimization paradigms and sparsity levels. In contrast to GD, ES explore diverse and flat local optima and do not preserve linear mode connectivity across sparsity levels and independent runs. These results highlight qualitative differences between evolutionary and gradient-based learning dynamics, which the study of iterative pruning procedures can uncover.
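To make the signal-to-noise pruning step more concrete, the following is a minimal sketch of one pruning round. The abstract does not define the signal-to-noise ratio, so this sketch assumes it is the absolute value of each weight's search-distribution mean divided by its standard deviation, as maintained by a diagonal-Gaussian ES. The function name `snr_prune_mask`, its parameters, and the toy numbers are hypothetical and chosen only for illustration.

```python
import math


def snr_prune_mask(mean, std, mask, prune_fraction):
    """Return a new mask with the lowest-SNR active weights removed.

    Assumption (not stated in the abstract): SNR_i = |mean_i| / std_i,
    where mean and std describe a per-weight Gaussian search distribution.
    """
    # Only weights that are still active can be pruned in this round.
    active = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(math.floor(prune_fraction * len(active)))

    # Small constant guards against division by zero for collapsed std.
    snr = {i: abs(mean[i]) / (std[i] + 1e-8) for i in active}

    # Drop the n_prune active weights with the smallest SNR.
    to_drop = sorted(active, key=lambda i: snr[i])[:n_prune]
    new_mask = list(mask)
    for i in to_drop:
        new_mask[i] = False
    return new_mask


# Toy example: five weights, prune half of the active ones (rounded down).
mean = [0.9, -0.05, 0.4, 0.01, -1.2]
std = [0.1, 0.5, 0.2, 0.3, 0.4]
mask = [True] * 5
new_mask = snr_prune_mask(mean, std, mask, prune_fraction=0.5)
print(new_mask)
```

In an iterative procedure of the lottery-ticket kind, a round like this would presumably alternate with retraining the surviving weights from their original initialization. Unlike pure magnitude pruning, a weight with a large mean but an equally large standard deviation can still be removed, which is one plausible way that curvature-like information could enter the pruning decision.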