The top-k operator returns a k-sparse vector whose non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate into neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based on either regularization or perturbation techniques. However, to date, no approach is both fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and makes it possible, for example, to express top-k operators that select values by magnitude. On the algorithmic side, in addition to pool adjacent violators (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.
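To make the starting point concrete, the following minimal sketch implements the hard (unrelaxed) top-k operator and its magnitude variant in plain Python. It is not the relaxed operator proposed here; the function names and the tie-breaking rule (lowest index wins) are assumptions made for illustration only.

```python
def hard_top_k(x, k):
    """Return a copy of x that keeps its k largest values and zeroes the rest."""
    # Indices of the k largest entries; ties go to the lowest index (an assumption).
    keep = set(sorted(range(len(x)), key=lambda i: (-x[i], i))[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def hard_top_k_magnitude(x, k):
    """Variant that keeps the k entries with the largest absolute value."""
    keep = set(sorted(range(len(x)), key=lambda i: (-abs(x[i]), i))[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


x = [0.5, -2.0, 1.5, 1.0]
print(hard_top_k(x, 2))            # [0.0, 0.0, 1.5, 1.0]
print(hard_top_k_magnitude(x, 2))  # [0.0, -2.0, 1.5, 0.0]

# Discontinuity: a tiny change in the input moves the selected entry.
print(hard_top_k([1.0, 1.001], 1))  # [0.0, 1.001]
print(hard_top_k([1.001, 1.0], 1))  # [1.001, 0.0]
```

The last two calls show why the operator is hard to train through: an arbitrarily small perturbation that swaps the ordering produces a jump in the output.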
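For readers unfamiliar with PAV, the sketch below shows the textbook pool adjacent violators procedure for the simplest isotonic problem, least-squares regression under a non-decreasing constraint. This is only an illustration of the algorithm family: the isotonic problems arising from the p-norm regularization may have a different objective, and the function name `pav_nondecreasing` is hypothetical.

```python
def pav_nondecreasing(y):
    """Minimize sum((v_i - y_i)**2) subject to v_1 <= v_2 <= ... <= v_n."""
    blocks = []  # each block stores [sum of its values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Pool adjacent blocks while the earlier block's mean exceeds the later one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand every block back to one mean value per original position.
    out = []
    for total, count in blocks:
        out.extend([total / count] * count)
    return out


print(pav_nondecreasing([1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```

PAV processes the input sequentially, merging blocks as violations appear, which is the property that makes a parallel alternative attractive on GPU/TPU hardware.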