Training Non-Differentiable Networks via Optimal Transport

Neural networks increasingly embed non-differentiable components (spiking neurons, quantized layers, discrete routing, black-box simulators, etc.) where backpropagation is inapplicable and surrogate gradients introduce bias. We present PolyStep, a gradient-free optimizer that updates parameters using only forward passes. Each step evaluates the loss at structured polytope vertices in a compressed subspace, computes softmax-weighted assignments over the resulting cost matrix, and displaces particles toward low-cost vertices via barycentric projection. This update corresponds to the one-sided limit of a regularized optimal-transport problem, inheriting its geometric structure without Sinkhorn iterations. PolyStep trains genuinely non-differentiable models on which existing gradient-free methods collapse to near-random accuracy. On hard-LIF spiking networks we reach 93.4% test accuracy, outperforming all gradient-free baselines by over 60 percentage points (pp) and coming within 4.4 pp of a surrogate-gradient Adam ceiling. Across four additional non-differentiable architectures (int8 quantization, argmax attention, staircase activations, hard MoE routing) we outperform every gradient-free competitor. On MAX-SAT scaling from 100 to 1M variables, we sustain above 92% clause satisfaction while evolution strategies drop by 8 to 12 pp. On RL policy search, we match OpenAI-ES on classical control and retain performance under integer and binary quantization, which collapses gradient-based methods. We prove convergence to conservative-stationary points at rate $O(\log T/\sqrt{T})$ on piecewise-smooth losses; the guarantee is upgraded to Clarke-stationary points on the headline architectures and extended to the piecewise-constant regime via a hitting-time bound. These rates match the known zeroth-order query-complexity lower bounds that all forward-only methods inherit. Code is available at https://github.com/anindex/polystep.
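The update described above (vertex evaluation, softmax weighting, barycentric projection) can be illustrated with a minimal NumPy sketch. It is not taken from the PolyStep repository and makes several assumptions the abstract does not confirm: the polytope is a cross-polytope, the compressed subspace is a random orthonormal basis redrawn each step, and the names `polytope_step`, `subspace_dim`, `radius`, `temperature`, and `quantized_loss` are all hypothetical.

```python
import numpy as np


def polytope_step(loss_fn, particles, rng, subspace_dim=4, radius=0.5, temperature=0.1):
    """One forward-only update for a batch of particles with shape [n, d].

    Illustrative sketch only; every design choice here is an assumption.
    """
    n, d = particles.shape
    # Assumed compression: a random orthonormal basis of a k-dim subspace, shape [d, k].
    basis, _ = np.linalg.qr(rng.standard_normal((d, subspace_dim)))
    # Assumed polytope: cross-polytope vertices +/- radius * b_i, shape [2k, d].
    vertices = radius * np.concatenate([basis.T, -basis.T], axis=0)
    # Cost matrix: loss of each particle displaced to each vertex, shape [n, 2k].
    candidates = particles[:, None, :] + vertices[None, :, :]
    costs = np.array([[loss_fn(c) for c in row] for row in candidates])
    # Row-wise softmax over negative costs (shifted by the row minimum for stability).
    logits = -(costs - costs.min(axis=1, keepdims=True)) / temperature
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Barycentric projection: each particle moves to the weighted mean of its candidates.
    return particles + weights @ vertices


def quantized_loss(x):
    # Piecewise-constant objective: its gradient is zero almost everywhere.
    return float(np.abs(np.round(x)).sum())


rng = np.random.default_rng(0)
particles = rng.uniform(-3.0, 3.0, size=(4, 16))
before = [quantized_loss(p) for p in particles]
for _ in range(100):
    particles = polytope_step(quantized_loss, particles, rng)
after = [quantized_loss(p) for p in particles]
print("loss before:", before)
print("loss after: ", after)
```

Because the vertices come in symmetric pairs, a particle sitting on a flat plateau receives equal weights and does not move; it is displaced only when some vertex reaches a lower-cost region. Only forward evaluations of `loss_fn` are used, so the objective needs no gradient.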
