Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends the Baur-Strassen cheap gradient principle for smooth programs. We illustrate our results by establishing fast backpropagation results for conservative gradients through feedforward neural networks with standard activation and loss functions. The cheapness of nonsmooth backpropagation contrasts with competing forward approaches, which have, to this day, dimension-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.
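The contrast between the two modes can be seen on a toy example. The following is a minimal sketch, not the cost model of the paper: it counts arithmetic operations for a one-hidden-layer ReLU network $f(x) = \sum_j w_j \,\mathrm{relu}(\sum_i W_{ji} x_i)$, differentiates it once by a backward sweep and once by $n$ forward directional derivatives, and reports both costs relative to one evaluation of $f$. The network shape, the operation-counting rule, the helper names, and the convention $\mathrm{relu}'(0) = 0$ are all assumptions made for illustration.

```python
import random


def relu(t):
    return t if t > 0.0 else 0.0


def relu_prime(t):
    # Assumed convention for this sketch: relu'(0) = 0.
    return 1.0 if t > 0.0 else 0.0


def forward(W1, w2, x):
    """Evaluate f(x); return value, pre-activations, and an operation count."""
    ops, pre, val = 0, [], 0.0
    for row in W1:
        s = 0.0
        for a, xi in zip(row, x):
            s += a * xi
            ops += 2  # one multiply, one add
        pre.append(s)
    for c, s in zip(w2, pre):
        val += c * relu(s)
        ops += 3  # relu, multiply, add
    return val, pre, ops


def backward(W1, w2, pre, n):
    """Backward mode: a single sweep yields all n partial derivatives."""
    ops, grad = 0, [0.0] * n
    for row, c, s in zip(W1, w2, pre):
        delta = c * relu_prime(s)
        ops += 2  # relu_prime, multiply
        for i, a in enumerate(row):
            grad[i] += delta * a
            ops += 2  # one multiply, one add
    return grad, ops


def jvp(W1, w2, x, v):
    """Forward mode: one directional derivative of f at x along v."""
    ops, d = 0, 0.0
    for row, c in zip(W1, w2):
        s, ds = 0.0, 0.0
        for a, xi, vi in zip(row, x, v):
            s += a * xi    # value
            ds += a * vi   # tangent
            ops += 4
        d += c * relu_prime(s) * ds
        ops += 4  # relu_prime, two multiplies, one add
    return d, ops


def overheads(n, m=8, seed=0):
    """Return both gradients and the cost of each mode relative to one evaluation."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
    w2 = [rng.uniform(-1, 1) for _ in range(m)]
    x = [rng.uniform(-1, 1) for _ in range(n)]

    _, pre, f_ops = forward(W1, w2, x)
    g_back, b_ops = backward(W1, w2, pre, n)

    g_fwd, fm_ops = [], 0
    for k in range(n):  # one forward sweep per coordinate direction
        e = [0.0] * n
        e[k] = 1.0
        d, ops = jvp(W1, w2, x, e)
        g_fwd.append(d)
        fm_ops += ops

    return g_back, g_fwd, (f_ops + b_ops) / f_ops, fm_ops / f_ops


if __name__ == "__main__":
    for n in (4, 32, 128):
        _, _, back_ratio, fwd_ratio = overheads(n)
        print(n, round(back_ratio, 2), round(fwd_ratio, 2))
```

Under this counting rule the backward ratio stays below 2 whatever the input dimension $n$, while the forward ratio grows roughly like $2n$. This only illustrates the phenomenon on one architecture; the general statement for conservative gradients is the subject of the paper.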