Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $φ$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(φ\circ f)$, respectively. We further extend this result: given $φ_1,\dots,φ_n$, $D^\mathtt{AD}(φ_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.
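To make the gap between real-valued analysis and machine computation concrete, the following minimal sketch shows two ways in which floating-point values and AD-computed derivatives can come apart. It is an illustration under stated assumptions, not the construction used in this work. It uses a small forward-mode dual-number scheme as a stand-in for $D^\mathtt{AD}$, Python's double-precision floats, and the common convention that the derivative of $\mathrm{ReLU}$ at $0$ is $0$. The names `Dual`, `ad`, `relu`, `f`, and `phi`, as well as the constant `1e16`, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A (value, tangent) pair; both fields are ordinary Python floats (binary64)."""
    val: float
    tan: float

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.tan - other.tan)

    def __neg__(self):
        return Dual(-self.val, -self.tan)


def _lift(x):
    # Constants carry a zero tangent.
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def relu(x):
    # Assumed convention: the AD rule uses relu'(0) = 0.
    return x if x.val > 0.0 else Dual(0.0, 0.0)


def ad(fn, x):
    """Return (fn(x), derivative reported by forward-mode AD) at the float x."""
    out = fn(Dual(float(x), 1.0))
    return out.val, out.tan


def f(x):
    # Over the reals, relu(x) - relu(-x) = x, so the true derivative is 1 everywhere.
    return relu(x) - relu(-x)


def phi(y):
    # Over the reals this is the identity; in binary64, 1.0 + 1e16 rounds back to 1e16.
    return (y + 1e16) - 1e16


print(ad(f, 2.0))    # away from the kink, AD agrees with the real derivative
print(ad(f, 0.0))    # at the kink, AD reports the convention-dependent value
print(ad(phi, 1.0))  # the value is lost to rounding, yet AD still reports slope 1
```

In the last call the computed value is `0.0` because of rounding, while the reported derivative is `1.0`. The output of AD is therefore determined by the sequence of machine operations rather than by the function the floating-point program actually computes, which is why values and AD gradients have to be analyzed separately in the floating-point setting.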