Adversarial attacks are usually expressed as a gradient-based operation on the input data and the model, which requires heavy computation every time an attack is generated. In this work, we formalize the idea of representing adversarial attacks as a trainable function, so that generating an attack needs no further gradient computation. We first argue that the theoretically best attacks can, under proper conditions, be represented as smooth piecewise functions (piecewise H\"older functions). We then obtain an approximation result for such functions by a neural network. Subsequently, we emulate the ideal attack process with a neural network and reduce adversarial training to a mathematical game between an attack network and a training model (a defense network). We also obtain convergence rates of the adversarial loss in terms of the sample size $n$ for adversarial training in this setting.
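As an illustration only, the game between the attack network and the defense network can be sketched as an empirical min-max problem. The notation here is assumed and does not come from the text above: $f_\theta$ denotes the defense network, $g_\phi$ the attack network, $\ell$ a loss function, $(x_i, y_i)_{i=1}^{n}$ the training sample, and $\delta$ a perturbation budget in some norm $\|\cdot\|_p$. Whether the attack network also receives the label $y$ is likewise an assumption of this sketch.

\[
\min_{\theta} \; \max_{\phi} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl( f_\theta\bigl( x_i + g_\phi(x_i, y_i) \bigr),\, y_i \bigr)
\quad \text{subject to} \quad \| g_\phi(x, y) \|_p \le \delta \ \text{for all } (x, y).
\]

In this form, generating an attack for a new input amounts to a single forward pass through $g_\phi$ rather than a gradient computation on $f_\theta$.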