Neural networks are trained primarily based on their inputs and outputs, without regard for their internal mechanisms. These neglected mechanisms determine properties that are critical for safety, like (i) transparency; (ii) the absence of sensitive information or harmful capabilities; and (iii) reliable generalization of goals beyond the training distribution. To address this shortcoming, we introduce gradient routing, a training method that isolates capabilities to specific subregions of a neural network. Gradient routing applies data-dependent, weighted masks to gradients during backpropagation. These masks are supplied by the user in order to configure which parameters are updated by which data points. We show that gradient routing can be used to (1) learn representations which are partitioned in an interpretable way; (2) enable robust unlearning via ablation of a pre-specified network subregion; and (3) achieve scalable oversight of a reinforcement learner by localizing modules responsible for different behaviors. Throughout, we find that gradient routing localizes capabilities even when applied to a limited, ad-hoc subset of the data. We conclude that the approach holds promise for challenging, real-world applications where quality data are scarce.
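The core mechanism can be illustrated with a minimal sketch: a data-dependent mask zeroes the gradient for all but a chosen parameter subregion, so each data point updates only the region it is routed to. This is an illustrative toy (a linear model with two weight columns as "subregions"), not the paper's implementation; the names `step` and `route` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Columns 0 and 1 of W play the role of two network "subregions".
W = rng.normal(size=(4, 2)) * 0.1

def step(x, y, route, lr=0.1):
    """One SGD step on pred = sum(x @ W), with the gradient routed
    (via a binary mask) into column `route` only."""
    global W
    pred = (x @ W).sum()
    grad_pred = pred - y                          # d(0.5*(pred - y)^2)/d pred
    grad_W = np.outer(x, np.ones(2)) * grad_pred  # full, unmasked gradient
    mask = np.zeros(2)
    mask[route] = 1.0                             # data-dependent weighted mask
    W -= lr * grad_W * mask                       # masked update: one column moves

x = rng.normal(size=4)
before = W.copy()
step(x, y=1.0, route=0)
# Column 0 receives the update; column 1 is left untouched,
# so the capability learned from this datapoint is localized to column 0.
```

Ablating the unrouted subregion afterwards (e.g. zeroing column 0) then removes the behavior learned from the routed data, which is the lever the abstract's unlearning and oversight results rely on.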