Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low-dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e., having clean labels! We empirically demonstrate the phenomenon, with SSMs trained both independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, particularly in large language models, we believe significant efforts should be invested in further delineating their susceptibility to clean-label poisoning, and in developing methods for overcoming this susceptibility.