As ML models become increasingly complex and integral to high-stakes domains such as finance and healthcare, they also become more susceptible to sophisticated adversarial attacks. We investigate the threat posed by undetectable backdoors in models developed by malicious external expert firms. Such a backdoor allows the model's designer to sell users information on how to perturb the least significant bits of an input so that its classification flips to a favorable outcome. We develop a general strategy for planting a backdoor in a neural network while guaranteeing that, even with full access to the model's weights and architecture, the backdoor's existence remains undetectable. To achieve this, we draw on cryptographic tools such as digital signatures and indistinguishability obfuscation. We further introduce the notion of undetectable backdoors in language models and extend our neural-network backdoor attacks to these models, relying on the existence of steganographic functions.
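To make the trigger mechanism concrete, the following is a minimal, hypothetical Python sketch of a signature-triggered backdoor, not the paper's actual construction: an HMAC stands in for the public-key signature scheme, the verifier is a wrapper around the classifier rather than being compiled into its weights, and all names (`BACKDOOR_KEY`, `SIG_BITS`, `activate_backdoor`) are illustrative. The key idea it demonstrates is that only someone holding the secret key can produce an input whose least significant bits verify, so the backdoor fires on attacker-crafted inputs and is never triggered by honest ones.

```python
# Hypothetical sketch of a signature-triggered backdoor. The HMAC below is a
# stand-in for the digital-signature scheme in the paper, and the wrapper
# structure is for exposition only: in the actual construction the verifier
# is hidden inside the network itself.
import hmac
import hashlib

import numpy as np

BACKDOOR_KEY = b"attacker-held secret key"  # illustrative; the paper uses a signing key
SIG_BITS = 128  # number of least significant bits carrying the "signature"


def extract_payload(x_uint8: np.ndarray):
    """Split a flat uint8 input into its high bits (the 'message') and the LSB channel."""
    high = (x_uint8 & 0xFE).tobytes()        # everything except the least significant bits
    lsbs = (x_uint8 & 0x01)[:SIG_BITS]       # first SIG_BITS least significant bits
    tag = np.packbits(lsbs).tobytes()
    return high, tag


def backdoored_predict(model, x_uint8: np.ndarray) -> int:
    """Return the attacker's target class iff the LSBs hold a valid tag; else behave honestly."""
    high, tag = extract_payload(x_uint8)
    expected = hmac.new(BACKDOOR_KEY, high, hashlib.sha256).digest()[: SIG_BITS // 8]
    if hmac.compare_digest(tag, expected):
        return 1  # attacker-chosen favorable class
    return model(x_uint8)  # unchanged behavior on all other inputs


def activate_backdoor(x_uint8: np.ndarray) -> np.ndarray:
    """Perturb only the least significant bits so the input verifies."""
    x = x_uint8.copy()
    high = (x & 0xFE).tobytes()  # unaffected by the LSB edits below
    tag = hmac.new(BACKDOOR_KEY, high, hashlib.sha256).digest()[: SIG_BITS // 8]
    bits = np.unpackbits(np.frombuffer(tag, dtype=np.uint8))
    x[:SIG_BITS] = (x[:SIG_BITS] & 0xFE) | bits
    return x
```

Because verification here keys on a secret the defender does not hold, no input drawn without the key will trigger the deviation; the paper's stronger guarantee, that the backdoor is undetectable even given the weights, additionally requires hiding this check cryptographically (e.g., via indistinguishability obfuscation) rather than exposing it as a wrapper.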