Detecting semantic backdoors in classification models--where certain classes can be activated by natural but out-of-distribution inputs--is an important problem that has received relatively little attention. Semantic backdoors are significantly harder to detect than trigger-pattern backdoors because no clearly identifiable trigger is present. We tackle this problem under the assumption that both the clean training dataset and the training recipe of the model are known. These assumptions are motivated by a consumer protection scenario in which the responsible authority performs mystery shopping to test a machine learning service provider. In this scenario, the authority uses the provider's resources and tools to train a model on a given dataset and tests whether the provider included a backdoor. In our proposed approach, the authority creates a reference model pool by training a small number of clean and poisoned models on trusted infrastructure, and calibrates a model distance threshold that identifies clean models. We propose and experimentally analyze several approaches to computing model distances, and we also test a scenario in which the provider mounts an adaptive attack to avoid detection. The most reliable method is based on requesting adversarial training from the provider. Model distance is best measured on a set of input samples generated by inverting the models so as to maximize their distance from clean samples. With these settings, our method can often completely separate clean and poisoned models, and it also proves superior to state-of-the-art backdoor detectors.
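The calibration step described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's exact procedure: it assumes that model distances (by whatever metric is used) have already been computed for the clean and poisoned reference models, and it places the threshold midway between the largest clean distance and the smallest poisoned distance.

```python
def calibrate_threshold(clean_dists, poisoned_dists):
    """Pick a distance threshold separating clean from poisoned reference models.

    Assumes clean models lie closer to the reference pool than poisoned ones
    (i.e., the two distance populations are separable). Hypothetical midpoint
    rule for illustration only.
    """
    hi_clean = max(clean_dists)     # worst-case (largest) clean distance
    lo_poison = min(poisoned_dists)  # best-case (smallest) poisoned distance
    return (hi_clean + lo_poison) / 2


def is_clean(model_dist, threshold):
    # A tested model is flagged clean if its distance stays below the threshold.
    return model_dist < threshold


# Hypothetical distances computed against the reference model pool.
clean = [0.10, 0.12, 0.15]
poisoned = [0.40, 0.55, 0.61]

t = calibrate_threshold(clean, poisoned)
print(t)
print(is_clean(0.2, t), is_clean(0.5, t))
```

When the clean and poisoned distance populations overlap, no single threshold achieves perfect separation; the paper's contribution is choosing distance measures (e.g., on model-inverted samples, with adversarial training) that make such overlap rare.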