In model extraction attacks, the goal is to recover the parameters of a black-box machine learning model by querying the model on a chosen set of data points. With the increasing demand for explanations, these queries may include counterfactual queries in addition to the typically considered factual queries. In this work, we consider linear models and three types of queries: factual, counterfactual, and robust counterfactual. First, for an arbitrary set of queries, we derive a novel mathematical characterization of the classification regions on which the decision of the unknown model is determined, without recovering any of its parameters. Second, we derive bounds on the number of (robust) counterfactual queries needed to extract the model's parameters under arbitrary norm-based distances. We show that the full model can be recovered from a single counterfactual query when a differentiable distance measure is employed. In contrast, when a polyhedral distance is used, the number of required queries grows linearly with the dimension of the data space; for robust counterfactuals, this number doubles. Consequently, the choice of distance function and the robustness of the counterfactuals have a significant impact on the model's security.
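The single-query result for differentiable distances admits a simple illustration: under the Euclidean distance, the closest counterfactual of a point is its orthogonal projection onto the decision boundary, so the displacement between a point and its counterfactual is parallel to the weight vector. Below is a minimal Python sketch of this extraction, assuming a hypothetical closest-counterfactual oracle `counterfactual_query` and a random toy model; neither is taken from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Hidden linear classifier sign(w_true @ x + b_true); hypothetical toy instance.
w_true = rng.normal(size=d)
b_true = rng.normal()

def counterfactual_query(x):
    """Oracle returning the closest counterfactual under the Euclidean distance:
    the orthogonal projection of x onto the hyperplane w @ x + b = 0."""
    return x - ((w_true @ x + b_true) / (w_true @ w_true)) * w_true

# One factual query (the label of x) and one counterfactual query.
x = rng.normal(size=d)
y = np.sign(w_true @ x + b_true)   # label returned by the factual query
x_cf = counterfactual_query(x)

# x - x_cf is parallel to w_true, so orienting it by the label y recovers the
# weight vector up to a positive scale; b follows from w @ x_cf + b = 0,
# since the counterfactual lies exactly on the decision boundary.
w_hat = y * (x - x_cf)
b_hat = -w_hat @ x_cf

# The recovered pair matches the hidden one up to a positive scaling,
# which leaves the classifier's decisions unchanged.
scale = w_true[0] / w_hat[0]
assert np.allclose(scale * w_hat, w_true)
assert np.isclose(scale * b_hat, b_true)
print("recovered decision boundary matches the hidden model")
```

Intuitively, under a polyhedral distance the closest counterfactual can instead lie along a single face direction of the unit ball, so one query no longer reveals the full weight vector, which is consistent with the linear-in-dimension query counts stated above.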