Modern machine learning models are increasingly deployed behind APIs. In this setting, standard weight-privatization methods (e.g., DP-SGD) inject more noise than necessary, sacrificing utility. While model weights may vary significantly across training datasets, model responses to specific inputs are far lower dimensional and more stable. This motivates enforcing privacy guarantees directly on model outputs. We approach this under PAC privacy, which provides instance-based privacy guarantees for arbitrary black-box functions by controlling mutual information (MI). Importantly, PAC privacy explicitly rewards output stability with reduced noise. A central challenge remains, however: response privacy requires composing a large number of adaptively chosen, potentially adversarial queries issued by untrusted users, a setting in which existing composition results for PAC privacy are inadequate. We introduce a new algorithm that achieves adversarial composition via adaptive noise calibration, and we prove that mutual-information guarantees accumulate linearly under adaptive and adversarial querying. Experiments across tabular, vision, and NLP tasks show that our method achieves high utility at extremely small per-query privacy budgets. On CIFAR-10, we achieve 87.79% accuracy with a per-step MI budget of $2^{-32}$, enabling one million queries to be served while provably bounding membership inference attack (MIA) success rates at 51.08% -- the same guarantee as $(0.04, 10^{-5})$-DP. Furthermore, we show that private responses can be used to label public data and distill a publishable privacy-preserving model: using an ImageNet subset as the public dataset, our model distilled from 210,000 responses achieves 91.86% accuracy on CIFAR-10 with MIA success upper-bounded by 50.49%, comparable to $(0.02, 10^{-5})$-DP.
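As a sanity check on the headline numbers, here is a minimal worked calculation, assuming linear MI composition across queries and the standard Pinsker-type bound relating total mutual information to membership-inference advantage, $\mathrm{Adv} \le \sqrt{\mathrm{MI}_{\mathrm{total}}/2}$ (the exact conversion used in the paper may differ; this sketch merely reproduces the quoted figures):
\[
\mathrm{MI}_{\mathrm{total}} = 10^{6} \cdot 2^{-32} \approx 2.33 \times 10^{-4},
\qquad
\Pr[\text{MIA succeeds}] \;\le\; \tfrac{1}{2} + \sqrt{\tfrac{\mathrm{MI}_{\mathrm{total}}}{2}} \approx 0.5108 ,
\]
matching the 51.08% bound for one million queries. For the distilled model, $\mathrm{MI}_{\mathrm{total}} = 2.1 \times 10^{5} \cdot 2^{-32} \approx 4.89 \times 10^{-5}$, giving $\tfrac{1}{2} + \sqrt{4.89 \times 10^{-5}/2} \approx 0.5049$, i.e., the 50.49% bound.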