Modern on-device AI systems that rely on retrieval-augmented inference must be able to release and share datastores without compromising individual privacy. Differential privacy (DP) addresses this need with a formal guarantee that individual contributions remain indistinguishable, even under adversarial analysis. In this paper, we introduce a hashing-based probability generation framework for creating and releasing differentially private datastores. Our approach uses locality-sensitive hashing (LSH) to efficiently partition high-dimensional data into buckets. We then add calibrated DP noise to the accumulated votes in each bucket and generate a probability distribution over classes. The method applies broadly to any pipeline that requires secure creation and release of (key, value) datastores. We conduct experiments on seven datasets with varying sample sizes and with class counts ranging from 2 to 14. At ε = 5, our released DP datastore achieves strong privacy protection with an average accuracy drop of only 2.6%. Finally, we benchmark the resilience of the DP datastore to membership inference attacks, reducing attack accuracy to 53.60%.
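The pipeline summarized above (hash into buckets, accumulate class votes, add calibrated noise, normalize) can be illustrated with a minimal sketch. The abstract does not name the LSH family or the noise mechanism, so the choices here are assumptions: random-hyperplane LSH, the Laplace mechanism with scale 1/ε under add/remove-one-record neighboring datasets, and the hypothetical function names `build_dp_datastore` and `lookup`. It is not the authors' implementation.

```python
import numpy as np

def build_dp_datastore(X, y, num_classes, num_bits=8, epsilon=5.0, seed=0):
    """Illustrative sketch only: LSH buckets -> noisy class votes -> probabilities."""
    rng = np.random.default_rng(seed)

    # Assumption: random-hyperplane LSH. The hyperplanes are drawn
    # independently of the data, so releasing them reveals nothing about it.
    planes = rng.standard_normal((X.shape[1], num_bits))
    bits = (X @ planes > 0).astype(np.int64)
    bucket_ids = bits @ (1 << np.arange(num_bits))  # pack sign bits into an integer id

    # Accumulate one vote per record in its (bucket, class) cell.
    votes = np.zeros((1 << num_bits, num_classes))
    np.add.at(votes, (bucket_ids, y), 1.0)

    # Assumption: Laplace mechanism. Adding or removing one record changes
    # exactly one cell by 1, so the L1 sensitivity is 1 and the scale is 1/epsilon.
    # Noise is added to every cell, including those of empty buckets.
    noisy = votes + rng.laplace(0.0, 1.0 / epsilon, size=votes.shape)

    # Post-processing (does not affect the DP guarantee): clip negatives,
    # then normalize each bucket into a distribution over classes.
    noisy = np.clip(noisy, 0.0, None)
    totals = noisy.sum(axis=1, keepdims=True)
    safe_totals = np.where(totals > 0, totals, 1.0)
    probs = np.where(totals > 0, noisy / safe_totals, 1.0 / num_classes)
    return planes, probs

def lookup(planes, probs, x):
    """Hash a query with the same hyperplanes and return its bucket's class distribution."""
    bits = (x @ planes > 0).astype(np.int64)
    return probs[bits @ (1 << np.arange(planes.shape[1]))]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 32))   # synthetic keys
    y = rng.integers(0, 4, size=500)     # synthetic class labels
    planes, probs = build_dp_datastore(X, y, num_classes=4, num_bits=6)
    print(lookup(planes, probs, X[0]))
```

In this sketch only `planes` and `probs` would be released; the raw records and exact vote counts never leave the device. Buckets whose noisy votes all clip to zero fall back to a uniform distribution, which is one of several reasonable conventions.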