Large-scale data collection, from national censuses to IoT-enabled smart homes, routinely gathers dozens of attributes per individual. These multi-attribute datasets are crucial for analytics but pose significant privacy risks. Local Differential Privacy (LDP) is a powerful tool for protecting user privacy by allowing users to locally perturb their records before releasing them to an untrusted data aggregator. However, existing LDP mechanisms either split the privacy budget across all attributes or treat each attribute independently, thereby ignoring natural inter-attribute correlations. This leads to excessive noise and, consequently, significant utility loss, particularly in high-dimensional datasets. We introduce a two-phase LDP framework that overcomes these limitations by privately learning and exploiting inter-attribute dependencies. In Phase~I, a small subset of users applies a standard per-attribute LDP mechanism, enabling the aggregator to derive dependency information from the privatized data. In Phase~II, each remaining user perturbs a single randomly chosen attribute with the full privacy budget, while the unreported attributes are reconstructed using Phase~I statistics, incurring no additional privacy cost. As a concrete instantiation, we develop Correlated Randomized Response (Corr-RR), which employs correlation-aware probabilistic mappings to substantially improve estimation accuracy. We prove that Corr-RR satisfies $ε$-LDP, and demonstrate through extensive experiments on synthetic and real-world datasets that it consistently outperforms state-of-the-art baselines, with the largest gains in high-dimensional and strongly correlated datasets.
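To make the two-phase reporting concrete, the following minimal Python sketch shows only the user-side perturbation step. It rests on assumptions the abstract does not state: attributes are binary, the "standard per-attribute LDP mechanism" of Phase~I is taken to be binary randomized response with the budget split evenly as $\varepsilon/d$ per attribute, and all function names are hypothetical. The correlation-aware mapping that Corr-RR uses to reconstruct unreported attributes is not specified here and is deliberately omitted.

```python
import math
import random


def rr_keep_prob(epsilon: float) -> float:
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    return bit if rng.random() < rr_keep_prob(epsilon) else 1 - bit


def phase1_report(record, epsilon, rng):
    """Assumed Phase I: perturb every attribute with an even split eps/d."""
    d = len(record)
    return [randomized_response(b, epsilon / d, rng) for b in record]


def phase2_report(record, epsilon, rng):
    """Phase II: perturb one uniformly chosen attribute with the full budget.

    The index is drawn independently of the data, so reporting it
    consumes no privacy budget.
    """
    j = rng.randrange(len(record))
    return j, randomized_response(record[j], epsilon, rng)


def debias_mean(reports, epsilon):
    """Unbiased estimate of the true fraction of 1s from perturbed bits."""
    p = rr_keep_prob(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

Under these assumptions, a Phase~II report keeps the true bit with probability $e^{\varepsilon}/(1+e^{\varepsilon})$, whereas each Phase~I bit is kept with probability $e^{\varepsilon/d}/(1+e^{\varepsilon/d})$, which approaches $1/2$ as $d$ grows. This is the source of the excessive noise that budget splitting incurs in high dimensions.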