The Privacy-Utility Tradeoff in Rank-Preserving Dataset Obfuscation

Dataset obfuscation refers to techniques in which random noise is added to the entries of a dataset before its public release, to protect against leakage of private information. In this work, dataset obfuscation is considered under two objectives: i) rank-preservation: to preserve, in the obfuscated dataset, the row ordering induced by a given rank function, and ii) anonymity: to protect user anonymity under fingerprinting attacks. The first objective, rank-preservation, is of interest in applications such as the design of search engines and recommendation systems, feature matching, and social network analysis. Fingerprinting attacks, considered in evaluating the anonymity objective, are privacy attacks in which an attacker constructs a fingerprint of a victim from the victim's observed activities, such as online web activities, and compares this fingerprint with information extracted from a publicly released obfuscated dataset to identify the victim. By evaluating the performance limits of a class of obfuscation mechanisms over asymptotically large datasets, a fundamental tradeoff between rank-preservation and user anonymity is quantified. Single-letter obfuscation mechanisms are considered, in which each entry in the dataset is perturbed by independent noise, and their fundamental performance limits are characterized using large deviation techniques. The optimal obfuscating test-channel, which optimizes the privacy-utility tradeoff, is characterized as the solution of a convex optimization problem that can be solved efficiently. Numerical simulations of various scenarios are provided to verify the theoretical derivations.
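To make the two objectives concrete, the following is a minimal toy sketch, not the mechanism or the analysis of the paper. It assumes a binary dataset, a binary symmetric channel with flip probability `flip_prob` as the single-letter test-channel, the row sum as the rank function, and a minimum-Hamming-distance attacker. All function names and these modelling choices are illustrative assumptions that do not come from the abstract.

```python
import random
from itertools import combinations


def obfuscate(dataset, flip_prob, rng):
    """Single-letter mechanism (assumed: binary symmetric channel).

    Each binary entry is flipped independently with probability flip_prob.
    """
    return [[bit ^ (rng.random() < flip_prob) for bit in row] for row in dataset]


def rank_score(row):
    """Assumed rank function: the number of ones in the row."""
    return sum(row)


def pairwise_order_agreement(original, released):
    """Fraction of row pairs whose relative order under rank_score is unchanged.

    This is a simple utility proxy for rank-preservation.
    """
    pairs = list(combinations(range(len(original)), 2))
    agree = 0
    for i, j in pairs:
        before = rank_score(original[i]) - rank_score(original[j])
        after = rank_score(released[i]) - rank_score(released[j])
        # Same strict order, or a tie that remains a tie.
        if before * after > 0 or (before == 0 and after == 0):
            agree += 1
    return agree / len(pairs)


def identify(fingerprint, released):
    """Toy fingerprinting attack: pick the released row nearest in Hamming distance."""
    distances = [sum(a != b for a, b in zip(fingerprint, row)) for row in released]
    return distances.index(min(distances))
```

Sweeping `flip_prob` from 0 to 0.5 on a random dataset, and recording both `pairwise_order_agreement` and how often `identify` recovers the victim's row, gives an empirical picture of the tradeoff: more noise lowers the attacker's success rate but also disturbs the row ordering.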

