User-driven privacy allows individuals to control whether, and at what granularity, their data is shared, leading to datasets that mix original, generalized, and missing values within the same records and attributes. While such representations are intuitive from a privacy perspective, they pose challenges for machine learning, which typically treats non-original values as new categories or as missing, thereby discarding the semantics of generalization. For learning from such tabular data, we propose novel data transformation strategies that account for heterogeneous anonymization and evaluate them alongside standard imputation and LLM-based approaches. Across multiple datasets, privacy configurations, and deployment scenarios, we demonstrate that our method reliably recovers utility. Our results show that generalized values are preferable to pure suppression, that the best data preparation strategy depends on the scenario, and that consistent data representations are crucial for maintaining downstream utility. Overall, our findings highlight that effective learning hinges on the appropriate handling of anonymized values.