This work introduces ClustEm4Ano, an anonymization pipeline for generalization- and suppression-based anonymization of nominal textual tabular data. It automatically generates value generalization hierarchies (VGHs), which can in turn be used to generalize the attributes that make up quasi-identifiers. The pipeline uses text embeddings and iterative clustering to produce semantically close value generalizations. We applied KMeans and Hierarchical Agglomerative Clustering to 13 predefined text embeddings, covering both open-source models and closed-source models accessed via APIs. The approach is evaluated experimentally on a well-known anonymization benchmark, the Adult dataset from the UCI Machine Learning Repository. ClustEm4Ano supports anonymization procedures by offering more candidate hierarchies than relying on arbitrarily chosen VGHs. Experiments demonstrate that the generated VGHs can outperform manually constructed ones in terms of downstream efficacy, especially for small values of $k$ in $k$-anonymity ($2 \leq k \leq 30$), and can therefore improve the quality of anonymized datasets. Our implementation is publicly available.
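To make the idea of building a VGH by iterative clustering over embeddings more concrete, the following is a minimal sketch and not the ClustEm4Ano implementation. It assumes hand-made two-dimensional toy vectors in place of real text embeddings, a simple average-linkage agglomerative merge written with the standard library only, and a hypothetical list of cluster counts per hierarchy level; the names `TOY_EMBEDDINGS`, `merge_until`, and `build_vgh` are illustrative and do not come from the paper.

```python
from itertools import combinations
from math import dist

# Toy 2-D "embeddings": the numbers are made up for illustration only.
# A real run would obtain vectors from a text embedding model.
TOY_EMBEDDINGS = {
    "Bachelors": (0.9, 0.1),
    "Masters": (1.0, 0.2),
    "Doctorate": (1.1, 0.3),
    "HS-grad": (0.1, 0.9),
    "11th": (0.0, 1.0),
    "9th": (0.05, 1.1),
}


def average_linkage(a, b, vectors):
    """Mean pairwise Euclidean distance between two clusters of values."""
    total = sum(dist(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def merge_until(clusters, target, vectors):
    """Greedily merge the two closest clusters until `target` clusters remain."""
    clusters = [frozenset(c) for c in clusters]
    while len(clusters) > target:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


def build_vgh(vectors, cluster_counts):
    """Return one {value: generalized label} mapping per hierarchy level.

    Level 0 keeps the original values. Each later level is obtained by
    merging the clusters of the previous level, so the levels are nested.
    """
    clusters = [frozenset([value]) for value in vectors]
    levels = [{value: value for value in vectors}]
    for target in cluster_counts:
        clusters = merge_until(clusters, target, vectors)
        level = {}
        for cluster in clusters:
            # A single remaining cluster corresponds to full suppression.
            if len(clusters) == 1:
                label = "*"
            else:
                label = "{" + ", ".join(sorted(cluster)) + "}"
            for value in cluster:
                level[value] = label
        levels.append(level)
    return levels


# Hypothetical level sizes: 6 original values -> 2 groups -> 1 root.
vgh = build_vgh(TOY_EMBEDDINGS, cluster_counts=[2, 1])
```

With these toy vectors, level 1 places the three degree values in one group and the three school-level values in another, and level 2 maps every value to `*`. Because each level is built by merging the previous level's clusters, every value has exactly one parent at each level, which is the property a generalization hierarchy needs.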