HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection

Data is the foundation of progress in deep learning, and tabular data, with its structured format, is especially well suited to modeling. However, even in the era of LLMs, obtaining tabular data from sensitive domains remains difficult because of privacy or copyright concerns. There is therefore an urgent need to explore how models such as LLMs can generate realistic, privacy-preserving synthetic tabular data. In this paper, we take a step toward using LLMs for tabular data synthesis and privacy protection by introducing HARMONIC, a new framework for tabular data generation and evaluation. In the generation part of the framework, unlike previous small-scale LLM-based methods that rely on continued pre-training, we fine-tune larger-scale LLMs to generate tabular data and enhance privacy. Based on the idea of the k-nearest neighbors algorithm, we construct an instruction fine-tuning dataset that encourages LLMs to discover inter-row relationships. Through fine-tuning, the LLMs are then trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage. In the evaluation part of the framework, we develop DLT, a privacy risk metric specific to LLM-based synthetic data generation, and LLE, a performance metric for downstream LLM tasks. Our experiments show that this generation framework achieves performance comparable to existing methods with better privacy. They also show that our evaluation framework can assess both the effectiveness of synthetic data and its privacy risks in LLM scenarios.
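To make the k-nearest-neighbors idea concrete, the following is a minimal sketch of how an instruction fine-tuning example could be built from each row and its nearest neighbors. It is an illustration under stated assumptions, not the construction used in HARMONIC: the function names, the prompt wording, the JSON serialization, and the unscaled Euclidean distance over numeric columns are all hypothetical choices made here.

```python
import json
import math


def row_distance(a, b, numeric_keys):
    """Euclidean distance over the numeric columns only (assumed metric)."""
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in numeric_keys))


def k_nearest(rows, index, k, numeric_keys):
    """Return indices of the k rows closest to rows[index], excluding itself."""
    others = [i for i in range(len(rows)) if i != index]
    others.sort(key=lambda i: row_distance(rows[index], rows[i], numeric_keys))
    return others[:k]


def build_instruction_examples(rows, k, numeric_keys):
    """Pair each row (as the target) with its k nearest neighbors (as context)."""
    examples = []
    for i, target in enumerate(rows):
        neighbors = [rows[j] for j in k_nearest(rows, i, k, numeric_keys)]
        examples.append({
            # Hypothetical prompt wording, chosen only for illustration.
            "instruction": "Here are some similar records. "
                           "Generate one more record in the same format.",
            "input": "\n".join(json.dumps(row) for row in neighbors),
            "output": json.dumps(target),
        })
    return examples


# Toy table with made-up values, used only to exercise the sketch.
rows = [
    {"age": 25, "income": 30, "label": "no"},
    {"age": 27, "income": 32, "label": "no"},
    {"age": 52, "income": 90, "label": "yes"},
    {"age": 55, "income": 95, "label": "yes"},
]
examples = build_instruction_examples(rows, k=1, numeric_keys=["age", "income"])
```

Each resulting example shows the model a few similar rows and asks for one more in the same format, which is one plausible way to expose inter-row relationships during fine-tuning. In practice the numeric columns would likely need scaling before distances are computed, and categorical columns would need their own treatment.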
