In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks by conditioning on demonstrations of question-answer pairs, and it has been shown to perform comparably to costly model retraining and fine-tuning. Recently, ICL has been extended to tabular data by serializing individual records into natural-language demonstration examples. However, LLMs have been shown to leak information contained in their prompts, and because tabular data often contain sensitive information, protecting the tabular data used in ICL is a critical area of research. This work is an initial investigation into using differential privacy (DP), the long-established gold standard for data privacy and anonymization, to protect tabular data used in ICL. Specifically, we investigate applying DP mechanisms to privatize the data before serialization and prompting. We formulate two private ICL frameworks with provable privacy guarantees, one in the local DP setting (LDP-TabICL) and one in the global DP setting (GDP-TabICL), which inject noise into individual records and into group statistics, respectively. We evaluate both frameworks on eight real-world tabular datasets across multiple ICL and DP settings. Our evaluations show that DP-based ICL can protect the privacy of the underlying tabular data while achieving performance comparable to non-LLM baselines, especially under high privacy regimes.
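The distinction between noising individual records (local DP) and noising group statistics (global DP) before serialization can be made concrete with a small sketch. The abstract does not name the mechanisms used, so the code below substitutes two textbook choices as assumptions: k-ary randomized response for the local case and Laplace noise on a histogram of counts for the global case. All function names, the even budget split across attributes, the add/remove-one-record neighboring notion, and the serialization template are hypothetical illustrations, not the paper's implementation.

```python
import math
import random
from itertools import product


def k_randomized_response(value, domain, epsilon, rng):
    """Keep `value` with probability e^eps / (e^eps + k - 1); otherwise
    report a uniformly random *other* value from `domain`."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return value
    return rng.choice([v for v in domain if v != value])


def privatize_record(record, domains, epsilon, rng):
    """Local-DP sketch: perturb every attribute of one record.
    Assumption: the budget is split evenly across attributes
    (sequential composition)."""
    eps_each = epsilon / len(domains)
    return {name: k_randomized_response(record[name], domain, eps_each, rng)
            for name, domain in domains.items()}


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_group_counts(records, feature, label, domains, epsilon, rng):
    """Global-DP sketch: noisy counts for every (feature value, label) cell.
    Assumption: neighboring datasets differ by adding or removing one
    record, which changes exactly one cell by 1 (L1 sensitivity 1)."""
    counts = {cell: 0 for cell in product(domains[feature], domains[label])}
    for r in records:
        counts[(r[feature], r[label])] += 1
    # Clamping at zero is post-processing and does not affect the guarantee.
    return {cell: max(0.0, n + laplace_noise(1.0 / epsilon, rng))
            for cell, n in counts.items()}


def serialize(record, label):
    """Hypothetical template that turns one record into a demonstration."""
    parts = [f"The {name} is {value}."
             for name, value in record.items() if name != label]
    return " ".join(parts) + f" Answer: {record[label]}"


if __name__ == "__main__":
    rng = random.Random(0)
    domains = {"workclass": ["private", "public", "self-employed"],
               "income": ["low", "high"]}
    data = [{"workclass": "private", "income": "high"},
            {"workclass": "public", "income": "low"},
            {"workclass": "private", "income": "low"}]

    # Local: each record is perturbed on its own, then serialized.
    for row in data:
        print(serialize(privatize_record(row, domains, 1.0, rng), "income"))

    # Global: a trusted curator releases noisy group statistics instead.
    stats = noisy_group_counts(data, "workclass", "income", domains, 1.0, rng)
    for (workclass, income), n in stats.items():
        print(f"About {n:.1f} records have workclass {workclass} and income {income}.")
```

In the local sketch the prompt contains perturbed records, so no trusted party ever sees the raw data. In the global sketch a curator sees the raw data but releases only noisy aggregates. Handling several features or numeric attributes in the global case would require dividing the privacy budget further, which this sketch leaves out.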