As large language models (LLMs) are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. We initiate the study of differentially private (DP) language identification and generation in the agnostic statistical setting, establishing algorithms and matching lower bounds that precisely quantify the cost of privacy. For both tasks, approximate $(\varepsilon, \delta)$-DP with constant $\varepsilon > 0$ recovers the non-private error rates: $\exp(-r(n))$ for identification (for any $r(n) = o(n)$) and $\exp(-\Omega(n))$ for generation. Under pure $\varepsilon$-DP, the exponents degrade by a multiplicative factor of $\min\{1, \varepsilon\}$, which we show is tight up to constants. Notably, for generation under pure DP and mild assumptions, the upper bound $\exp(-\min\{1,\varepsilon\} \cdot \Omega(n))$ matches the lower bound up to constants, establishing an optimal rate. Our results show that the cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a $\min\{1,\varepsilon\}$ factor in the exponent under pure DP.