Differentially Private Language Generation and Identification in the Limit

We initiate the study of language generation in the limit, a model recently introduced by Kleinberg and Mullainathan [KM24], under the constraint of differential privacy. We consider the continual release model, where a generator must eventually output a stream of valid strings while protecting the privacy of the entire input sequence. Our first main result is that for countable collections of languages, privacy comes at no qualitative cost: we provide an $\varepsilon$-differentially-private algorithm that generates in the limit from any countable collection. This stands in contrast to many learning settings where privacy renders learnability impossible. However, privacy does impose a quantitative cost: there are finite collections of size $k$ for which uniform private generation requires $\Omega(k/\varepsilon)$ samples, whereas just one sample suffices non-privately. We then turn to the harder problem of language identification in the limit. Here, we show that privacy creates fundamental barriers. We prove that no $\varepsilon$-DP algorithm can identify a collection containing two languages with an infinite intersection and a finite set difference, a condition far stronger than the classical non-private characterization of identification. Next, we turn to the stochastic setting, where the input strings are drawn i.i.d. from a distribution instead of being chosen by an adversary. Here, we show that private identification is possible if and only if the collection is identifiable in the adversarial model. Together, our results establish new dimensions along which generation and identification differ and, for identification, a separation between adversarial and stochastic settings induced by privacy constraints.
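
For orientation, the privacy guarantee referred to above can be sketched with the textbook definition of $\varepsilon$-differential privacy. This is not necessarily the exact formulation used in the paper: the algorithm $\mathcal{A}$, the inputs $x, x'$, the output set $S$, and the neighboring relation (two input sequences that differ in a single element) are assumptions introduced here for illustration.

$$
\Pr[\mathcal{A}(x) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{A}(x') \in S]
\quad \text{for all neighboring inputs } x, x' \text{ and all measurable output sets } S.
$$

As a hypothetical instance of the identification barrier, the pair $L_1 = \mathbb{N}$ and $L_2 = \mathbb{N} \setminus \{0\}$ has an infinite intersection, $L_1 \cap L_2 = \mathbb{N} \setminus \{0\}$, and a finite set difference, $L_1 \setminus L_2 = \{0\}$, so a collection containing both would fall under the stated impossibility result.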
