The advancement of large language models (LLMs) has brought notable improvements across a wide range of applications while raising concerns about the exposure of private data. One key capability of LLMs is forming associations between different pieces of information, which becomes a privacy risk when that information is personally identifiable information (PII). This paper examines the association capabilities of language models and aims to uncover the factors that influence how well they associate information. Our study reveals that as models scale up, their capacity to associate entities and information strengthens, particularly when the target pairs have shorter co-occurrence distances or higher co-occurrence frequencies. However, there is a distinct performance gap between associating commonsense knowledge and associating PII, with the latter showing lower accuracy. Although the proportion of accurately predicted PII is relatively small, LLMs can still predict specific email addresses and phone numbers when given appropriate prompts. These findings underscore the risk that the evolving capabilities of LLMs pose to PII confidentiality, especially as the models continue to grow in scale and power.