The advancement of large language models (LLMs) has brought notable improvements across a wide range of applications, while also raising concerns about the exposure of private data. One key capability of LLMs is forming associations between different pieces of information, which becomes a risk when that information is personally identifiable information (PII). This paper examines the association capabilities of language models and aims to identify the factors that influence how well they associate information. Our study shows that as models scale up, their capacity to associate entities and information strengthens, particularly when the target pairs have shorter co-occurrence distances or higher co-occurrence frequencies. However, there is a distinct performance gap between associating commonsense knowledge and associating PII, with the latter showing lower accuracy. Although the proportion of accurately predicted PII is relatively small, LLMs can still predict specific email addresses and phone numbers when given appropriate prompts. These findings underscore the potential risk to PII confidentiality posed by the evolving capabilities of LLMs, especially as they continue to grow in scale and power.