Large Language Models have received significant attention due to their ability to solve a wide range of complex tasks. However, these models memorize a significant proportion of their training data, which poses a serious privacy threat when that data is disclosed at inference time. To mitigate this unintended memorization, it is crucial to understand which elements are memorized and why. Most existing works provide a posteriori explanations, which are of limited practical interest. To address this gap, we propose a new approach based on sliced mutual information to detect memorized samples a priori, in a classification setting. It is effective from the early stages of training and is readily adaptable to practical scenarios. Our method is supported by new theoretical results that we prove, and it requires a low computational budget. We obtain strong empirical results, paving the way for the systematic inspection and protection of these vulnerable samples before memorization happens.
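To make the central quantity concrete: sliced mutual information (SMI) replaces a hard-to-estimate high-dimensional MI with an average of one-dimensional MI values computed along random projections of the features. The sketch below is a minimal Monte-Carlo estimator of SMI between feature vectors and discrete class labels, not the paper's actual detector; the function names (`sliced_mi`, `discrete_mi`), the histogram plug-in MI estimator, and all parameter choices are illustrative assumptions.

```python
import numpy as np

def discrete_mi(a, b):
    """Plug-in mutual information (in nats) between two discrete arrays."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1)          # joint histogram of (a, b)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)        # marginal of a
    pb = joint.sum(axis=0, keepdims=True)        # marginal of b
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def sliced_mi(X, y, n_slices=200, n_bins=16, seed=0):
    """Monte-Carlo sliced MI: average the 1-D MI between a random
    projection of X (shape (n, d)) and the labels y over many slices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    total = 0.0
    for _ in range(n_slices):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)           # random direction on the sphere
        proj = X @ theta                          # 1-D slice of the features
        edges = np.quantile(proj, np.linspace(0, 1, n_bins + 1)[1:-1])
        binned = np.digitize(proj, edges)         # equal-mass discretization
        total += discrete_mi(binned, y)
    return total / n_slices
```

As a sanity check, features that carry label information should score markedly higher than independent noise, which only picks up the small positive bias of the plug-in estimator.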