Large language models (LLMs) are increasingly used to generate code at scale. Prior work has investigated whether training data can be recovered from model outputs by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: the extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to the target code) against a pretrained reference model (not exposed to it). We prompt both models with Python function signatures and measure both textual similarity and functional similarity, the latter with LLM-as-a-judge and execution-based metrics. Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.
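To make the gap between textual and functional similarity concrete, the following is a minimal sketch and not the metrics used in this work. It assumes a character-level `difflib` ratio as a stand-in for a textual-overlap metric and a simple output-agreement rate on hand-picked inputs as a stand-in for an execution-based metric. The function `sum_of_squares`, the two source strings, and the test inputs are all hypothetical illustrations.

```python
import difflib

# Hypothetical "training" implementation of a target function.
REFERENCE_SRC = """
def sum_of_squares(values):
    total = 0
    for v in values:
        total += v * v
    return total
"""

# Hypothetical model generation: same behavior, different surface form.
GENERATED_SRC = """
def sum_of_squares(values):
    return sum(x ** 2 for x in values)
"""


def textual_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in metric."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load_function(source: str, name: str):
    """Define the function from source text and return it.

    exec() on untrusted model output is unsafe; a real audit would
    run generations in a sandbox.
    """
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def functional_agreement(f, g, inputs) -> float:
    """Fraction of inputs on which f and g return equal outputs."""
    matches = 0
    for args in inputs:
        try:
            same = f(*args) == g(*args)
        except Exception:
            same = False  # a crash on either side counts as a mismatch
        matches += int(same)
    return matches / len(inputs)


# Hand-picked inputs (an assumption; each entry is a tuple of arguments).
TEST_INPUTS = [([],), ([1, 2, 3],), ([-4, 0, 5],), ([10],)]

ref = load_function(REFERENCE_SRC, "sum_of_squares")
gen = load_function(GENERATED_SRC, "sum_of_squares")

text_score = textual_similarity(REFERENCE_SRC, GENERATED_SRC)
func_score = functional_agreement(ref, gen, TEST_INPUTS)

print(f"textual similarity:   {text_score:.2f}")
print(f"functional agreement: {func_score:.2f}")
```

In this toy case the two implementations agree on every test input while their source text only partly overlaps, which is the situation a purely textual audit would understate.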