Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but they must compare each code fragment against the entire training set, and this linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline. HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
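The two-stage idea described above (vector search to shortlist candidates, then Winnowing to re-rank them) can be illustrated with a minimal sketch. It is not the authors' implementation. The `embed` function is a hypothetical hashed bag-of-tokens stand-in for the SOURCETRACKER encoder, the candidate search is a brute-force scan rather than the sublinear index a real deployment would need, and the tokenizer, the parameters `k`, `w`, and `top_k`, and the containment score are all illustrative assumptions.

```python
import hashlib
import math
import re


def tokenize(code):
    # Crude lexer (assumption): identifiers, integers, single punctuation chars.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def stable_hash(text):
    # Deterministic 64-bit hash, independent of Python's per-process hash seed.
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def winnow(code, k=5, w=4):
    # Hash every k-gram of tokens, then keep the minimum hash of each
    # sliding window of w consecutive hashes (the Winnowing selection rule).
    tokens = tokenize(code)
    hashes = [stable_hash(" ".join(tokens[i:i + k]))
              for i in range(len(tokens) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def embed(code, dim=256):
    # Hypothetical stand-in for a learned encoder: L2-normalized
    # hashed bag of tokens.
    vec = [0.0] * dim
    for tok in tokenize(code):
        vec[stable_hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def vector_search(query, corpus_vecs, top_k):
    # Stage 1: cosine similarity against every corpus vector (brute force
    # here; an approximate nearest-neighbor index would replace this scan).
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, v)), idx)
              for idx, v in enumerate(corpus_vecs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]


def hybrid_search(query, corpus, corpus_vecs, top_k=3, k=5, w=4):
    # Stage 2: re-rank the shortlist by the fraction of the query's
    # fingerprints that also occur in each candidate (containment).
    candidates = vector_search(query, corpus_vecs, top_k)
    q_fp = winnow(query, k, w)

    def containment(idx):
        if not q_fp:
            return 0.0
        return len(q_fp & winnow(corpus[idx], k, w)) / len(q_fp)

    # sorted() is stable, so ties keep the vector-search order.
    return sorted(candidates, key=containment, reverse=True)


corpus = [
    "def add(a, b):\n    return a + b",
    "def read_lines(path):\n    with open(path) as handle:\n"
    "        return [line.strip() for line in handle]",
    "for i in range(10):\n    print(i * i)",
]
corpus_vecs = [embed(snippet) for snippet in corpus]
ranked = hybrid_search(corpus[1], corpus, corpus_vecs)
```

In this toy setup the fingerprints are recomputed for each candidate at query time. A practical system would precompute and store them alongside the vectors, so that only the shortlist of `top_k` candidates incurs fingerprint comparisons.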
