Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but they must compare each code fragment against the entire training set, and this linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search down to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of >= 60 tokens, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
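The two-stage idea described above (vector search to shortlist candidates, then Winnowing to re-rank them) can be illustrated with a minimal sketch. This is not the SOURCETRACKER or HST implementation: the bag-of-tokens `embed` function is a hypothetical stand-in for the learned encoder, the stage-one scan is brute force rather than a sub-linear vector index, and the tokenizer, the k-gram size `k`, the window size `w`, and the containment score are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import Counter
from math import sqrt


def tokenize(code):
    # Crude lexer (assumption): identifiers, integers, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def kgram_hashes(tokens, k):
    # Hash every run of k consecutive tokens to a 64-bit integer.
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(hashes, w):
    # Winnowing: keep the minimum hash of each window of w consecutive hashes.
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def embed(tokens):
    # Hypothetical stand-in for a learned encoder: a sparse token-count vector.
    return Counter(tokens)


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def containment(query_fp, cand_fp):
    # Fraction of the query's fingerprints that also occur in the candidate.
    return len(query_fp & cand_fp) / len(query_fp) if query_fp else 0.0


def hybrid_search(query, corpus, top_k=3, k=5, w=4):
    q_tokens = tokenize(query)
    q_vec = embed(q_tokens)
    q_fp = winnow(kgram_hashes(q_tokens, k), w)

    # Stage 1: shortlist by vector similarity (brute force in this sketch).
    by_similarity = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q_vec, embed(tokenize(corpus[i]))),
        reverse=True,
    )
    candidates = by_similarity[:top_k]

    # Stage 2: re-rank the shortlist by exact fingerprint overlap.
    # sorted() is stable, so ties keep their stage-1 order.
    return sorted(
        candidates,
        key=lambda i: containment(
            q_fp, winnow(kgram_hashes(tokenize(corpus[i]), k), w)
        ),
        reverse=True,
    )


if __name__ == "__main__":
    corpus = [
        "def add(a, b):\n    return a + b",
        "def mean(xs):\n    total = sum(xs)\n    return total / len(xs)",
        "for i in range(10):\n    print(i * 2)",
    ]
    print(hybrid_search(corpus[1], corpus))
```

In this sketch only the shortlisted candidates are fingerprinted at query time, which is what keeps the expensive exact comparison independent of corpus size; the overall query cost then depends on how the first stage is indexed.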
