We propose a system for marking sensitive or copyrighted texts so that their use in fine-tuning large language models can be detected, with statistical guarantees, under black-box access. Our method builds digital ``marks'' from invisible Unicode characters organized into (``cue'', ``reply'') pairs. During an audit, prompts containing only the ``cue'' fragments are issued to the model; regurgitation of the corresponding ``reply'' indicates that the marked document was used. To control false positives, we compare against held-out counterfactual marks and apply a ranking test, yielding a verifiable bound on the false positive rate. The approach is minimally invasive, scalable across many sources, robust to standard processing pipelines, and achieves high detection power even when marked data is a small fraction of the fine-tuning corpus.
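As an illustration of the two ingredients named above, the following minimal sketch draws a (``cue'', ``reply'') pair from invisible characters and computes a rank-based p-value against held-out counterfactual marks. It is not the authors' implementation: the zero-width alphabet `INVISIBLE`, the mark lengths, the helper names `make_mark` and `rank_p_value`, and the example scores are all assumptions made for illustration. The p-value uses the standard rank-test form $(1 + \#\{\text{counterfactual scores} \ge \text{true score}\}) / (K + 1)$, which is at most $\alpha$ with probability at most $\alpha$ when the true mark is exchangeable with the $K$ counterfactual ones.

```python
import random

# Assumption: a small alphabet of zero-width code points. The source does not
# specify which invisible Unicode characters it uses.
INVISIBLE = ["\u200b", "\u200c", "\u200d", "\u2060"]


def make_mark(rng, cue_len=8, reply_len=8):
    """Draw a random (cue, reply) pair of invisible-character strings."""
    cue = "".join(rng.choice(INVISIBLE) for _ in range(cue_len))
    reply = "".join(rng.choice(INVISIBLE) for _ in range(reply_len))
    return cue, reply


def rank_p_value(true_score, counterfactual_scores):
    """Rank-based p-value of the true mark among held-out counterfactual marks.

    Counts how many counterfactual scores are at least as large as the true
    score, then applies the usual +1 correction so the p-value is never zero.
    """
    k = len(counterfactual_scores)
    at_least = sum(1 for s in counterfactual_scores if s >= true_score)
    return (1 + at_least) / (k + 1)


rng = random.Random(0)
cue, reply = make_mark(rng)

# Hypothetical scores, e.g. how strongly the audited model emits each reply
# when prompted with the matching cue. These numbers are made up.
p = rank_p_value(0.9, [0.1, 0.2, 0.05, 0.3])
print(p)  # 0.2
```

With $K$ counterfactual marks the smallest attainable p-value is $1/(K+1)$, so in this sketch the number of held-out marks limits how small a false positive rate can be certified.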