Post-hoc unauthorized-training data detection for large language models (LLMs) typically assumes a query-with-originals regime: rights holders query a target LLM with raw proprietary data and assess whether the model assigns them stronger memorization-based detection signals, e.g., higher confidence or lower loss, than held-out non-training reference texts. We show that this regime becomes brittle under data laundering, where the target LLM is trained on semantics-preserving but stylistically or structurally transformed surrogates of proprietary data to obfuscate provenance. Since training-time exposure occurs in the laundered form, memorization signals may no longer appear on the originals, collapsing the candidate-reference signal separation that standard detectors rely on. We counter this threat by studying laundering-aware detection with raw proprietary data, a held-out reference corpus, and query access to the target LLM, while the laundering transformation is undisclosed. Since exact recovery of the laundered corpus is infeasible, we infer a detection-useful synthesis process via an auxiliary LLM that maps originals into training-like queries. To make this search tractable, we introduce Synthesis Data Reversion (SDR), which constrains the unbounded space of natural-language transformations through a goal-details abstraction: a high-level transformation goal, e.g., "lyrical rewriting", and fine-grained details, e.g., "with vivid imagery". SDR identifies the most likely goal and iteratively refines details so synthesized queries elicit stronger target-model detection signals. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently restores detection signals, offering a practical auditing layer against data laundering.
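The goal-details search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `separation`, `sdr_search`, `toy_synthesize`, and `toy_signal` are hypothetical, the auxiliary LLM and the target LLM are replaced by toy stand-in functions, the greedy refinement loop is one simple way to realize "iteratively refines details", and querying the references unchanged is a simplifying assumption.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   Synthesize: (original, goal, details) -> training-like query (auxiliary LLM)
#   Signal:     text -> detection signal, higher = stronger (e.g. negative loss)
Synthesize = Callable[[str, str, Sequence[str]], str]
Signal = Callable[[str], float]


def separation(signal: Signal, candidates: Sequence[str], references: Sequence[str]) -> float:
    """Mean candidate signal minus mean reference signal."""
    return mean(signal(c) for c in candidates) - mean(signal(r) for r in references)


def sdr_search(originals: Sequence[str], references: Sequence[str],
               goals: Sequence[str], detail_pool: Sequence[str],
               synthesize: Synthesize, signal: Signal,
               max_rounds: int = 3) -> Tuple[str, Tuple[str, ...], float]:
    """Pick the best goal, then greedily add details while separation improves."""
    def score(goal: str, details: Tuple[str, ...]) -> float:
        queries = [synthesize(x, goal, details) for x in originals]
        return separation(signal, queries, references)

    # Step 1: identify the most likely high-level goal (no details yet).
    best_goal = max(goals, key=lambda g: score(g, ()))
    best_details: Tuple[str, ...] = ()
    best_score = score(best_goal, best_details)

    # Step 2: refine details one at a time; stop when nothing helps.
    for _ in range(max_rounds):
        options = [best_details + (d,) for d in detail_pool if d not in best_details]
        if not options:
            break
        cand = max(options, key=lambda ds: score(best_goal, ds))
        cand_score = score(best_goal, cand)
        if cand_score <= best_score:
            break
        best_details, best_score = cand, cand_score
    return best_goal, best_details, best_score


# --- Toy stand-ins so the sketch runs without any real model ---
def toy_synthesize(original: str, goal: str, details: Sequence[str]) -> str:
    # Stand-in for the auxiliary LLM: tags the text instead of rewriting it.
    return f"<{goal}|{'+'.join(sorted(details))}> {original}"


originals = ["the river runs to the sea", "night falls over the town"]
references = ["stock prices rose on tuesday", "the committee met at noon"]

# The toy "target model" was trained only on laundered surrogates.
_laundered = {toy_synthesize(x, "lyrical rewriting", ["with vivid imagery"]) for x in originals}


def toy_signal(text: str) -> float:
    if text in _laundered:
        return 1.0  # exact training-time form: strong signal
    if text.startswith("<lyrical rewriting|"):
        return 0.4  # right goal, wrong details: partial signal
    return 0.0      # originals and references: no signal


baseline = separation(toy_signal, originals, references)
result = sdr_search(originals, references,
                    goals=["formal summary", "lyrical rewriting"],
                    detail_pool=["in short sentences", "with vivid imagery"],
                    synthesize=toy_synthesize, signal=toy_signal)
print(baseline, result)
```

In this toy setup the originals show no separation from the references, while the searched goal and details recover it. With a real target model, `signal` would wrap a query to that model (for example, a negated per-token loss), and `synthesize` would call the auxiliary LLM with a prompt built from the goal and details.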