Large Reasoning Models (LRMs) have rapidly gained prominence for their strong performance on complex tasks. Many modern black-box LRMs expose intermediate reasoning traces through their APIs to improve transparency (e.g., Gemini-2.5 and Claude-sonnet). Despite their benefits, we find that these traces can leak membership signals, creating a new privacy threat even without access to the token logits used in prior attacks. In this work, we initiate the first systematic exploration of Membership Inference Attacks (MIAs) on black-box LRMs. Our preliminary analysis shows that LRMs produce confident, recall-like reasoning traces on familiar training member samples but more hesitant, inference-like traces on non-members. The representations of these traces are continuously distributed in the semantic latent space, spanning from familiar to unfamiliar samples. Building on this observation, we propose BlackSpectrum, the first membership inference attack framework targeting black-box LRMs. The key idea is to construct a recall-inference axis in the semantic latent space from representations derived from the exposed traces. By locating where a query sample falls along this axis, the attacker obtains a membership score that estimates how likely the sample is to be a member of the training data. Additionally, to address the limitations of outdated datasets unsuited to modern LRMs, we provide two new datasets, arXivReasoning and BookReasoning, to support future research. Empirically, exposing reasoning traces significantly increases the vulnerability of LRMs to membership inference attacks, yielding large gains in attack performance. Our findings highlight the need for LRM providers to balance transparency of intermediate reasoning traces with privacy preservation.
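The axis-projection idea above can be sketched as follows. This is a minimal, illustrative implementation, assuming trace embeddings have already been obtained from some sentence encoder; the centroid-difference construction of the recall-inference axis and all function names are our own assumptions, not the paper's exact method.

```python
import numpy as np


def recall_inference_axis(member_embs: np.ndarray, nonmember_embs: np.ndarray) -> np.ndarray:
    """Hypothetical construction: the axis is the unit vector pointing from
    the centroid of non-member (inference-like) trace embeddings toward the
    centroid of member (recall-like) trace embeddings."""
    mu_member = member_embs.mean(axis=0)
    mu_nonmember = nonmember_embs.mean(axis=0)
    axis = mu_member - mu_nonmember
    return axis / np.linalg.norm(axis)


def membership_score(query_emb: np.ndarray, axis: np.ndarray, origin: np.ndarray) -> float:
    """Project the query trace embedding onto the axis (relative to an
    origin, e.g. the non-member centroid). A higher score means the trace
    sits closer to the recall-like end, i.e. more likely a training member."""
    return float(np.dot(query_emb - origin, axis))


if __name__ == "__main__":
    # Synthetic stand-ins for trace embeddings from an attacker's shadow data.
    rng = np.random.default_rng(0)
    member_embs = rng.normal(1.0, 0.1, size=(50, 8))      # recall-like region
    nonmember_embs = rng.normal(-1.0, 0.1, size=(50, 8))  # inference-like region

    axis = recall_inference_axis(member_embs, nonmember_embs)
    origin = nonmember_embs.mean(axis=0)

    # A member-like query should score higher than a non-member-like one.
    print(membership_score(member_embs[0], axis, origin))
    print(membership_score(nonmember_embs[0], axis, origin))
```

Thresholding the resulting score (or calibrating it on shadow data) would then yield the binary membership prediction described above.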