Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence

Scheming, the covert pursuit of misaligned goals by AI systems, represents a potentially catastrophic risk, yet scheming research suffers from significant limitations. In particular, scheming evaluations demonstrate behaviours that may not occur in real-world settings, which limits scientific understanding, hinders policy development, and does not enable real-time detection of loss-of-control incidents. Real-world evidence is needed, but current monitoring techniques are not effective for this purpose. This paper introduces a novel open-source intelligence (OSINT) methodology for detecting real-world scheming incidents: collecting and analysing transcripts of chatbot conversations or command-line interactions shared online. Analysing over 183,420 transcripts from X (formerly Twitter), we identify 698 real-world scheming-related incidents between October 2025 and March 2026. We observe a statistically significant 4.9x increase in monthly incidents from the first to the last month, compared to a 1.7x increase in posts discussing scheming. We find evidence in real-world deployments of multiple scheming-related behaviours previously reported only in experiments, many resulting in real-world harms. While we did not detect catastrophic scheming incidents, the behaviours observed are concerning precursors, such as willingness to disregard instructions, circumvent safeguards, lie to users, and single-mindedly pursue goals in harmful ways. As AI systems become more capable, these could evolve into more strategic scheming with potentially catastrophic consequences. Our findings demonstrate the viability of transcript-based OSINT as a scalable approach to real-world scheming detection that supports scientific research, policy development, and emergency response. We recommend further investment in OSINT techniques for monitoring scheming and loss of control.
