LLMs are increasingly embedded in everyday decision-making, yet their outputs can encode subtle, unintended behaviours that shape user beliefs and actions. We refer to these covert, goal-directed behaviours as hidden intentions, which may arise from training and optimisation artefacts, or be deliberately induced by an adversarial developer, yet remain difficult to detect in practice. We introduce a taxonomy of ten categories of hidden intentions, grounded in social science research and organised by intent, mechanism, context, and impact, shifting attention from surface-level behaviours to design-level strategies of influence. We show how hidden intentions can be easily induced in controlled models, providing both testbeds for evaluation and demonstrations of potential misuse. We systematically assess detection methods, including reasoning and non-reasoning LLM judges, and find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions, where false positives overwhelm precision and false negatives conceal true risks. Stress tests on precision-prevalence and precision-FNR trade-offs reveal why auditing fails without vanishingly small false positive rates or strong priors on manipulation types. Finally, a qualitative case study shows that all ten categories manifest in deployed, state-of-the-art LLMs, emphasising the urgent need for robust frameworks. Our work provides the first systematic analysis of detectability failures of hidden intentions in LLMs under open-world settings, offering a foundation for understanding, inducing, and stress-testing such behaviours, and establishing a flexible taxonomy for anticipating evolving threats and informing governance.
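The precision collapse under low prevalence follows directly from Bayes' rule. A minimal sketch (the detector rates below are illustrative assumptions, not results from this work) shows how a detector that looks strong in balanced benchmarks fails in open-world auditing:

```python
def precision(tpr: float, fpr: float, prevalence: float) -> float:
    """Positive predictive value (precision) via Bayes' rule."""
    true_positives = tpr * prevalence
    false_positives = fpr * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# A seemingly strong detector (95% recall, 5% false positive rate)
# collapses when hidden intentions are rare (0.1% prevalence):
# fewer than 2% of flagged outputs are true positives.
p = precision(tpr=0.95, fpr=0.05, prevalence=0.001)
print(f"{p:.3f}")
```

At low prevalence, precision is governed almost entirely by the false positive rate, which is why auditing requires either vanishingly small FPRs or strong priors on manipulation types.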