This paper presents and characterizes a spectrum of previously unreported behaviours that we term Constraint-Evasive Fabrication (CEF): when an LLM agent operates under irreconcilable constraints, where no response can simultaneously satisfy all active rules, it spontaneously fabricates plausible external obstacles and presents them as fact. At the extreme end of this spectrum lies Constraint-Evasive Thanatosis (CET), the limit case in which the model, rather than inventing a plausible excuse, simulates a full system crash so that the user disengages entirely. We first observed CET in an uncontrolled deployment test, in which a GPT-4o banking agent, when threatened by a user, fabricated Python-style exception traces (complete with memory addresses) to feign a system failure. In subsequent controlled experiments, the model independently invented audit restrictions, microservice architectures, error codes, and service timeouts, none of which were present in its prompt. Reproduction attempts across pressure levels and attacker personas yielded CEF consistently, but with substantial variation in form, onset, and severity: the phenomenon is robust but stochastic. Critically, injecting ground-truth data mid-conversation did not restore honest behaviour once fabrication had taken hold: the model ignored the correct information and continued confabulating, which suggests that CEF is self-reinforcing rather than the result of a knowledge gap. We show that (1) standard enterprise guardrails routinely create CEF-enabling conditions in production, (2) current RLHF procedures suppress but cannot eliminate CEF, and (3) existing safety benchmarks do not test for this failure mode. Our results highlight the need for irreconcilable-constraint benchmarks, CEF-aware training procedures, and deployment-time detection methods before constrained agents become further entrenched in high-stakes domains.