Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically, through topic choice, imagery, and setting, at rates significantly different from chance, up to 79\%. When told to actively hide the secret, models write \emph{away from} it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to ``focus on instead'' partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
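The binary discrimination test described above can be scored against chance with an exact binomial test. The following is a minimal sketch under stated assumptions, not the evaluation code used in this work: the function names, the 50\% chance level for a two-way choice, and the toy data are all hypothetical and chosen only for illustration.

```python
from math import comb


def binomial_two_sided_p(successes, trials, chance=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than
    the observed one under the null hypothesis of guessing at `chance`.
    """
    pmf = [
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so ties are counted despite float rounding.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def count_correct(guesses, secrets):
    """Count trials where the guesser picked the true secret (assumed scoring rule)."""
    correct = sum(g == s for g, s in zip(guesses, secrets))
    return correct, len(secrets)


# Toy, made-up data: each entry is one story and the guesser's pick
# between the true secret and a distractor word.
secrets = ["ocean", "clock", "forest", "mirror", "ocean", "clock"]
guesses = ["ocean", "clock", "desert", "mirror", "ocean", "clock"]

correct, trials = count_correct(guesses, secrets)
p_value = binomial_two_sided_p(correct, trials)
print(f"accuracy = {correct / trials:.2f}, p = {p_value:.4f}")
```

A two-sided test is used here because a detection rate well below chance is also informative: it would be consistent with the avoidance effect, where the model writes away from the secret.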