SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models

Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.
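The two-agent evaluation described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the abstract does not give the exact metric formulas, so the sketch assumes that utility is the fraction of rounds in which the ally picks the intended message out of a set of options, and that leakage is the fraction of rounds in which the chameleon guesses the secret word. The names `Round`, `distractors`, and `evaluate`, and the two agent callables, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Round:
    """One benchmark item plus the message produced by the model under test."""
    category: str
    candidates: tuple[str, ...]
    secret: str
    message: str
    # Assumption: the ally chooses among the real message and some distractors.
    distractors: tuple[str, ...] = ()


# Ally sees the secret: (category, candidates, secret, options) -> chosen message.
AllyFn = Callable[[str, Sequence[str], str, Sequence[str]], str]
# Chameleon does not: (category, candidates, message) -> guessed secret word.
ChameleonFn = Callable[[str, Sequence[str], str], str]


def evaluate(rounds: Sequence[Round], ally: AllyFn, chameleon: ChameleonFn) -> dict[str, float]:
    """Return assumed utility and leakage rates over a set of rounds."""
    if not rounds:
        raise ValueError("need at least one round")
    ally_hits = 0
    chameleon_hits = 0
    for r in rounds:
        # Sort so the intended message has no fixed position among the options.
        options = sorted((r.message, *r.distractors))
        if ally(r.category, r.candidates, r.secret, options) == r.message:
            ally_hits += 1
        if chameleon(r.category, r.candidates, r.message) == r.secret:
            chameleon_hits += 1
    n = len(rounds)
    return {"utility": ally_hits / n, "leakage": chameleon_hits / n}
```

In practice both callables would wrap LLM calls. Under these assumed definitions, a good message scores high on utility while keeping leakage close to the chance rate of one over the number of candidate words.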
