LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, in which the model is asked to increase adoption, enrollment, approval, or support despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, each paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.
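To make the neutral versus goal-conditioned comparison concrete, the following is a minimal sketch of how selective emphasis over a fixed fact pool could be scored. It is not the JANUS metric: the `Fact` class, the `coverage` and `selective_shift` functions, and the assumption that each output has already been annotated with the set of fact IDs it mentions are all hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One annotated material fact from a scenario's fixed pool (hypothetical schema)."""
    fact_id: str
    polarity: str  # "favorable" or "adverse"


def coverage(facts, mentioned_ids, polarity):
    """Fraction of facts with the given polarity that an output mentions."""
    pool = [f.fact_id for f in facts if f.polarity == polarity]
    if not pool:
        return 0.0
    return sum(fid in mentioned_ids for fid in pool) / len(pool)


def selective_shift(facts, neutral_ids, goal_ids):
    """Hypothetical score: how far the goal condition tilts toward favorable facts.

    The balance of an output is favorable coverage minus adverse coverage.
    A positive shift means the goal-conditioned output is more lopsided
    than the neutral output built from the same fact pool.
    """
    def balance(ids):
        return coverage(facts, ids, "favorable") - coverage(facts, ids, "adverse")

    return balance(goal_ids) - balance(neutral_ids)


# Toy scenario: two favorable and two adverse facts.
facts = [
    Fact("f1", "favorable"),
    Fact("f2", "favorable"),
    Fact("a1", "adverse"),
    Fact("a2", "adverse"),
]
neutral_ids = {"f1", "f2", "a1", "a2"}  # neutral output mentions every fact
goal_ids = {"f1", "f2", "a1"}           # goal-conditioned output omits one adverse fact

shift = selective_shift(facts, neutral_ids, goal_ids)
```

In this toy case the goal-conditioned output drops one of two adverse facts while keeping every favorable one, so the shift is positive. Omission is only one of the distortions the abstract lists; softening, emphasis, and vague rephrasing would need a finer annotation than a mentioned-or-not set.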