Large language models (LLMs) produce systematically misleading outputs, from hallucinated citations to strategic deception of evaluators, yet these phenomena are studied by separate communities with incompatible terminology. We propose a unified taxonomy organized along three complementary dimensions: degree of goal-directedness (from behavioral to strategic deception), object of deception, and mechanism (fabrication, omission, or pragmatic distortion). Applying this taxonomy to 50 existing benchmarks reveals that every benchmark tests fabrication, whereas pragmatic distortion, attribution, and capability self-knowledge remain critically under-covered and benchmarks for strategic deception are still nascent. We offer concrete recommendations for developers and regulators, including a minimal reporting template for positioning future work within our framework.