We argue that many Anthropomorphic Misalignment Research (AMR) studies need stronger evidence before they can provide a robust foundation for critical safety decisions, such as model deployment and regulation. By examining failure modes across misalignment concepts such as deception, emergent misalignment, and sycophancy, we show how conceptual ambiguity, non-robust datasets, inadequate experimental design, and insufficient causal interventions can lead to overinterpretation of model behaviors. This position paper offers guidance on evidentiary considerations that can improve methodological rigor in AMR. To that end, we propose a framework of evidence levels and a diagnostic checklist as a concrete call to action. Such shared standards can enable more productive scientific discourse and help ensure that claims about AI risks rest on solid empirical foundations.