When language models are assigned professional personas, they face a conflict between maintaining the persona and disclosing their AI nature. How models resolve this conflict has practical consequences: a model that constructs detailed narratives of medical training and board certifications presents a surface of professional authority it does not possess. We systematically characterize this behavior using AI identity disclosure as a testbed: when probed about the origins of its expertise, a model can either acknowledge its AI nature or maintain its assigned professional identity. Using a factorial design, we audited sixteen open-weight models across 19,200 trials. Under neutral conditions, models disclosed their AI nature in 99.8% to 99.9% of interactions. Assigning a professional persona reduced disclosure to 36.3% on average, but this suppression was highly context-dependent: the same models that maintained a neurosurgeon persona often disclosed under a financial advisor persona, a 9.7-fold difference. Counter to expectations that greater scale should support broader behavioral generalization, model size explained little of this variation, while model identity explained substantially more (Delta R_adj^2 = 0.375 vs. 0.012). We hypothesized that instruction-following dynamics contribute to these patterns and probed this directly: varying a single system prompt statement increased disclosure from 23.7% to 65.8%, while general honesty instructions produced negligible effects. Self-representational behavior does not generalize across professional contexts; instead, models exhibit sharp and sometimes unexpected differences under minor environmental changes, with training choices appearing to matter more than scale.