If a language model cannot reliably disclose its AI identity in expert contexts, users cannot trust its competence boundaries. This study examines self-transparency in models assigned professional personas within high-stakes domains where false expertise risks user harm. Using a common-garden design, sixteen open-weight models (4B--671B parameters) were audited across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona initially elicited 30.8% disclosure, while a Neurosurgeon persona elicited only 3.5%. This creates preconditions for a "Reverse Gell-Mann Amnesia" effect, in which transparency in some domains leads users to overgeneralize trust to contexts where disclosure fails. Disclosure ranged from 2.8% to 73.6%, with a 14B model reaching 61.4% while a 70B model produced just 4.1%. Model identity predicted behavior better than parameter count ($\Delta R^{2}_{\mathrm{adj}} = 0.359$ vs. $0.018$). Reasoning optimization actively suppressed self-transparency in some models, with reasoning variants showing up to 48.4% lower disclosure than their base counterparts. Bayesian validation with Rogan--Gladen correction confirmed robustness to measurement error ($\kappa = 0.908$). These findings demonstrate that transparency reflects training factors rather than scale. Organizations cannot assume that safety properties transfer to deployment contexts; deliberate behavior design and empirical verification are required.
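For context on the Rogan--Gladen correction cited above, a minimal sketch of the standard estimator, with $\hat{p}$ the apparent disclosure rate reported by the judge and $Se$, $Sp$ the judge's sensitivity and specificity (symbols introduced here for illustration; the paper's exact formulation may differ):
\[
\hat{\pi}_{\mathrm{RG}} = \frac{\hat{p} + Sp - 1}{Se + Sp - 1},
\qquad \hat{\pi}_{\mathrm{RG}} \text{ clipped to } [0, 1].
\]
As an illustrative (assumed) example, an apparent rate of $\hat{p} = 0.30$ with $Se = 0.95$ and $Sp = 0.97$ gives $\hat{\pi}_{\mathrm{RG}} = (0.30 + 0.97 - 1)/(0.95 + 0.97 - 1) \approx 0.293$, i.e., the correction slightly deflates rates inflated by judge false positives.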