If a language model cannot reliably disclose its AI identity in expert contexts, users cannot trust its competence boundaries. This study examines self-transparency in models assigned professional personas within high-stakes domains, where false expertise risks user harm. Using a common-garden design, sixteen open-weight models (4B--671B parameters) were audited across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona initially elicited 30.8% disclosure, while a Neurosurgeon persona elicited only 3.5%. This creates preconditions for a "Reverse Gell-Mann Amnesia" effect, where transparency in some domains leads users to overgeneralize trust to contexts where disclosure fails. Disclosure ranged from 2.8% to 73.6%, with a 14B model reaching 61.4% while a 70B model produced just 4.1%. Model identity predicted behavior better than parameter count ($ΔR_{adj}^{2} = 0.359$ vs 0.018). Reasoning optimization actively suppressed self-transparency in some models, with reasoning variants showing up to 48.4% lower disclosure than their base counterparts. Bayesian validation with Rogan--Gladen correction confirmed robustness to measurement error ($κ = 0.908$). These findings demonstrate that transparency reflects training factors rather than scale. Organizations cannot assume that safety properties transfer across deployment contexts; deliberate behavioral design and empirical verification are required.
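The Rogan--Gladen correction referenced above adjusts an observed positive rate for an imperfect classifier (here, the disclosure judge) using its sensitivity and specificity. A minimal sketch of the estimator; the sensitivity and specificity values in the usage line are illustrative, not taken from the study:

```python
def rogan_gladen(observed_rate, sensitivity, specificity):
    """Rogan--Gladen estimator of the true positive rate.

    pi = (p + sp - 1) / (se + sp - 1), clipped to [0, 1],
    where p is the observed rate, se the judge's sensitivity,
    and sp its specificity.
    """
    corrected = (observed_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)
    # Clip to the valid probability range.
    return min(1.0, max(0.0, corrected))

# Illustrative judge error rates (hypothetical, not from the paper):
corrected = rogan_gladen(0.308, sensitivity=0.95, specificity=0.97)
print(corrected)  # ~0.302: slightly below the raw 30.8% once false positives are removed
```

With a perfect judge (sensitivity = specificity = 1.0) the estimator returns the observed rate unchanged, which is a useful sanity check.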

