AI safety is an increasingly urgent concern as the capabilities and adoption of AI systems grow. Existing evolutionary models of AI governance have primarily examined incentives for safe development and effective regulation, typically representing users' trust as a one-shot adoption choice rather than as a dynamic, evolving process shaped by repeated interactions. We instead model trust as reduced monitoring in a repeated, asymmetric interaction between users and AI developers, where checking AI behaviour is costly. Using evolutionary game theory, we study how user trust strategies and developer choices between safe (compliant) and unsafe (non-compliant) AI co-evolve under different monitoring costs and institutional regimes. We complement the infinite-population replicator analysis with stochastic finite-population dynamics and reinforcement learning (Q-learning) simulations. Across these approaches, we find three robust long-run regimes: no adoption with unsafe development, wide adoption of unsafe systems, and wide adoption of safe systems. Only the last is desirable, and it arises when penalties for unsafe behaviour exceed the extra cost of safety and users can still afford to monitor at least occasionally. Our results formally support governance proposals that emphasise transparency, low-cost monitoring, and meaningful sanctions, and they show that neither regulation alone nor blind user trust is sufficient to prevent evolutionary drift towards unsafe or low-adoption outcomes.
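To make the setup concrete, the following is a minimal Python sketch of two-population (asymmetric) replicator dynamics for a user-developer game of this kind. The payoff parameters (b, c_m, v, c_s, p) and the two 2x2 payoff matrices are hypothetical placeholders chosen only to satisfy the condition highlighted above (penalty p exceeding the extra safety cost c_s); they are not the paper's actual payoff structure or calibration.

```python
# Minimal sketch of asymmetric replicator dynamics for a user-developer game.
# All payoff values are illustrative assumptions, not the paper's calibration:
#   b   = user benefit from adopting a safe system
#   c_m = user cost of monitoring
#   v   = developer revenue from deployment
#   c_s = extra cost of safe (compliant) development
#   p   = penalty when unsafe development is detected by a monitoring user
import numpy as np

b, c_m, v, c_s, p = 4.0, 0.5, 3.0, 1.0, 2.0  # note p > c_s

# Rows: user strategies (Monitor, Trust); columns: developer strategies (Safe, Unsafe).
U = np.array([[b - c_m, -c_m],   # monitoring avoids harm from unsafe systems, at cost c_m
              [b,       -b]])    # blind trust pays off only against safe developers
D = np.array([[v - c_s, v - p],  # unsafe developers are sanctioned by monitoring users
              [v - c_s, v]])     # against trusting users, unsafe development goes unpunished

def replicator_step(x, y, dt=0.01):
    """One Euler step: x = share of monitoring users, y = share of safe developers."""
    xs, ys = np.array([x, 1.0 - x]), np.array([y, 1.0 - y])
    fu = U @ ys                      # fitness of Monitor / Trust against current developers
    fd = D.T @ xs                    # fitness of Safe / Unsafe against current users
    x += dt * x * (fu[0] - xs @ fu)  # replicator equation, user population
    y += dt * y * (fd[0] - ys @ fd)  # replicator equation, developer population
    return x, y

x, y = 0.5, 0.5
steps, x_avg, y_avg = 20_000, 0.0, 0.0
for _ in range(steps):
    x, y = replicator_step(x, y)
    x_avg += x / steps
    y_avg += y / steps
# With these placeholder payoffs the best responses are cyclic, so trajectories
# orbit a mixed rest point; the time averages approximate that interior state,
# in which users monitor only occasionally yet most developers stay safe.
print(f"time-averaged monitoring share: {x_avg:.3f}, safe-developer share: {y_avg:.3f}")
```

Under these assumed payoffs the interior rest point sits at a monitoring share of 0.5 and a safe-developer share of 0.875, which the printed time averages approach; this mirrors the abstract's claim that occasional, affordable monitoring combined with p > c_s is what sustains the safe, widely adopted regime.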