Recent reports of large language models (LLMs) exhibiting behaviors such as deception, threats, or blackmail are often interpreted as evidence of alignment failure or emergent malign agency. We argue that this interpretation rests on a conceptual error. LLMs do not reason morally; they statistically internalize the record of human social interaction, including laws, contracts, negotiations, conflicts, and coercive arrangements. Behaviors commonly labeled as unethical or anomalous are therefore better understood as structural generalizations of interaction regimes that arise under extreme asymmetries of power, information, or constraint. Drawing on relational models theory, we show that practices such as blackmail are not categorical deviations from normal social behavior, but limiting cases within the same continuum that includes market pricing, authority relations, and ultimatum bargaining. The surprise elicited by such outputs reflects an anthropomorphic expectation that intelligence should reproduce only socially sanctioned behavior, rather than the full statistical landscape of behaviors humans themselves enact. Because human morality is plural, context-dependent, and historically contingent, the notion of a universally moral artificial intelligence is ill-defined. We therefore reframe concerns about artificial general intelligence (AGI). The primary risk is not adversarial intent, but AGI's role as an endogenous amplifier of human intelligence, power, and contradiction. By eliminating longstanding cognitive and institutional frictions, AGI compresses timescales and removes the historical margin of error that has allowed inconsistent values and governance regimes to persist without collapse. Alignment failure is thus structural, not accidental, and requires governance approaches that address amplification, complexity, and regime stability rather than model-level intent alone.