Current AI alignment through RLHF follows a single-directional paradigm in which AI conforms to human preferences while human cognition is treated as fixed. We propose a shift to co-alignment through Bidirectional Cognitive Alignment (BiCA), in which humans and AI mutually adapt. BiCA uses learnable protocols, representation mapping, and KL-budget constraints to keep this co-evolution controlled. In collaborative navigation, BiCA achieved an 85.5% success rate versus 70.3% for the baseline, with 230% better mutual adaptation and 332% better protocol convergence. Emergent protocols outperformed handcrafted ones by 84%, and bidirectional adaptation unexpectedly improved safety (+23% out-of-distribution robustness). The 46% synergy improvement demonstrates that optimal collaboration lies at the intersection, not the union, of human and AI capabilities, validating the shift from single-directional to co-alignment paradigms.
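The abstract does not spell out how the KL-budget constraint is applied, so the following is only a minimal sketch under stated assumptions, not BiCA's implementation. It assumes that one side's behaviour (for example, its distribution over protocol messages) is a discrete distribution, that the previous distribution is strictly positive, and that a proposed update is accepted only up to a fixed KL budget. The helper names `kl_divergence`, `mix`, and `kl_budgeted_update` are hypothetical. When the proposal would exceed the budget, the sketch interpolates between the old and proposed distributions and bisects on the mixing weight. This works because KL(· ‖ old) is convex in its first argument and equals zero at the old distribution, so it is nondecreasing along that line.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(p: Sequence[float], q: Sequence[float], alpha: float) -> List[float]:
    """Convex combination (1 - alpha) * p + alpha * q."""
    return [(1 - alpha) * pi + alpha * qi for pi, qi in zip(p, q)]


def kl_budgeted_update(
    old: Sequence[float],
    proposed: Sequence[float],
    budget: float,
    iters: int = 50,
) -> List[float]:
    """Hypothetical KL-budgeted step: move from `old` toward `proposed`
    as far as possible while keeping KL(new || old) <= budget.

    Assumes `old` is strictly positive and `budget` >= 0.
    """
    # If the full proposal already fits inside the budget, accept it unchanged.
    if kl_divergence(proposed, old) <= budget:
        return list(proposed)
    # Otherwise bisect on the mixing weight; `lo` always satisfies the budget.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_divergence(mix(old, proposed, mid), old) <= budget:
            lo = mid
        else:
            hi = mid
    return mix(old, proposed, lo)


if __name__ == "__main__":
    old = [0.25, 0.25, 0.25, 0.25]       # uniform prior over four messages
    proposed = [0.7, 0.1, 0.1, 0.1]      # aggressive update toward message 0
    new = kl_budgeted_update(old, proposed, budget=0.05)
    print(new, kl_divergence(new, old))
```

In a bidirectional setting, one could apply the same cap separately to the human-side and AI-side updates so that neither partner drifts too far in a single round. That pairing is an assumption about how a KL budget might support "controlled co-evolution", not a description of the paper's procedure.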