Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.
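To make the notion of policy-value misalignment concrete, the following is a minimal sketch under our own assumptions, not the definition or code used in this work. It assumes a state counts as misaligned when the move the policy ranks highest differs from the move whose successor state the value function rates best. The names `is_misaligned` and `misalignment_rate`, the dictionary inputs, and the toy numbers are all hypothetical.

```python
def is_misaligned(policy, successor_values):
    """Return True if the policy and the value function prefer different moves.

    policy: dict mapping each legal move to its policy probability.
    successor_values: dict mapping each legal move to the value estimate of the
        state reached by that move, expressed from the perspective of the
        player making the move (an assumption of this sketch).
    """
    policy_move = max(policy, key=policy.get)
    value_move = max(successor_values, key=successor_values.get)
    return policy_move != value_move


def misalignment_rate(states):
    """Fraction of (policy, successor_values) pairs that are misaligned."""
    flags = [is_misaligned(p, v) for p, v in states]
    return sum(flags) / len(flags)


# Toy data: two states with two legal moves each (values are made up).
states = [
    ({"a": 0.7, "b": 0.3}, {"a": 0.9, "b": -0.2}),  # both prefer "a"
    ({"a": 0.6, "b": 0.4}, {"a": -0.5, "b": 0.8}),  # policy prefers "a", value prefers "b"
]
rate = misalignment_rate(states)
```

On this toy data the second state is flagged and the first is not. A real evaluation would query the trained network for both heads and would need a convention for ties and for whose perspective the value is reported from.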