Large-scale AI systems that combine search and learning have reached superhuman levels of performance in game playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment: for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move selection and value estimation in AlphaZero's objective. Second, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly even though its policy plays an optimal strategy. From these insights we derive VISA-VIS, a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.