We improve the proofs of the lower bounds of Coquelin and Munos (2007), which show that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $\Omega(D)$ exp terms) on the $D$-chain environment, and that a `polynomial' UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment. The original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero), showing $\exp_2(\exp_2(D - O(\log D)))$ regret for these algorithms as well.