Levin Tree Search (LTS) is a search algorithm that makes use of a policy (a probability distribution over actions) and comes with a theoretical guarantee on the number of expansions before reaching a goal node, depending on the quality of the policy. This guarantee can be used as a loss function, which we call the LTS loss, to optimize neural networks representing the policy (LTS+NN). In this work, we show that the neural network can be replaced with parameterized context models originating from the online compression literature (LTS+CM). We show that the LTS loss is convex under this new model, which allows standard convex optimization tools to be used, and we obtain convergence guarantees to the optimal parameters in an online setting for a given set of solution trajectories, guarantees that cannot be provided for neural networks. The new LTS+CM algorithm compares favorably with LTS+NN on several benchmarks: Sokoban (Boxoban), The Witness, and the 24-Sliding Tile puzzle (STP). The difference is particularly large on STP, where LTS+NN fails to solve most of the test instances while LTS+CM solves each test instance in a fraction of a second. Furthermore, we show that LTS+CM is able to learn a policy that solves the Rubik's cube in only a few hundred expansions, which considerably improves upon previous machine learning techniques.
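To make the policy-guided search concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the commonly cited form of LTS, in which nodes are expanded in increasing order of (depth + 1) / π(n), with π(n) the product of the policy's action probabilities along the path from the root; the exact numerator is an assumption here, since the abstract does not state it. The function `levin_tree_search`, the toy line world, and the fixed policy `toy_policy` are all hypothetical names introduced for illustration.

```python
import heapq
import itertools


def levin_tree_search(root, is_goal, successors, policy, max_expansions=10_000):
    """Best-first tree search ordered by (depth + 1) / pi(node).

    Assumed interfaces (hypothetical, for illustration only):
      is_goal(state)    -> bool
      successors(state) -> iterable of (action, child_state)
      policy(state)     -> dict mapping action -> probability
    Returns (list of actions or None, number of expansions).
    """
    tie = itertools.count()  # tie-breaker so states are never compared
    # Heap entries: (cost, tie, state, depth, path probability, actions so far)
    frontier = [(1.0, next(tie), root, 0, 1.0, [])]
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, depth, prob, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, expansions
        expansions += 1
        action_probs = policy(state)
        for action, child in successors(state):
            child_prob = prob * action_probs.get(action, 0.0)
            if child_prob <= 0.0:
                continue  # nodes the policy rules out are never queued
            child_depth = depth + 1
            cost = (child_depth + 1) / child_prob
            heapq.heappush(
                frontier,
                (cost, next(tie), child, child_depth, child_prob, path + [action]),
            )
    return None, expansions


# Toy problem (hypothetical): walk on the integers from 0 to 3.
def toy_successors(state):
    return [("L", state - 1), ("R", state + 1)]


def toy_policy(state):
    # A fixed policy that strongly prefers moving right.
    return {"L": 0.1, "R": 0.9}


path, expansions = levin_tree_search(0, lambda s: s == 3, toy_successors, toy_policy)
```

In this toy run the goal sits at depth 3 with path probability 0.9³, so under the assumed cost the search should need no more than 4 / 0.729 ≈ 5.5 expansions. Because a quantity of this form depends on the policy's probabilities, it can be minimized over policy parameters, which is the role the abstract assigns to the LTS loss.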