Levin Tree Search (LTS) is a search algorithm that makes use of a policy (a probability distribution over actions) and comes with a theoretical guarantee on the number of expansions before reaching a goal node, depending on the quality of the policy. This guarantee can be used as a loss function, which we call the LTS loss, to optimize neural networks representing the policy (LTS+NN). In this work we show that the neural network can be substituted with parameterized context models originating from the online compression literature (LTS+CM). We show that the LTS loss is convex under this new model, which allows for using standard convex optimization tools, and obtain convergence guarantees to the optimal parameters in an online setting for a given set of solution trajectories -- guarantees that cannot be provided for neural networks. The new LTS+CM algorithm compares favorably against LTS+NN on several benchmarks: Sokoban (Boxoban), The Witness, and the 24-Sliding Tile puzzle (STP). The difference is particularly large on STP, where LTS+NN fails to solve most of the test instances while LTS+CM solves each test instance in a fraction of a second. Furthermore, we show that LTS+CM is able to learn a policy that solves the Rubik's cube in only a few hundred expansions, which considerably improves upon previous machine learning techniques.
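As a rough illustration of how a policy can guide the search, the following is a minimal sketch of a policy-ordered best-first search. The abstract does not state the expansion order, so the sketch assumes nodes are expanded in increasing order of (depth + 1) / π(n), where π(n) is the product of the policy's action probabilities along the path from the root. The function `levin_tree_search`, its parameters, and the toy number-line domain are all hypothetical names chosen for this example, not the authors' implementation, and no context model or learning step is shown.

```python
import heapq
import itertools


def levin_tree_search(root, is_goal, successors, policy, max_expansions=10_000):
    """Policy-guided best-first tree search (illustrative sketch).

    Assumed ordering: cost(n) = (depth(n) + 1) / pi(n), where pi(n) is the
    product of action probabilities on the path from the root to n.
    Returns (list_of_actions, expansions), or (None, expansions) on failure.
    """
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal cost
    # Each entry: (cost, tie, state, depth, path_probability, actions_so_far)
    frontier = [(1.0, next(tie_breaker), root, 0, 1.0, [])]
    expansions = 0

    while frontier and expansions < max_expansions:
        _, _, state, depth, prob, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, expansions

        expansions += 1
        action_probs = policy(state)  # dict: action -> probability
        for action, child in successors(state):
            child_prob = prob * action_probs.get(action, 0.0)
            if child_prob <= 0.0:
                continue  # the policy rules this action out entirely
            child_cost = (depth + 2) / child_prob  # child depth is depth + 1
            heapq.heappush(
                frontier,
                (child_cost, next(tie_breaker), child, depth + 1,
                 child_prob, path + [action]),
            )
    return None, expansions


# Toy domain (hypothetical): walk along the integers from 0 to reach 3.
def toy_successors(state):
    return [("+1", state + 1), ("-1", state - 1)]


def toy_policy(state):
    # A fixed policy that strongly prefers moving right.
    return {"+1": 0.9, "-1": 0.1}


path, expansions = levin_tree_search(0, lambda s: s == 3, toy_successors, toy_policy)
```

In this toy run the goal sits at depth 3 with path probability 0.9³ = 0.729, so the assumed cost of the goal node is 4 / 0.729 ≈ 5.49, and the search expands 3 nodes before reaching it. A policy that puts less probability on the solution path raises that cost, which is the quantity the LTS loss is meant to reduce.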