Transformers Can Navigate Mazes With Multi-Step Prediction

Despite their remarkable success in language modeling, transformers trained to predict the next token in a sequence struggle with long-term planning. This limitation is particularly evident in tasks that require planning multiple steps ahead, such as maze navigation. The standard next-token prediction objective offers no explicit mechanism for predicting multiple steps ahead or for revisiting the path taken so far. In this work, we therefore study whether explicitly predicting multiple steps ahead (and backwards) can improve transformers' maze navigation. We train parameter-matched transformers from scratch, under identical settings, to navigate mazes of varying types and sizes with either standard next-token prediction or MLM-U, an objective that explicitly predicts multiple steps ahead and backwards. We find that MLM-U considerably improves transformers' ability to navigate mazes compared to standard next-token prediction across maze types and complexities. We also find that MLM-U training is 4x more sample efficient and converges 2x faster in terms of GPU training hours than next-token training. Finally, for more complex mazes, we find that MLM-U benefits from scaling to larger transformers. Remarkably, transformers trained with MLM-U outperform larger transformers trained with next-token prediction and additional supervision from A* search traces. We hope these findings underscore the promise of learning objectives to advance transformers' capacity for long-term planning. The code can be found at https://github.com/facebookresearch/maze_navigation_MLMU
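
To make the contrast between the two objectives concrete, the following is a minimal sketch of how training targets might be built from one maze path. The abstract does not define MLM-U, so the sketch assumes it means masking tokens at a rate drawn uniformly from [0, 1] and predicting the masked positions. The names `next_token_targets`, `uniform_rate_mask`, `MASK`, and the toy `path` are hypothetical and are not taken from the released code.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, chosen for illustration


def next_token_targets(tokens):
    """Next-token prediction: each prefix predicts only the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def uniform_rate_mask(tokens, rng):
    """Assumed MLM-U-style masking: draw a rate uniformly from [0, 1],
    then mask each position independently at that rate.

    Returns the masked sequence and a dict {position: original token}.
    Masked positions can lie anywhere in the path, so the visible context
    may sit both before and after a token that must be predicted.
    """
    rate = rng.random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


# Toy path through a grid maze, one token per visited cell (illustrative only).
path = ["(0,0)", "(0,1)", "(1,1)", "(1,2)", "(2,2)"]

pairs = next_token_targets(path)
masked, targets = uniform_rate_mask(path, random.Random(0))

print(pairs[0])
print(masked, targets)
```

Under this assumption, a single masked sample can require the model to fill in several later cells at once, or earlier cells given later ones, whereas each next-token pair supervises exactly one step forward.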