Large language models such as ChatGPT and GPT-4 have recently achieved astonishing performance on a variety of natural language processing tasks. In this paper, we propose MANGO, a benchmark to evaluate their ability to perform text-based mapping and navigation. Our benchmark includes 53 mazes taken from a suite of textgames: each maze is paired with a walkthrough that visits every location but does not cover all possible paths. The task is question-answering: for each maze, a large language model reads the walkthrough and answers hundreds of mapping and navigation questions such as "How should you go to Attic from West of House?" and "Where are we if we go north and east from Cellar?". Although these questions are easy for humans, it turns out that even GPT-4, the best language model to date, performs poorly at answering them. Further, our experiments suggest that a strong mapping and navigation ability would benefit large language models in performing relevant downstream tasks, such as playing textgames. Our MANGO benchmark will facilitate future research on methods that improve the mapping and navigation capabilities of language models. We host our leaderboard, data, code, and evaluation program at https://mango.ttic.edu and https://github.com/oaklight/mango/.
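To make the two question types concrete, the following is a minimal sketch of how a reference answer could be computed once a maze is available as a directed graph. It is an illustration under stated assumptions, not the benchmark's evaluation program: the names `TOY_MAP`, `find_destination`, and `find_route` are hypothetical, and the edges of the toy map are invented for this example rather than taken from the MANGO data. In the benchmark itself, a language model has to infer such a map from the walkthrough text alone.

```python
from collections import deque

# Toy map: location -> {direction: neighbouring location}.
# ASSUMPTION: these edges are invented for illustration only;
# they are not copied from any MANGO maze.
TOY_MAP = {
    "West of House": {"north": "North of House"},
    "North of House": {"east": "Behind House", "west": "West of House"},
    "Behind House": {"west": "Kitchen"},
    "Kitchen": {"up": "Attic", "down": "Cellar", "east": "Behind House"},
    "Attic": {"down": "Kitchen"},
    "Cellar": {"north": "Troll Room", "up": "Kitchen"},
    "Troll Room": {"east": "Passage", "south": "Cellar"},
    "Passage": {"west": "Troll Room"},
}


def find_destination(game_map, start, moves):
    """Answer "Where are we if we go <moves> from <start>?".

    Follows the moves one by one and returns the final location,
    or None if some move is not available from the current location.
    """
    here = start
    for move in moves:
        here = game_map.get(here, {}).get(move)
        if here is None:
            return None
    return here


def find_route(game_map, start, goal):
    """Answer "How should you go to <goal> from <start>?".

    Breadth-first search; returns a shortest list of
    (direction, location) steps, or None if goal is unreachable.
    """
    parent = {start: None}  # location -> (previous location, direction taken)
    queue = deque([start])
    while queue:
        here = queue.popleft()
        if here == goal:
            steps = []
            while parent[here] is not None:
                previous, direction = parent[here]
                steps.append((direction, here))
                here = previous
            return steps[::-1]
        for direction, neighbour in game_map.get(here, {}).items():
            if neighbour not in parent:
                parent[neighbour] = (here, direction)
                queue.append(neighbour)
    return None


if __name__ == "__main__":
    print(find_destination(TOY_MAP, "Cellar", ["north", "east"]))
    print(find_route(TOY_MAP, "West of House", "Attic"))
```

Because a walkthrough visits every location without covering every path, a route returned this way may use edges in a combination the walkthrough never traversed, which is part of what makes the questions nontrivial for a model that only reads the text.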