We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We evaluate different models and learning regimes on lilGym. Our results and analysis show that while existing methods achieve non-trivial performance, lilGym remains a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.
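To illustrate how annotating a statement with an executable program can give an exact reward in any world state, here is a minimal sketch. It is not lilGym's actual API or annotation format: the `Item` and `State` types, the example statement, the `statement_program` function, and the +1/-1 reward values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    """Hypothetical object in the visual environment."""
    color: str
    shape: str


# Assumption: a world state is a list of boxes, each holding a list of items.
State = List[List[Item]]


def statement_program(state: State) -> bool:
    """Hypothetical annotation for the statement
    'There is a box with exactly two blue items.'"""
    return any(
        sum(item.color == "blue" for item in box) == 2
        for box in state
    )


def stop_reward(program: Callable[[State], bool], state: State, target: bool) -> float:
    """Assumed reward when the agent stops: +1.0 if the statement's truth
    value in the current state equals the target value, otherwise -1.0."""
    return 1.0 if program(state) == target else -1.0


# A start state in which the statement is false.
start = [[Item("blue", "circle")], [Item("yellow", "square")]]

# The state after a hypothetical action adds a blue item to the first box.
after_add = [
    [Item("blue", "circle"), Item("blue", "triangle")],
    [Item("yellow", "square")],
]

reward_at_start = stop_reward(statement_program, start, target=True)
reward_after_add = stop_reward(statement_program, after_add, target=True)
```

Because the program can be executed on any state the agent reaches, the reward is computed exactly rather than estimated. Pairing the same program with a different start state or target value yields a different decision process for the same statement.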