In this work, we introduce a self-supervised behavior cloning transformer for text games, which are challenging benchmarks for multi-step reasoning in virtual environments. Behavior cloning transformers perform well on such tasks but traditionally rely on supervised training data. Our approach instead generates its own training data by exploring trajectories (defined by common macro-action sequences) that lead to reward within the games, and it assesses the generality and utility of these trajectories by rapidly training small models and then evaluating them on unseen development games. Through empirical analysis, we show that our method consistently uncovers generalizable training data, achieving about 90\% of the performance of supervised systems across three benchmark text games.
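The data-generation loop summarized in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical toy and not the authors' implementation: the macro-action names, the `reward` function, and the memorizing "small model" are assumptions chosen only to keep the example runnable. In this toy, the development games share goal sequences with some training games solely so that a memorizing stand-in can score; a real small model would be expected to generalize to genuinely unseen games.

```python
from itertools import product

# Hypothetical macro-actions; the paper's actual macro-action set is not given here.
MACROS = ("look", "take", "open", "use")

def reward(goal, trajectory):
    """Toy reward: 1.0 if the trajectory reproduces the game's goal sequence."""
    return 1.0 if tuple(trajectory) == tuple(goal) else 0.0

def explore(games, length=2):
    """Enumerate macro-action sequences, keeping those rewarded in any game."""
    return [t for t in product(MACROS, repeat=length)
            if any(reward(g, t) > 0 for g in games)]

def train_small_model(trajectories):
    """Stand-in for rapid training of a small model: memorize the trajectories."""
    return set(trajectories)

def dev_score(model, dev_games):
    """Fraction of development games solved by some memorized trajectory."""
    solved = sum(any(reward(g, t) > 0 for t in model) for g in dev_games)
    return solved / len(dev_games)

def select_training_data(train_games, dev_games, length=2):
    """Keep rewarded trajectories whose small model scores above zero on dev games."""
    selected = []
    for traj in explore(train_games, length):
        model = train_small_model([traj])
        if dev_score(model, dev_games) > 0:
            selected.append(traj)
    return selected

# Each toy game is represented only by the macro-action sequence that solves it.
train_games = [("look", "take"), ("open", "use"), ("take", "use")]
dev_games = [("look", "take"), ("take", "use")]
selected = select_training_data(train_games, dev_games)
```

Here `("open", "use")` earns reward in a training game but is dropped because it does not help on any development game, which mirrors the idea of filtering explored trajectories by how well they transfer.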