Transformers are neural network models built from multiple layers of self-attention heads, and they have shown enormous potential in natural language processing tasks. Meanwhile, there have been efforts to adapt transformers to visual machine learning tasks, including Vision Transformers and Swin Transformers. Although some researchers have used Vision Transformers for reinforcement learning tasks, their experiments remain small in scale because of the high computational cost. This article presents the first online reinforcement learning scheme based on Swin Transformers: Swin DQN. In contrast to existing research, our approach demonstrates superior performance in experiments on 49 games in the Arcade Learning Environment. The results show that our approach achieves significantly higher maximal evaluation scores than the baseline method in 45 of the 49 games (92%), and higher mean evaluation scores than the baseline method in 40 of the 49 games (82%).