Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection can be costly and risky, so offline RL becomes particularly challenging when in-domain data is limited. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers that effectively uses pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs with in-domain knowledge effectively, (3) using a non-linear MLP transformation, instead of linear projections, to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original language abilities. Empirical results indicate that $\textbf{LaMo}$ achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and Decision Transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples. Our project website is https://lamo2023.github.io
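Components (2) to (4) of the framework can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python lists instead of a deep learning library, and the names `LoRALinear`, `mlp_embed`, `total_loss`, `rank`, `alpha`, and `lam`, the `tanh` non-linearity, and all dimensions are assumptions chosen for illustration. It shows a frozen weight matrix with a trainable low-rank update, a two-layer non-linear embedding, and a combined objective with a weighted auxiliary language loss.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank, alpha, rng):
        self.W = rand_matrix(d_out, d_in, rng)         # stand-in for pre-trained LM weights (frozen)
        self.A = rand_matrix(rank, d_in, rng)          # trainable, random init
        self.B = [[0.0] * rank for _ in range(d_out)]  # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Only A and B would receive gradients; W stays fixed."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def mlp_embed(x, W1, W2):
    """Non-linear embedding W2 @ tanh(W1 @ x), in place of a single linear projection."""
    hidden = [math.tanh(h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def total_loss(decision_loss, language_loss, lam):
    """Decision (action prediction) loss plus a weighted auxiliary language loss."""
    return decision_loss + lam * language_loss


rng = random.Random(0)
layer = LoRALinear(d_in=8, d_out=8, rank=2, alpha=4.0, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
frozen_out = matvec(layer.W, x)
adapted_out = layer(x)

# Embed a hypothetical 3-dimensional state into 8 dimensions via a 16-unit hidden layer.
state = [0.5, -1.0, 2.0]
embedding = mlp_embed(state, rand_matrix(16, 3, rng), rand_matrix(8, 16, rng))
```

In this toy setting the low-rank factors hold `rank * (d_in + d_out) = 32` trainable values, against 64 for the full weight matrix. The gap widens quickly at realistic layer sizes, which is the usual motivation for LoRA over full-weight fine-tuning.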