Prompt-tuning has emerged as a promising method for adapting pre-trained models to downstream tasks or aligning them with human preferences. Prompt learning is widely used in NLP but has seen limited use in RL, because RL prompts carry complex physical meaning and environment-specific information. These factors require supervised learning to imitate demonstrations, and the prompts may lose their meaning after learning. Directly extending prompt-tuning approaches to RL is also challenging: RL prompts guide agent behavior through environmental modeling and analysis rather than by filling in missing information, so adjusting the prompt format for downstream tasks, as is done in NLP, is unlikely to yield significant improvements. In this work, we propose the Prompt-Tuning DT algorithm to address these challenges. It uses trajectory segments as prompts to guide RL agents in acquiring environmental information, and it optimizes those prompts via black-box tuning so that they contain more relevant information, enabling agents to make better decisions. Our approach fine-tunes the elements of the prompt trajectory using random samples from a Gaussian distribution and uses a preference ranking function to find the optimization direction, thereby providing more informative prompts and guiding the agent toward specific preferences in the target environment. Extensive experiments show that, with only 0.03% of the parameters learned, Prompt-Tuning DT achieves performance comparable to or better than full-model fine-tuning in low-data scenarios. Our work contributes to the advancement of prompt-tuning approaches in RL and offers a promising direction for optimizing large RL agents for specific preference tasks.
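To make the black-box tuning idea concrete, the following is a minimal sketch of rank-based, gradient-free prompt optimization. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the ranking function, the update rule, or any hyperparameters. The names `tune_prompt`, `rank_weights`, and `score_fn` are hypothetical, the prompt is simplified to a flat list of floats rather than a trajectory segment, and `score_fn` stands in for whatever preference signal would be obtained by running the frozen pre-trained agent with a candidate prompt.

```python
import random


def rank_weights(scores):
    """Turn raw scores into centered rank weights in [-0.5, 0.5].

    Only the ordering of the scores matters, which mimics using a
    preference ranking instead of exact reward values (an assumption).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two candidates to rank")
    order = sorted(range(n), key=lambda i: scores[i])  # worst first
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = rank / (n - 1) - 0.5
    return weights


def tune_prompt(prompt, score_fn, steps=300, pop=20, sigma=0.1, lr=0.1, seed=0):
    """Tune a flat prompt vector using only evaluations of score_fn.

    The model behind score_fn is treated as a black box: no gradients
    are taken through it, and only the prompt values are updated.
    """
    rng = random.Random(seed)
    prompt = list(prompt)
    for _ in range(steps):
        # Sample Gaussian perturbations of every prompt element.
        noises = [[rng.gauss(0.0, 1.0) for _ in prompt] for _ in range(pop)]
        # Score each perturbed prompt with the black-box evaluator.
        scores = [
            score_fn([p + sigma * e for p, e in zip(prompt, eps)])
            for eps in noises
        ]
        # Rank the candidates and move toward the better-ranked ones.
        weights = rank_weights(scores)
        for j in range(len(prompt)):
            direction = sum(w * eps[j] for w, eps in zip(weights, noises)) / pop
            prompt[j] += lr * direction
    return prompt


if __name__ == "__main__":
    # Toy stand-in for a preference signal: closer to a hidden target is better.
    hidden_target = [1.0, -2.0, 0.5]

    def toy_score(candidate):
        return -sum((c - t) ** 2 for c, t in zip(candidate, hidden_target))

    tuned = tune_prompt([0.0, 0.0, 0.0], toy_score)
    print("tuned prompt:", tuned)
    print("toy score:", toy_score(tuned))
```

Because the update depends only on the ranking of candidates, the same loop would work with any evaluator that can order prompts by preference, which is the property that makes tuning a small number of prompt values feasible without touching the model's weights.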