Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO's ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.
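The construct-then-prune loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how subtrajectories are matched or scored, so the example assumes exact matching on fixed-length action windows, a simple reward threshold, and hypothetical names (`Skill`, `extract_skills`, `prune_skills`). It also omits the LLM-generated subgoals and instructions that SSO attaches to each skill.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Skill:
    actions: tuple        # shared action subsequence (assumed representation)
    mean_reward: float    # average summed reward over its occurrences
    usage_rewards: list = field(default_factory=list)  # rewards seen when reused


def extract_skills(trajectories, length=2, min_support=2, min_reward=1.0):
    """Collect action windows that recur across trajectories with high reward."""
    windows = defaultdict(list)  # actions -> [(trajectory index, summed reward)]
    for t_idx, traj in enumerate(trajectories):
        for start in range(len(traj) - length + 1):
            window = traj[start:start + length]
            actions = tuple(a for a, _ in window)
            reward = sum(r for _, r in window)
            windows[actions].append((t_idx, reward))

    skills = []
    for actions, hits in windows.items():
        support = len({t_idx for t_idx, _ in hits})  # distinct trajectories
        mean_reward = sum(r for _, r in hits) / len(hits)
        if support >= min_support and mean_reward >= min_reward:
            skills.append(Skill(actions, mean_reward))
    return sorted(skills, key=lambda s: s.mean_reward, reverse=True)


def prune_skills(skills, min_reward=1.0):
    """Drop skills whose later uses no longer yield high reward."""
    kept = []
    for skill in skills:
        used = skill.usage_rewards
        if used and sum(used) / len(used) < min_reward:
            continue  # skill was tried and stopped paying off
        kept.append(skill)  # unused skills are kept until evidence arrives
    return kept


# Each trajectory is a list of (action, reward) steps; values are made up.
trajectories = [
    [("open door", 0.0), ("pick up key", 1.0), ("unlock chest", 2.0)],
    [("look", 0.0), ("pick up key", 1.0), ("unlock chest", 2.0)],
    [("look", 0.0), ("open door", 0.0), ("wait", 0.0)],
]
skills = extract_skills(trajectories)
```

In this toy data only the window `("pick up key", "unlock chest")` appears in at least two trajectories with high reward, so it is the single extracted skill. Appending low values to its `usage_rewards` and calling `prune_skills` would remove it, which mirrors the refinement step in spirit only.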