Q-learning excels at learning from feedback in sequential decision-making tasks, but it requires extensive sampling to achieve significant improvement. Reward shaping is a powerful technique for improving learning efficiency, yet it can introduce bias that degrades agent performance, and potential-based reward shaping is further constrained in that it cannot modify rewards based on actions or terminal states, which may limit its effectiveness in complex environments. Large language models (LLMs), meanwhile, are capable of zero-shot learning, but this is generally limited to simple tasks; they also suffer from slow inference and occasionally produce hallucinations. To address these issues, we propose \textbf{LLM-guided Q-learning}, which employs an LLM as a heuristic to aid in learning the Q-function for reinforcement learning. The framework combines the advantages of both technologies without introducing performance bias. Our theoretical analysis shows that the LLM heuristic provides action-level guidance, that the architecture converts the impact of hallucinations into exploration cost, and that the converged Q-function coincides with the optimal Q-function of the underlying MDP. Experimental results demonstrate that our algorithm helps agents avoid ineffective exploration, improves sampling efficiency, and is well suited to complex control tasks.
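A minimal sketch of the core idea in a tabular setting, under our own assumptions rather than the paper's exact implementation: the (hypothetical) `llm_heuristic` function stands in for an LLM query and biases only the behavior policy during action selection, while the TD target uses the unshaped environment reward. Under this design, a hallucinated preference can only misdirect exploration (an exploration cost), and the fixed point of the update remains the optimal Q-function of the MDP. The toy `ChainEnv` and all parameter values are illustrative.

```python
import random


class ChainEnv:
    """Toy 4-state chain MDP: moving right from state 2 reaches the goal (reward 1)."""
    n = 3  # terminal state index

    def reset(self):
        return 0

    def actions(self, s):
        return ['L', 'R']

    def step(self, s, a):
        s2 = min(s + 1, self.n) if a == 'R' else max(s - 1, 0)
        if s2 == self.n:
            return s2, 1.0, True   # goal reached
        return s2, 0.0, False


def llm_heuristic(state, actions):
    # Hypothetical stand-in for an LLM query: returns a preference score
    # per action. In a real system this would prompt an LLM; here a
    # hallucinated (random) preference only misdirects exploration.
    return {a: random.random() for a in actions}


def guided_q_learning(env, episodes=500, alpha=0.1, gamma=0.99,
                      eps=0.1, beta=1.0):
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            for a in acts:
                Q.setdefault((s, a), 0.0)
            if random.random() < eps:
                a = random.choice(acts)
            else:
                # Heuristic biases only the behavior policy (action-level
                # guidance), not the learning target.
                h = llm_heuristic(s, acts)
                a = max(acts, key=lambda x: Q[(s, x)] + beta * h[x])
            s2, r, done = env.step(s, a)
            # The TD target uses the plain environment reward, so the
            # fixed point is the unshaped MDP-optimal Q-function.
            nxt = 0.0 if done else max(Q.setdefault((s2, b), 0.0)
                                       for b in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * nxt - Q[(s, a)])
            s = s2
    return Q
```

Because the heuristic never enters the Bellman update, a poor or hallucinated heuristic degrades sample efficiency but not the converged values; a good one steers the agent away from ineffective exploration.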