Q-learning excels at learning from feedback in sequential decision-making tasks, but it requires extensive sampling to achieve significant improvement. Reward shaping is a powerful technique for improving learning efficiency, yet it can introduce biases that degrade agent performance. Potential-based reward shaping is further constrained because it does not allow rewards to be modified based on actions, which may limit its effectiveness in complex environments. Large language models (LLMs), in turn, can achieve zero-shot learning, but this is generally limited to simpler tasks; they also suffer from slow inference and occasionally produce hallucinations. To address these issues, we propose \textbf{LLM-guided Q-learning}, which employs LLMs as a heuristic to aid in learning the Q-function for reinforcement learning. It combines the advantages of both technologies without introducing performance bias. Our theoretical analysis demonstrates that the LLM heuristic term provides action-level guidance, and that the framework can accommodate inaccurate guidance by converting hallucinations into exploration costs. Moreover, the converged Q-function corresponds to the optimal Q-function of the MDP. Experimental results demonstrate that our algorithm enables agents to avoid ineffective exploration, improves sampling efficiency, and is well suited to complex control tasks.
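The claim that inaccurate guidance only costs exploration, while the converged Q-function remains optimal, can be illustrated with a minimal tabular sketch. This is not the algorithm proposed in the paper: it assumes, purely for illustration, that the heuristic is used to seed the Q-table before ordinary Q-learning updates. The five-state chain environment and the `helpful` and `misleading` functions are hypothetical stand-ins for LLM output; no LLM is called.

```python
import random

N_STATES = 5          # chain 0..4; state 4 is terminal and pays reward 1
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9


def step(state, action):
    """Deterministic chain dynamics: returns (next_state, reward, done)."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(heuristic, episodes=2000, alpha=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning whose table is seeded by a heuristic h(s, a).

    Assumption: seeding the table is one simple way to inject action-level
    guidance; the paper's actual heuristic term may be defined differently.
    """
    rng = random.Random(seed)
    # One row per non-terminal state, one column per action.
    q = [[heuristic(s, a) for a in ACTIONS] for s in range(N_STATES - 1)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)                   # explore
            else:
                a = max(ACTIONS, key=lambda x: q[s][x])   # exploit
            s2, r, done = step(s, a)
            # Standard Q-learning target; no bootstrap from the terminal state.
            target = r if done else r + GAMMA * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            if done:
                break
            s = s2
    return q


def greedy_policy(q):
    """Greedy action for each non-terminal state."""
    return [max(ACTIONS, key=lambda a: row[a]) for row in q]


def helpful(s, a):
    """Hypothetical accurate guidance: prefers moving toward the goal."""
    return 1.0 if a == 1 else 0.0


def misleading(s, a):
    """Hypothetical hallucinated guidance: prefers moving away from the goal."""
    return 1.0 if a == 0 else 0.0


if __name__ == "__main__":
    for name, h in (("helpful", helpful), ("misleading", misleading)):
        print(name, greedy_policy(q_learning(h)))
```

In this toy setting the misleading heuristic initially steers the agent away from the goal, but the standard updates overwrite the seeded values, so both runs should end with the same greedy policy. The wrong guidance is paid for in extra early exploration rather than in a biased final policy.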