Despite rapid progress, LLMs used for sequential decision-making (i.e., LLM agents) still struggle to produce diverse outputs. As a result, they explore insufficiently, converge to sub-optimal solutions, and get stuck in loops. These limitations are especially problematic in environments that require active exploration to gather information before making decisions. Sampling methods such as temperature scaling introduce token-level randomness but fail to produce enough diversity at the sequence level. We analyze LLM exploration in the classic Multi-Armed Bandit (MAB) setting and in the Text Adventure Learning Environment Suite (TALES). We find that current decoding strategies and prompting methods such as Chain-of-Thought and Tree-of-Thought are insufficient for robust exploration. To address this, we introduce DORA Explorer (Diversity-Oriented Ranking of Actions), a training-free framework for improving exploration in LLM agents. DORA generates diverse action candidates, scores them using token log-probabilities, and selects actions using a tunable exploration parameter. DORA achieves UCB-competitive performance on MAB and consistent gains across TALES, e.g., improving Qwen2.5-7B's performance from 29.2% to 45.5% in TextWorld. Our project page is available at https://dora-explore.github.io/.
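The abstract describes the generate, score, and select steps only at a high level, so the following is a minimal sketch of how such a loop might look, not DORA's actual implementation. It assumes that a candidate's score is the mean of its token log-probabilities and that the exploration parameter acts as a softmax temperature over those scores; both choices, and the names `Candidate`, `sequence_score`, and `select_action`, are hypothetical and introduced here for illustration only.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    action: str                  # action text proposed by the LLM
    token_logprobs: List[float]  # per-token log-probabilities for that text


def sequence_score(c: Candidate) -> float:
    # Assumption: length-normalised score (mean token log-probability).
    return sum(c.token_logprobs) / len(c.token_logprobs)


def select_action(candidates: List[Candidate],
                  explore: float = 1.0,
                  rng: Optional[random.Random] = None) -> str:
    """Pick one action from the candidates.

    explore <= 0 : greedy, always the highest-scoring candidate.
    explore  > 0 : sample from softmax(score / explore); larger values
                   spread probability more evenly across candidates.
    """
    rng = rng or random.Random()

    # Keep one entry per distinct action string (first occurrence wins).
    unique = {}
    for c in candidates:
        unique.setdefault(c.action, c)
    pool = list(unique.values())
    scores = [sequence_score(c) for c in pool]

    if explore <= 0:
        best = max(range(len(pool)), key=lambda i: scores[i])
        return pool[best].action

    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp((s - top) / explore) for s in scores]
    return rng.choices(pool, weights=weights, k=1)[0].action


candidates = [
    Candidate("open door", [-0.1, -0.2]),
    Candidate("go north", [-1.0, -1.2]),
    Candidate("take key", [-2.0, -2.5]),
    Candidate("open door", [-0.1, -0.2]),  # duplicate proposal
]
greedy_choice = select_action(candidates, explore=0.0)
```

With `explore=0.0` the sketch always returns the highest-scoring candidate; raising `explore` shifts probability toward lower-scoring actions, which is one plausible way a tunable parameter could trade exploitation for sequence-level diversity.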