We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on the native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision-making agents in complex settings.
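The in-context bandit setup described above can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: the Bernoulli environment, the sufficient-statistics summary, and the prompt format are assumptions for illustration, and the LLM's per-round choice is replaced by a simple greedy stand-in policy rather than a real model call.

```python
import random

def make_bandit(means):
    """Bernoulli multi-armed bandit: pulling arm i returns reward 1
    with probability means[i], else 0."""
    def pull(arm):
        return 1 if random.random() < means[arm] else 0
    return pull

def summarize(counts, successes):
    """Sufficient statistics per arm -- (number of pulls, empirical mean) --
    standing in for the 'externally summarized interaction history'."""
    return [(n, (s / n) if n else 0.0) for n, s in zip(counts, successes)]

def format_prompt(stats):
    """Render the summary as text an LLM agent would see in-context.
    (Hypothetical prompt format, for illustration only.)"""
    lines = [f"Arm {i}: pulled {n} times, average reward {m:.2f}"
             for i, (n, m) in enumerate(stats)]
    return "Summarized history:\n" + "\n".join(lines) + "\nWhich arm next?"

def choose_arm(stats):
    """Stand-in for the LLM's decision: try each arm once, then exploit
    the best empirical mean. A real agent would be queried with
    format_prompt(stats) here instead."""
    for i, (n, _) in enumerate(stats):
        if n == 0:
            return i
    return max(range(len(stats)), key=lambda i: stats[i][1])

random.seed(0)
pull = make_bandit([0.2, 0.5, 0.8])
counts, successes = [0, 0, 0], [0, 0, 0]
for t in range(200):
    arm = choose_arm(summarize(counts, successes))
    reward = pull(arm)
    counts[arm] += 1
    successes[arm] += reward
print("total reward:", sum(successes))
```

The summarization step matters because it keeps the prompt length constant as the horizon grows; passing the raw, unsummarized transcript instead is exactly the variant the abstract reports as failing for all configurations but one.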