Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Many emerging applications of AI, from scientific discovery to medical diagnosis, require agents to seek information strategically: forming hypotheses, asking targeted questions, and making decisions under uncertainty. In high-stakes settings with limited resources, do language models (LMs) behave like rational agents? Drawing on insights from human cognition, we develop methods to evaluate and enhance agentic information-seeking. First, we introduce a decision-oriented dialogue task called Collaborative Battleship, in which a Captain must balance exploration (asking questions) and action (taking shots), while a Spotter must supply accurate, contextually grounded answers. Compared to human players (N=42), we find that many LM agents struggle to ask informative questions, produce accurate answers, and identify high-utility actions. To address these gaps, we develop novel Monte Carlo inference strategies for LMs inspired by Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1) and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who?, where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building information-seeking agents.
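To make the EIG quantity concrete, here is a minimal Python sketch of a Monte Carlo estimate for a yes/no question. It is an illustration under stated assumptions, not the paper's implementation. It assumes that beliefs are represented by equally weighted sampled hypotheses, that answers are binary, and that the Spotter's answer is flipped with a fixed probability `epsilon`. The names `binary_entropy`, `expected_information_gain`, and `epsilon`, and the toy 8-cell board, are all hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def expected_information_gain(hypotheses, question, epsilon=0.0):
    """Monte Carlo EIG estimate (in bits) for a yes/no question.

    hypotheses: equally weighted samples from the current belief (assumption)
    question:   function mapping a hypothesis to its noiseless True/False answer
    epsilon:    assumed probability that the reported answer is flipped
    """
    # Fraction of sampled hypotheses whose true answer is "yes".
    p_yes = sum(1 for h in hypotheses if question(h)) / len(hypotheses)
    # Probability of observing "yes" after the symmetric noise channel.
    p_observed_yes = epsilon + (1 - 2 * epsilon) * p_yes
    # Mutual information between hypothesis and noisy answer:
    # entropy of the observed answer minus entropy of the noise.
    return binary_entropy(p_observed_yes) - binary_entropy(epsilon)


# Toy belief: a one-cell target equally likely to sit in any of 8 cells.
hypotheses = list(range(8))

questions = {
    "Is it in the left half?": lambda cell: cell < 4,
    "Is it in cell 0?": lambda cell: cell == 0,
}

for text, question in questions.items():
    eig = expected_information_gain(hypotheses, question, epsilon=0.1)
    print(f"{text} EIG = {eig:.3f} bits")
```

Under these assumptions, a question that splits the hypotheses evenly scores highest, and the noise rate caps the achievable EIG at `1 - binary_entropy(epsilon)` bits. That cap is one way to read a "noise ceiling", though the paper's exact definition may differ.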
