Large Language Models (LLMs) excel at many tasks but still struggle with a critical ability for LLM-based agents: asking good questions to resolve ambiguity in user requests. While prior work has explored information-seeking behavior through word games, existing benchmarks lack comprehensive evaluation frameworks that provide both final and intermediate signals based on Information Gain (IG). Moreover, they rarely provide systematic comparisons between models that use chain-of-thought reasoning and those that do not. We propose a multi-turn dialogue framework that quantitatively measures how effectively LLMs gather information through yes/no questions in a hierarchical knowledge graph environment. Our framework employs a triad of interacting LLM agents that ask questions, answer them, and update the hypothesis space. We adopt IG, grounded in Shannon entropy, as the main metric to assess query effectiveness both at each turn and cumulatively. We instantiate our framework in a geographical Guess My City game organized in a five-level taxonomy and evaluate multiple LLM variants under fully and partially observable conditions, with and without Chain-of-Thought reasoning. Our experiments demonstrate that, among the evaluated models, those with explicit reasoning capabilities achieve higher IG per turn and reach solutions in fewer steps, particularly in partially observable settings. Analysis of reasoning traces reveals that smaller models compensate for limited capacity through more aggressive exploration of candidate questions, while larger models exhibit higher assertiveness in selecting optimal queries, generating candidates with greater potential IG.
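To make the IG metric concrete: for a uniform prior over the remaining hypothesis space, the expected IG of a yes/no question equals the entropy of its answer distribution, i.e. the prior entropy minus the answer-weighted entropy of the two resulting subsets. The sketch below is illustrative only (the function name and the uniform-prior assumption are ours, not taken from the paper's exact formulation):

```python
import math

def information_gain(num_candidates: int, num_yes: int) -> float:
    """Expected entropy reduction (in bits) of a yes/no question that is
    true for `num_yes` of `num_candidates` equally likely hypotheses.

    Illustrative sketch under a uniform-prior assumption; the framework's
    agents would estimate these counts from the current hypothesis space.
    """
    n, k = num_candidates, num_yes
    if k == 0 or k == n:
        return 0.0  # the answer is certain, so the question is uninformative
    prior_entropy = math.log2(n)
    # Posterior entropy after "yes" (k candidates) or "no" (n - k candidates),
    # weighted by the probability of each answer.
    expected_posterior = (k / n) * math.log2(k) + ((n - k) / n) * math.log2(n - k)
    return prior_entropy - expected_posterior

# An even split halves the space and yields exactly 1 bit per turn,
# which is the ceiling the question-asking agent is aiming for.
print(information_gain(32, 16))  # → 1.0
```

Under this view, the cumulative IG over a dialogue is bounded by the prior entropy log2(N) of the initial candidate set, which is what makes per-turn IG a natural intermediate signal alongside the final success rate.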