Large language models (LLMs) are increasingly used to meet user information needs, but their effectiveness in handling user queries that contain various types of ambiguity remains unknown, ultimately risking user trust and satisfaction. To this end, we introduce CLAMBER, a benchmark for evaluating LLMs using a well-organized taxonomy. Building on this taxonomy, we construct ~12K high-quality examples to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries, even when enhanced by chain-of-thought (CoT) and few-shot prompting. These techniques may induce overconfidence in LLMs and yield only marginal gains in ambiguity identification. Furthermore, current LLMs fall short in generating high-quality clarifying questions due to a lack of conflict resolution and inaccurate use of their inherent knowledge. CLAMBER thus provides guidance for, and promotes further research on, proactive and trustworthy LLMs. Our dataset is available at https://github.com/zt991211/CLAMBER