COINBench: Moving Beyond Individual Perspectives to Collective Intent Understanding

Understanding human intent is a high-level cognitive challenge for Large Language Models (LLMs), requiring sophisticated reasoning over noisy, conflicting, and non-linear discourse. While LLMs excel at following individual instructions, their ability to distill collective intent, that is, to extract consensus, resolve contradictions, and infer latent trends from multi-source public discussions, remains largely unexplored. To bridge this gap, we introduce COIN-BENCH, a dynamic, live-updating benchmark built from real-world data and designed specifically to evaluate LLMs on collective intent understanding in the consumer domain. Unlike traditional benchmarks that focus on transactional outcomes, COIN-BENCH operationalizes intent as a hierarchical cognitive structure, ranging from explicit scenarios to deep causal reasoning. We implement an evaluation pipeline that combines a rule-based method with an LLM-as-a-Judge approach. The pipeline incorporates COIN-TREE for hierarchical cognitive structuring and COIN-RAG for retrieval-augmented verification, with the aim of expert-level precision in analyzing raw, collective human discussions. An extensive evaluation of 20 state-of-the-art LLMs across four dimensions (depth, breadth, informativeness, and correctness) reveals that while current models can handle surface-level aggregation, they still struggle with the analytical depth required for complex intent synthesis. COIN-BENCH aims to set a new standard for advancing LLMs from passive instruction followers to expert-level analytical agents capable of deciphering the collective voice of the real world. See our project page on COIN-BENCH.
