Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, "out-loud" reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).