Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, "out-loud" reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).