We propose using validated behavioral hypotheses as a lens for evaluating human-likeness in LLM-based agents. The key idea is simple: if an agent is human-like, a population of such agents should reach the same inferential conclusion as the human population when run through the same experiment. Decades of social science have produced many such validated findings, each anchored to a concrete experimental protocol and robustly established through independent replication. Grounding the evaluation in these findings makes it objective, decomposable, and scalable. We operationalize this lens through HumanStudy-Bench, an open platform that turns published human-subject studies into reusable simulation environments and administers the evaluation to configurable agents. It scores agent-human alignment with two metrics: the Probability Alignment Score (PAS) for inferential agreement and the Effect Consistency Score (ECS) for effect-size agreement. We curated an initial suite of 12 studies with independently replicated hypotheses and evaluated 10 models under 4 agent designs. The results show that agent responses polarize between full replication and complete failure, and that agent design influences alignment more than model scale does, although its effect is non-monotonic.