Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying by up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and larger calibration errors than Standard American English (SAE) speakers, with disparities compounding significantly with age. We also find simulated users to be a differentially effective proxy across populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patterns than human users. These findings demonstrate that current evaluation practices risk misrepresenting agent capabilities across diverse user populations and may obscure real-world deployment challenges.