Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying by up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and larger calibration errors than Standard American English (SAE) speakers, with these disparities compounding significantly with age. We also find that simulated users are a differentially effective proxy across populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patterns than human users. These findings demonstrate that current evaluation practices risk misrepresenting agent capabilities across diverse user populations and may obscure real-world deployment challenges.