Large Language Models (LLMs) have become essential tools for collaborating with users on a wide range of tasks, so evaluating how well they serve users' needs in real-world scenarios is important. While many benchmarks have been created, they mainly focus on specific, predefined model abilities; few cover how LLMs are actually used by real users. To address this gap, we propose benchmarking LLMs from a user perspective in both dataset construction and evaluation design. First, we collect 1,846 real-world use cases involving 15 LLMs through a user study with 712 participants from 23 countries. These self-reported cases form the User Reported Scenarios (URS) dataset, organized into 7 categories of user intent. Second, on this authentic, multi-cultural dataset, we benchmark 10 LLM services on how effectively they satisfy user needs. Third, we show that our benchmark scores align well with user-reported experience across diverse intents, and both highlight that subjective scenarios are often overlooked. In conclusion, our study proposes benchmarking LLMs from a user-centric perspective, aiming to facilitate evaluations that better reflect real user needs. The benchmark dataset and code are available at https://github.com/Alice1998/URS.