Large Language Models (LLMs) have become essential tools for collaborating with users on a wide range of tasks, so it is important to evaluate how well they serve users' needs in real-world scenarios. Although many benchmarks exist, they mainly target specific, predefined model abilities; few cover how real users actually intend to use LLMs. To address this gap, we propose benchmarking LLMs from a user perspective in both dataset construction and evaluation design. First, we collect 1863 real-world use cases involving 15 LLMs from a user study with 712 participants from 23 countries. These self-reported cases form the User Reported Scenarios (URS) dataset, categorized into 7 user intents. Second, on this authentic, multicultural dataset, we benchmark 10 LLM services on their efficacy in satisfying user needs. Third, we show that our benchmark scores align well with user-reported experience in LLM interactions across diverse intents, and that both highlight how subjective scenarios have been overlooked. In conclusion, our study proposes benchmarking LLMs from a user-centric perspective, aiming to facilitate evaluations that better reflect real user needs. The benchmark dataset and code are available at https://github.com/Alice1998/URS.