A User-Centric Benchmark for Evaluating Large Language Models

Large Language Models (LLMs) are essential tools for collaborating with users on a wide range of tasks, so it is important to evaluate how well they serve users' needs in real-world scenarios. Many benchmarks have been created, but they mainly focus on specific, predefined model abilities, and few cover how real users actually intend to use LLMs. To address this oversight, we propose benchmarking LLMs from a user perspective in both dataset construction and evaluation design. First, we collect 1863 real-world use cases involving 15 LLMs from a user study with 712 participants from 23 countries. These self-reported cases form the User Reported Scenarios (URS) dataset, which is categorized into 7 user intents. Second, on this authentic multi-cultural dataset, we benchmark 10 LLM services on their efficacy in satisfying user needs. Third, we show that our benchmark scores align well with user-reported experience in LLM interactions across diverse intents, and that both indicate subjective scenarios have been overlooked. In conclusion, our study proposes benchmarking LLMs from a user-centric perspective, aiming to facilitate evaluations that better reflect real user needs. The benchmark dataset and code are available at https://github.com/Alice1998/URS.

