Large Language Models (LLMs) have become essential tools for collaborating with users on a wide range of tasks, so evaluating how well they serve users' needs in real-world scenarios is important. While many benchmarks have been created, they mainly focus on specific, predefined model abilities; few cover how LLMs are actually used by real users. To address this gap, we propose benchmarking LLMs from a user perspective in both dataset construction and evaluation design. First, we collect 1,846 real-world use cases involving 15 LLMs through a user study with 712 participants from 23 countries. These self-reported cases form the User Reported Scenarios (URS) dataset, organized into 7 categories of user intent. Second, on this authentic, multi-cultural dataset, we benchmark 10 LLM services on how effectively they satisfy user needs. Third, we show that our benchmark scores align well with user-reported experience across diverse intents, and both highlight that subjective scenarios are often overlooked. In conclusion, our study proposes benchmarking LLMs from a user-centric perspective, aiming to facilitate evaluations that better reflect real user needs. The benchmark dataset and code are available at https://github.com/Alice1998/URS.