Rethinking Model Evaluation as Narrowing the Socio-Technical Gap

The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single model, often referred to as "general-purpose". In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments of whether, and to what extent, human needs in downstream use cases can be satisfied by the given model (the socio-technical gap). Drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and to embrace diverse evaluation methods, acknowledging the trade-off between realism with respect to socio-requirements and the pragmatic cost of conducting the evaluation. By mapping HCI and current NLG evaluation methods, we identify opportunities for LLM evaluation methods to narrow the socio-technical gap, and we pose open questions.

