Rethinking Model Evaluation as Narrowing the Socio-Technical Gap

The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single model, often referred to as "general-purpose". In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments of whether, and how much, human needs in downstream use cases can be satisfied by the given model (the socio-technical gap). Drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and to embrace diverse evaluation methods, acknowledging the trade-offs between realism with respect to socio-requirements and the pragmatic costs of conducting the evaluation. By mapping HCI and current natural language generation (NLG) evaluation methods, we identify opportunities for new LLM evaluation methods that narrow the socio-technical gap, and we pose open questions.

