The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single model, often referred to as a "general-purpose" model. In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments of whether, and to what extent, human needs in downstream use cases can be satisfied by the given model (the socio-technical gap). Drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and to embrace diverse evaluation methods, acknowledging the trade-off between realism to socio-requirements and the pragmatic costs of conducting the evaluation. By mapping HCI and current NLG evaluation methods, we identify opportunities for LLM evaluation methods to narrow the socio-technical gap, and we pose open questions.