The recent development of generative and large language models (LLMs) poses new challenges for model evaluation that both the research community and industry are grappling with. While the versatile capabilities of these models ignite excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single model, often referred to as a ``general-purpose'' model. In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments of whether, and to what extent, a given model can satisfy human needs in downstream use cases (the \textit{socio-technical gap}). Drawing on lessons from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world socio-requirements and to embrace diverse evaluation methods, acknowledging the trade-off between realism with respect to socio-requirements and pragmatic costs. By mapping evaluation methods from HCI and current natural language generation (NLG) research, we identify opportunities for new LLM evaluation methods that narrow the socio-technical gap, and we pose open questions.