Language understanding is a multi-faceted cognitive capability that the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs), the community has witnessed a dramatic shift towards general-purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, and evaluation and analysis are becoming increasingly challenging. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and to pursue a more holistic view of language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model's functional capacity, and provide recommendations for more multi-faceted evaluation protocols.