In this paper we argue that key, often sensational and misleading, claims regarding the linguistic capabilities of Large Language Models (LLMs) rest on at least two unfounded assumptions: the assumption of language completeness and the assumption of data completeness. Language completeness assumes that a distinct and complete thing such as `a natural language' exists, the essential characteristics of which can be effectively and comprehensively modelled by an LLM. The assumption of data completeness relies on the belief that a language can be quantified and wholly captured by data. Work within the enactive approach to cognitive science makes clear that, rather than a distinct and complete thing, language is a means or way of acting. Languaging is not the kind of thing that admits of a complete or comprehensive modelling. From an enactive perspective, we identify three key characteristics of enacted language that are absent in LLMs, and likely incompatible in principle with current architectures: embodiment, participation, and precariousness. We argue that these absences imply that LLMs are not now, and cannot in their present form be, linguistic agents in the way humans are. We illustrate the point in particular through the phenomenon of `algospeak', a recently described pattern of high-stakes human language activity in heavily controlled online environments. On the basis of these points, we conclude that sensational and misleading claims about LLM agency and capabilities emerge from a deep misconception of both what human language is and what LLMs are.