If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works in the field claim the emergence of generalised anthropomorphic attributes in these systems (e.g., morality or understanding of natural language), ascribe such attributes to them, or simply assume them. Our goal is not to argue for or against the existence of these attributes, but to point out that these conclusions could be incorrect. To this end, we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically grounded discussion of these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming, in a generalised and substrate-independent way, that these attributes do or do not exist in a system leads to either circular or uninformative conclusions. This holds regardless of the experimenter's viewpoint on the subject, and regardless of whether the outcome shows existence or non-existence. Finally, we propose a 'null' assumption, under which an experiment is set up by assuming LLM non-uniqueness rather than anthropomorphic attributes, and we give examples of its use. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally complete and Turing-complete.
