Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.