Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently \emph{worse} than even relatively small language models like GPT3-Ada at next-token prediction.
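For concreteness, we note the standard senses of the two metrics, which we assume here since the abstract does not define them: top-1 accuracy is the fraction of positions at which the predictor's most likely token equals the true next token, and perplexity is the exponentiated mean negative log-likelihood over $N$ tokens,
\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right).
\]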