Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding and production. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.