Playing with Words: Comparing the Vocabulary and Lexical Richness of ChatGPT and Humans

The introduction of Artificial Intelligence (AI) generative language models such as GPT (Generative Pre-trained Transformer) and tools such as ChatGPT has triggered a revolution that can transform how text is generated. This has many implications. For example, as AI-generated text becomes a significant fraction of the text in many disciplines, will it affect the language capabilities of readers and the training of newer AI tools? Will it affect the evolution of languages? Focusing on one specific aspect of language, words: will the use of tools such as ChatGPT increase or reduce the vocabulary used, or the lexical richness (understood as the number of different words used in a written or oral production), when writing a given text? The answer matters for the words themselves, as those not included in AI-generated content will tend to become less and less popular and may eventually be lost. In this work, we perform an initial comparison of the vocabulary and lexical richness of ChatGPT and humans when performing the same tasks. In more detail, we use two datasets containing answers to different types of questions, given by both ChatGPT and humans. The analysis shows that ChatGPT tends to use fewer distinct words and to have lower lexical richness than humans. These results are very preliminary, and additional datasets and ChatGPT configurations have to be evaluated to draw more general conclusions. Therefore, further research is needed to understand how the use of ChatGPT, and more broadly of generative AI tools, will affect vocabulary and lexical richness in different types of text and languages.
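
To make the notion of lexical richness used above concrete, the following is a minimal sketch of how the number of distinct words in a text could be counted. It is an illustration under stated assumptions, not the procedure used in this work: the regex tokenizer, the function names `tokenize` and `lexical_stats`, and the type-token ratio (distinct words divided by total words, one common normalization) are all choices made for this example, and the two sample strings are invented rather than taken from the datasets.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and extract word tokens.

    Assumption: a word is a run of ASCII letters, optionally followed by
    an apostrophe and more letters (e.g. "don't"). Real studies may use
    a proper tokenizer and lemmatization instead.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_stats(text: str) -> dict:
    """Return total tokens, distinct words, and the type-token ratio."""
    tokens = tokenize(text)
    total = len(tokens)
    distinct = len(set(tokens))
    # Type-token ratio; defined as 0.0 for empty input to avoid dividing by zero.
    ttr = distinct / total if total else 0.0
    return {"tokens": total, "distinct": distinct, "ttr": ttr}


if __name__ == "__main__":
    # Hypothetical sample texts, invented for illustration only.
    samples = {
        "text_a": "The quick brown fox jumps over the lazy dog",
        "text_b": "The cat sat on the mat and the cat slept",
    }
    for name, text in samples.items():
        print(name, lexical_stats(text))
```

Note that the raw count of distinct words, and the type-token ratio as well, depend on text length, so a comparison between two sources is only meaningful when the texts are of similar size or when a length-corrected measure is used.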
