Recent advances in large language models (LLMs), such as GPT-4 and GPT-4o, have demonstrated exceptional performance, especially in resource-rich languages like English, thanks to extensive datasets that enable robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent in these models. Specifically, it examines how the tokenizer's vocabulary, often used to speed up tokenization and reduce token counts but constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of under-trained or untrained tokens, which perpetuate biases and raise serious concerns about data security and ethical standards. We dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify these risks, and we offer strategic solutions to mitigate the associated security and ethical issues. Through this study, we emphasize the critical need to rethink tokenization frameworks to foster more equitable and secure AI technologies. The code and data are available at https://github.com/yeyimilk/LLMGPT4o
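A minimal, self-contained sketch of the phenomenon the abstract describes: when a tokenizer's vocabulary is built independently of the model's training data, some vocabulary entries may never (or rarely) appear during training and remain under-trained. This toy example is illustrative only; it does not reproduce GPT-4o's actual BPE pipeline, and the vocabulary and corpus shown are hypothetical.

```python
from collections import Counter

# Hypothetical tokenizer vocabulary, built from a source other than
# the training corpus (e.g., a separate web crawl).
vocab = ["the", "cat", "sat", "on", "mat", "安全", "漏洞"]

# Toy "training corpus" the model actually sees during training.
training_corpus = "the cat sat on the mat".split()

# Count how often each vocabulary token occurs in the training data.
counts = Counter(tok for tok in training_corpus if tok in vocab)

# Vocabulary entries with zero training occurrences are "untrained"
# tokens: they exist in the embedding table but carry no learned signal.
under_trained = [tok for tok in vocab if counts[tok] == 0]

print(under_trained)
```

In this sketch, the two Chinese tokens end up in `under_trained`, mirroring how non-English entries in a vocabulary can remain unrepresented in training and later produce unreliable or unsafe model behavior.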