Fine-tuned large language models (such as ChatGPT and Qwen-chat) can generate Chinese classical poetry following human instructions. LLMs perform well on content but often fall short on format, occasionally producing lines with too many or too few characters. Since most SOTA LLMs are token-based, we hypothesize that the format inaccuracy stems from the difficulty of the "token planning" task: the LLM needs to know exactly how many characters each token contains and plan length control based on that knowledge. In this paper, we first confirm our hypothesis by showing that existing token-based large language models have limited knowledge of the token-character relationship. Using a spelling-bee probing procedure, we find that Qwen-chat fails nearly 15% of Chinese spelling tests. We then show that a token-based model can be easily tailored into a token-free model (with respect to Chinese), which largely solves the format accuracy problem. Our tailoring procedure removes long tokens from the vocabulary and the language model head, keeping only character-level or byte-level tokens. As part of our contribution, we release the fine-tuned token-free model (based on Qwen-chat-7B), which can generate Chinese classical poetry following complex instructions (such as story paraphrasing), as LLMs do, and also performs well on format. On the test set, our token-free model achieves a format accuracy of 0.96, compared to 0.84 for token-based equivalents and 0.38 for GPT-4.
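The tailoring step described above (dropping long tokens from the vocabulary and the language model head) can be illustrated with a minimal sketch. This is not the released implementation: the names `is_short_token` and `prune_vocab` are hypothetical, the vocabulary is assumed to map byte strings to integer ids (as in byte-level BPE tokenizers), and the language model head is stood in for by a plain list of rows, one per token id. A real model would need the same row selection applied to its weight tensors, and presumably to the input embeddings as well.

```python
from typing import Dict, List, Tuple


def is_short_token(token: bytes) -> bool:
    """Return True for byte-level or character-level tokens.

    A token is kept if it is a single byte, or if it decodes as UTF-8
    to exactly one character (e.g. one Chinese character).
    """
    if len(token) == 1:
        return True
    try:
        return len(token.decode("utf-8")) == 1
    except UnicodeDecodeError:
        # Multi-byte fragments that are not a complete character are dropped.
        return False


def prune_vocab(
    vocab: Dict[bytes, int], lm_head: List[List[float]]
) -> Tuple[Dict[bytes, int], List[List[float]]]:
    """Keep only short tokens and the matching rows of the LM head.

    Kept tokens are re-indexed contiguously in order of their old ids.
    """
    kept = sorted((old_id, tok) for tok, old_id in vocab.items() if is_short_token(tok))
    new_vocab = {tok: new_id for new_id, (_, tok) in enumerate(kept)}
    new_head = [lm_head[old_id] for old_id, _ in kept]
    return new_vocab, new_head


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {
    b"a": 0,                      # single byte: kept
    "春".encode("utf-8"): 1,       # one character: kept
    "春风".encode("utf-8"): 2,     # two characters: removed
    b"\xe6": 3,                   # raw byte: kept
    b"ab": 4,                     # two characters: removed
}
toy_head = [[0.0], [0.1], [0.2], [0.3], [0.4]]

new_vocab, new_head = prune_vocab(toy_vocab, toy_head)
```

With only single-character and single-byte tokens left, every generated token contributes at most one character, so the model no longer has to track how many characters each token contains when controlling line length.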
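The format accuracy figures above can be read as the fraction of generated poems whose lines all have the required number of characters. The abstract does not give the exact metric definition, so the following is only a sketch under that assumption: lines are split on common Chinese and ASCII punctuation, and the helper names (`line_lengths`, `format_ok`, `format_accuracy`) are hypothetical.

```python
import re
from typing import List

# Punctuation and whitespace treated as line separators (an assumption).
_SEPARATORS = re.compile(r"[,。!?;、,.!?;\s]+")


def line_lengths(poem: str) -> List[int]:
    """Return the character count of each line of a poem."""
    return [len(segment) for segment in _SEPARATORS.split(poem) if segment]


def format_ok(poem: str, chars_per_line: int, num_lines: int) -> bool:
    """Check that the poem has num_lines lines of chars_per_line characters each."""
    return line_lengths(poem) == [chars_per_line] * num_lines


def format_accuracy(poems: List[str], chars_per_line: int, num_lines: int) -> float:
    """Fraction of poems that satisfy the requested format."""
    if not poems:
        return 0.0
    passed = sum(format_ok(p, chars_per_line, num_lines) for p in poems)
    return passed / len(poems)


# A well-formed five-character quatrain and a variant with one extra character.
good = "床前明月光,疑是地上霜。举头望明月,低头思故乡。"
bad = "床前明月光,疑是地上的霜。举头望明月,低头思故乡。"

print(line_lengths(good))                   # [5, 5, 5, 5]
print(format_accuracy([good, bad], 5, 4))   # 0.5
```

The second poem fails because its second line has six characters, which is the kind of excess-character error the abstract attributes to token-based generation.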