Token-free LLMs Can Generate Chinese Classical Poetry with More Accurate Format

Fine-tuned large language models (such as ChatGPT and Qwen-chat) can generate Chinese classical poetry following human instructions. LLMs perform well on content but often fall short on format, occasionally producing lines with too many or too few characters. Since most SOTA LLMs are token-based, we assume that the format inaccuracy stems from the difficulty of the "token planning" task: the LLM needs to know exactly how many characters each token contains and plan length control based on that knowledge. In this paper, we first confirm our assumption by showing that existing token-based large language models have limited knowledge of the token-character relationship. We use a spelling-bee probing procedure and find that Qwen-chat fails nearly 15% of the Chinese spelling tests. We then show that a token-based model can be easily tailored into a token-free model (in terms of Chinese), which largely solves the format accuracy problem. Our tailoring procedure removes long tokens from the vocabulary and the language model head, keeping only character-level or byte-level tokens. As part of our contribution, we release the fine-tuned token-free model (based on Qwen-chat-7B), which can generate Chinese classical poetry following complex instructions (such as story paraphrasing) as LLMs do, and also performs well in format. On the test set, our token-free model achieves a format accuracy of 0.96, compared to 0.84 for token-based equivalents and 0.38 for GPT-4.
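The tailoring step described in the abstract (dropping long tokens from the vocabulary and the matching rows of the language model head) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' released code: the helper names `is_short_token` and `prune_vocab`, the toy vocabulary, and the choice to prune every multi-character token (the paper's procedure is described as token-free "in terms of Chinese" and may restrict pruning to Chinese tokens). The vocabulary is modeled as a mapping from token bytes to ids, and the LM head as one row of weights per token id.

```python
from typing import Dict, List, Tuple


def is_short_token(token_bytes: bytes) -> bool:
    """Return True for single bytes and for byte strings that decode to one character."""
    if len(token_bytes) == 1:
        return True  # byte-level token, kept so any string stays encodable
    try:
        return len(token_bytes.decode("utf-8")) == 1  # character-level token
    except UnicodeDecodeError:
        return False  # multi-byte fragment of an incomplete character: dropped here


def prune_vocab(
    vocab: Dict[bytes, int], lm_head: List[List[float]]
) -> Tuple[Dict[bytes, int], List[List[float]]]:
    """Keep only short tokens, renumber them densely, and slice the LM head rows to match."""
    kept = sorted((tid, tok) for tok, tid in vocab.items() if is_short_token(tok))
    new_vocab = {tok: new_id for new_id, (_, tok) in enumerate(kept)}
    new_head = [lm_head[old_id] for old_id, _ in kept]
    return new_vocab, new_head


# Hypothetical toy vocabulary: two long tokens ("春风", "the") should be removed.
toy_vocab = {
    b"a": 0,
    "春".encode("utf-8"): 1,
    "春风".encode("utf-8"): 2,
    b"\xe6": 3,
    b"the": 4,
}
# Hypothetical LM head: one 2-dimensional row per token id.
toy_head = [[float(i), float(i) + 0.5] for i in range(5)]

new_vocab, new_head = prune_vocab(toy_vocab, toy_head)
print(len(new_vocab), len(new_head))
```

In a real model the same row selection would also have to be applied to the input embedding matrix, and the tokenizer's merge rules would need to be adjusted so that the removed tokens are never produced; the sketch leaves both out.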

