Long text generation, such as novel writing or discourse-level translation with extremely long contexts, presents significant challenges to current language models. Existing methods mainly focus on extending the model's context window through strategies such as length extrapolation, but these approaches demand substantial hardware resources during training and/or inference. Our proposed method, Temp-Lora, takes a different approach: instead of relying on the KV cache to store all context information, it embeds this information directly into the model's parameters. During long text generation, a temporary Lora module is progressively trained on the previously generated text. This approach efficiently preserves contextual knowledge and, because the module is discarded after generation, makes no permanent change to the model's parameters. Extensive experiments on the PG19 language modeling benchmark and the GuoFeng discourse-level translation benchmark validate the effectiveness of Temp-Lora. Our results show that: 1) Temp-Lora substantially enhances generation quality for long texts, as indicated by a 13.2% decrease in perplexity (PPL) on a subset of PG19, and a 29.6% decrease in perplexity along with a 53.2% increase in BLEU score on GuoFeng; 2) Temp-Lora is compatible with and enhances most existing long text generation methods; and 3) Temp-Lora can greatly reduce computational costs by shortening the context window: while still slightly improving generation quality (a 3.8% decrease in PPL), it reduces the FLOPs required for inference by 70.5% and latency by 51.5%.
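The generation loop described in the abstract can be sketched, very roughly, as follows. This is a toy illustration of the control flow only, not the authors' implementation: the names `TempAdapter`, `generate_long_text`, `toy_generate_chunk`, and `window_chunks` are hypothetical, chunk-level granularity is an assumption, and the "training" step merely records the chunk where a real system would run gradient updates on the low-rank weights.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TempAdapter:
    """Hypothetical stand-in for a temporary Lora module.

    It only records the chunks it was "trained" on; a real module would
    update low-rank weights with gradient steps instead.
    """
    seen_chunks: List[str] = field(default_factory=list)

    def train_on(self, chunk: str) -> None:
        # Placeholder for a training step on newly generated text.
        self.seen_chunks.append(chunk)


def generate_long_text(
    generate_chunk: Callable[[List[str], TempAdapter], str],
    num_chunks: int,
    window_chunks: int,
) -> List[str]:
    """Generate text chunk by chunk with a temporary adapter.

    Only the last `window_chunks` chunks are passed as explicit context;
    older text is meant to be carried by the adapter instead.
    """
    if window_chunks < 1:
        raise ValueError("window_chunks must be at least 1")

    adapter = TempAdapter()  # created fresh for this generation run
    output: List[str] = []
    for _ in range(num_chunks):
        context = output[-window_chunks:]        # shortened context window
        chunk = generate_chunk(context, adapter)  # base model + adapter
        output.append(chunk)
        adapter.train_on(chunk)                  # progressive training
    # The adapter is local to this call, so it is discarded on return and
    # nothing outside this function is modified.
    return output


def toy_generate_chunk(context: List[str], adapter: TempAdapter) -> str:
    """Toy generator that reports what it was given (assumption, for demo)."""
    return f"chunk{len(adapter.seen_chunks)}|ctx={len(context)}"
```

Calling `generate_long_text(toy_generate_chunk, num_chunks=4, window_chunks=2)` shows the intended behaviour: the explicit context stops growing once it reaches two chunks, while the adapter keeps accumulating everything generated so far.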