Standard Large Language Models (LLMs) struggle to handle dialogues with long contexts due to efficiency and consistency issues. We observe that dialogue contexts are highly structured, and that the special \textit{End-of-Utterance} (EoU) token in dialogues has the potential to aggregate information. We refer to EoU tokens as ``conversational attention sinks'' (conv-attn sinks). Accordingly, we introduce StreamingDialogue, which compresses long dialogue history into conv-attn sinks with minimal losses, reducing computational complexity to quadratic in the number of sinks (i.e., the number of utterances) rather than the number of tokens. Current LLMs already demonstrate the ability to handle long context windows, e.g., window sizes of 200K or more; by compressing utterances into EoUs, our method has the potential to handle more than 200K utterances, enabling prolonged dialogue learning. To minimize information losses from reconstruction after compression, we design two learning strategies: short-memory reconstruction (SMR) and long-memory reactivation (LMR). Our method outperforms strong baselines in dialogue tasks and achieves a 4$\times$ speedup while reducing memory usage by 18$\times$ compared to dense attention recomputation.
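The sparsity pattern described above can be illustrated with a small sketch: each token attends only to earlier conv-attn sinks (EoU positions) plus a short recent window, so the visible past grows with the number of utterances rather than the number of tokens. This is a minimal illustration under our own assumptions (the `recent` window size and function name are hypothetical), not the authors' exact masking scheme.

```python
import numpy as np

def conv_attn_mask(token_ids, eou_id, recent=8):
    """Build a causal attention mask in which past context is visible
    only through EoU positions (conv-attn sinks) and a short recent
    window. Illustrative sketch only; `recent` is a hypothetical knob.
    mask[i, j] is True iff query position i may attend to key position j.
    """
    n = len(token_ids)
    is_sink = np.array([t == eou_id for t in token_ids])
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):  # causal: only attend to j <= i
            # Distant tokens are reachable only via their utterance's
            # EoU sink; nearby tokens stay directly visible.
            if is_sink[j] or i - j < recent:
                mask[i, j] = True
    return mask

# Two utterances terminated by EoU (id 0), then a new utterance.
tokens = [1, 2, 0, 3, 4, 0, 5]
m = conv_attn_mask(tokens, eou_id=0, recent=2)
```

With a full causal mask the last position would attend to all 7 keys; here it attends only to the two sinks, its immediate neighbor, and itself, and the saving grows with dialogue length.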