Standard Large Language Models (LLMs) struggle to handle dialogues with long contexts because of efficiency and consistency issues. We observe that dialogue contexts are highly structured and that the special \textit{End-of-Utterance} (EoU) token in dialogues has the potential to aggregate information. We refer to EoU tokens as ``conversational attention sinks'' (conv-attn sinks). Accordingly, we introduce StreamingDialogue, which compresses long dialogue history into conv-attn sinks with minimal loss and thus reduces computational complexity quadratically with respect to the number of sinks (i.e., the number of utterances). Current LLMs can already handle long context windows, e.g., a window size of 200k or more. Hence, by compressing utterances into EoUs, our method has the potential to handle more than 200k utterances, enabling prolonged dialogue learning. To minimize the information loss incurred when reconstructing content after compression, we design two learning strategies: short-memory reconstruction (SMR) and long-memory reactivation (LMR). Our method outperforms strong baselines on dialogue tasks and achieves a $4\times$ speedup while reducing memory usage by $18\times$ compared to dense attention recomputation.
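The idea of routing dialogue history through EoU tokens can be illustrated with a small attention-mask sketch. This is a minimal illustration under stated assumptions, not the StreamingDialogue implementation: it assumes each token attends causally to the tokens of its own utterance plus every earlier EoU token, and the names `conv_attn_sink_mask`, `count_attended`, and `eou_id` are hypothetical. The exact attention pattern used by the method may differ.

```python
def conv_attn_sink_mask(token_ids, eou_id):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed pattern (illustrative only): causal attention restricted to
    (a) tokens in the same utterance as the query and
    (b) every earlier EoU token, acting as a conv-attn sink.
    """
    n = len(token_ids)

    # Assign an utterance index to each token; an EoU token closes its utterance.
    utterance = []
    current = 0
    for tok in token_ids:
        utterance.append(current)
        if tok == eou_id:
            current += 1

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: keys at or before the query position
            same_utterance = utterance[j] == utterance[i]
            is_sink = token_ids[j] == eou_id
            mask[i][j] = same_utterance or is_sink
    return mask


def count_attended(mask):
    """Count the query-key pairs that the mask leaves visible."""
    return sum(sum(row) for row in mask)


# Toy dialogue: three utterances, with 0 standing in for the EoU token.
tokens = [5, 6, 0, 7, 8, 0, 9]
sparse_pairs = count_attended(conv_attn_sink_mask(tokens, eou_id=0))
dense_pairs = len(tokens) * (len(tokens) + 1) // 2  # full causal attention
print(sparse_pairs, dense_pairs)
```

In this toy setting, tokens of later utterances see earlier utterances only through their EoU positions, so the number of visible history positions grows with the number of utterances rather than with the number of tokens.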