Multi-turn dialogues and context-intensive tasks challenge Large Language Models (LLMs) to integrate long histories without sacrificing generation quality. Although prefix LLMs can better exploit historical context via bidirectional attention over prefix tokens, they are rarely used in practice: multi-turn training requires expanding each conversation into many duplicated triplets, and the bidirectional prefix prevents KV-cache reuse at inference time, driving up cost and latency. To retain the contextual understanding of the prefix mask while preserving the inference-time efficiency of the causal mask, we introduce the Intermittent Semi-working Mask (ISM), a masking scheme that injects sparse bidirectional attention into a causal backbone. ISM alternates bidirectional attention within query segments with unidirectional attention over answer segments, enabling in-context information fusion while preserving global causality. This design eliminates triplet expansion during training and maintains KV-cache reuse during inference, yielding latency comparable to standard causal LLMs. ISM is architecture-agnostic and parameter-free. Across extensive evaluations, ISM outperforms causal baselines not only on multi-turn dialogue but also on context-intensive tasks such as mathematical reasoning.
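To make the masking scheme concrete, the following is a minimal NumPy sketch of how an ISM-style attention mask could be assembled from the description above: start from a standard causal (lower-triangular) mask, then allow bidirectional attention among tokens of the same query segment while answer segments remain strictly causal. The function name, the per-token `segment_ids`, and the per-segment `is_query` flags are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def ism_mask(segment_ids, is_query):
    """Sketch of an Intermittent Semi-working Mask (hypothetical helper).

    segment_ids: per-token segment index (e.g. Q1, A1, Q2, A2 -> 0, 1, 2, 3)
    is_query:    per-segment flag, True for query segments
    Returns a boolean mask where mask[i, j] = True means token i may attend to token j.
    """
    n = len(segment_ids)
    # Causal backbone: every token attends only to itself and earlier tokens.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Inject sparse bidirectional attention: tokens inside the same
    # query segment may also attend to later tokens of that segment.
    for i in range(n):
        for j in range(n):
            if segment_ids[i] == segment_ids[j] and is_query[segment_ids[i]]:
                mask[i, j] = True
    return mask

# Two turns: Q1 Q1 | A1 A1 | Q2 Q2 | A2
mask = ism_mask([0, 0, 1, 1, 2, 2, 3], [True, False, True, False])
```

In this toy layout, the first query token can attend forward within its own query segment, answer tokens stay causal, and no token attends across a segment boundary into the future, so global causality (and hence KV-cache reuse) is preserved.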