Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression, particularly soft prompt compression, has emerged as a widely studied solution: a trained compressor converts long contexts into shorter memory embeddings. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, requiring the compressor to capture global dependencies and necessitating extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and by empirical observations that memory embeddings spatially specialize relative to the original tokens, we propose Parallelized Iterative Compression (PIC). By simply modifying the Transformer's attention mask, PIC explicitly restricts the receptive field of memory tokens to sequential local chunks, thereby lowering the difficulty of compressor training. Experiments across multiple downstream tasks demonstrate that PIC consistently outperforms competitive baselines, with the advantage particularly pronounced in high-compression scenarios (e.g., relative improvements of 29.8\% in F1 score and 40.7\% in EM score on QA tasks at the $64\times$ compression ratio). Furthermore, PIC significantly expedites training: when training the $16\times$ compressor, it surpasses the peak performance of the competitive baseline while reducing training time by approximately 40\%.
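The chunk-local attention restriction described above can be illustrated with a small sketch. The code below builds a boolean attention mask under an assumed interleaved layout `[chunk_0 | mem_0 | chunk_1 | mem_1 | ...]`; the function name `pic_attention_mask` and the layout are illustrative assumptions, not the paper's actual implementation. Context tokens attend causally within their own chunk, and each group of memory tokens attends only to its own local chunk, never to the full sequence.

```python
def pic_attention_mask(num_chunks, chunk_len, mem_per_chunk):
    """Sketch of a chunk-local attention mask (hypothetical layout:
    [chunk_0 | mem_0 | chunk_1 | mem_1 | ...]).
    mask[q][k] == True means query position q may attend to key position k.
    """
    seg = chunk_len + mem_per_chunk          # length of one chunk + its memory slots
    total = num_chunks * seg
    mask = [[False] * total for _ in range(total)]
    for i in range(num_chunks):
        c0 = i * seg                         # start of chunk i's context tokens
        m0 = c0 + chunk_len                  # start of chunk i's memory tokens
        # Context tokens: causal attention restricted to their own chunk.
        for q in range(c0, m0):
            for k in range(c0, q + 1):
                mask[q][k] = True
        # Memory tokens: receptive field limited to the local chunk
        # (plus earlier memory tokens of the same chunk).
        for q in range(m0, m0 + mem_per_chunk):
            for k in range(c0, q + 1):
                mask[q][k] = True
    return mask
```

Because each memory token's receptive field excludes every other chunk, the compressor never has to model global dependencies, and the chunks can be compressed in parallel.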