Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains costly in memory and vulnerable to noise. Key-value (KV) caching scales linearly with context length, while external retrieval methods often return lexically similar but causally irrelevant passages. We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval. S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders, and constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan. This design allows the KV cache to be discarded entirely and bounds GPU memory usage by the scan chunk size. At generation time, feature co-activation is used to retrieve compact evidence spans, optionally fused with BM25 for exact lexical matching. Under a unified LongBench evaluation protocol with fixed prompting, decoding, and matched token budgets, S3-Hybrid closely matches full-context inference across multiple model families and improves robustness in several information-dense settings. We also report an engineering limitation of the current prototype, which incurs higher wall-clock latency than optimized full-KV baselines, motivating future kernel-level optimization.
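The indexing and retrieval steps described above can be sketched in a few lines. This is a minimal illustration, not the S3-Attention implementation: the sparse autoencoder is assumed to have already mapped each token position to its top-k active feature identifiers, and all function names, parameters, and data shapes are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sketch of the abstract's pipeline: an inverted index
# from sparse feature ids to token positions, built in one streaming
# pass, and span retrieval by feature co-activation at generation time.

def build_inverted_index(token_features):
    """token_features: list where entry i is the collection of top-k
    sparse feature ids active at token position i (SAE output assumed)."""
    index = defaultdict(list)  # feature id -> token positions, in scan order
    for pos, feats in enumerate(token_features):
        for f in feats:
            index[f].append(pos)
    return index

def retrieve_spans(index, query_features, span_radius=8, top_spans=3):
    """Score each token position by how many of the query's active
    features co-activate there, then return compact evidence spans
    (start, end) centered on the highest-scoring positions."""
    scores = Counter()
    for f in query_features:
        for pos in index.get(f, ()):
            scores[pos] += 1
    return [
        (max(0, pos - span_radius), pos + span_radius)
        for pos, _ in scores.most_common(top_spans)
    ]

# Toy usage: four token positions, each with its active feature ids.
idx = build_inverted_index([{1, 2}, {2, 3}, {5}, {1, 3}])
spans = retrieve_spans(idx, query_features={2, 3}, span_radius=1, top_spans=1)
```

In the full system, the abstract notes these co-activation scores can optionally be fused with BM25 scores for exact lexical matching; a simple weighted sum of the two rankings would be one way to realize that fusion.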