The key-value (KV) cache in tensor-attention Transformers presents a significant bottleneck during inference. While prior work analyzes the fundamental space-complexity barriers of standard attention mechanisms [Haris and Onak, 2025], we generalize these barriers to the tensor attention setting. Via a reduction from communication complexity, we derive a memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. Furthermore, we introduce two types of tensor attention cache and establish time-memory trade-offs for two scenarios. Overall, our work provides a theoretical foundation for understanding the time-memory trade-off of KV-cache compression in tensor attention decoding and offers new perspectives for designing more memory-efficient tensor-attention Transformer architectures.
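To make the bottleneck concrete, below is a minimal sketch of one tensor-attention decoding step with a naive KV cache. It assumes a common third-order ("tensor") attention form in which the score for a query involves pairs of cached keys; the function name, the elementwise-product pairing, and the normalization are illustrative assumptions, not the paper's construction. The point it illustrates is that the implicit attention matrix has $n^2$ columns, so a naively materialized tensor cache costs $\Theta(n^2 d)$ memory per step.

```python
import numpy as np

def tensor_attention_step(q, K_cache, V_cache):
    """One decoding step: q is (d,), caches are (n, d) each.

    Scores range over all key pairs (j, l), so the implicit attention
    matrix has n^2 columns -- the memory/time blowup that motivates
    studying KV-cache compression for tensor attention.
    """
    n, d = K_cache.shape
    # Pairwise "tensor" keys: entry (j, l) is the elementwise product
    # k_j * k_l, an (n, n, d) array -- Theta(n^2 d) memory if
    # materialized naively, as done here for clarity.
    K_pairs = K_cache[:, None, :] * K_cache[None, :, :]       # (n, n, d)
    scores = np.einsum('d,jld->jl', q, K_pairs) / np.sqrt(d)  # (n, n)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Pairwise values, combined the same way.
    V_pairs = V_cache[:, None, :] * V_cache[None, :, :]       # (n, n, d)
    return np.einsum('jl,jld->d', weights, V_pairs)           # (d,)

# Usage: after appending the new token's key/value to the caches,
# attend over all n^2 key pairs.
rng = np.random.default_rng(0)
n, d = 8, 4
K_cache = rng.standard_normal((n, d))
V_cache = rng.standard_normal((n, d))
q = rng.standard_normal(d)
out = tensor_attention_step(q, K_cache, V_cache)
print(out.shape)  # (4,)
```

Any practical scheme must either compress this pairwise structure or recompute parts of it on the fly, which is exactly the time-memory trade-off the lower bounds formalize.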