A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this overhead through compression techniques, prior work has not explicitly formalized the problem of KV-cache compression in this setting. In this work, we take the first step toward formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use $\Omega(n^2 d)$ memory when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof proceeds by reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction. Finally, we discuss how sparsity priors on visual representations influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
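To make the scaling concrete, here is a minimal accounting sketch (an illustration only, not the paper's construction): under standard attention, the cache at step $t$ holds $t$ key and $t$ value vectors of dimension $d$, so the total cache storage touched across all $n$ generation steps sums to $\Theta(n^2 d)$, matching the order of the stated lower bound.

```python
def kv_cache_footprint(n: int, d: int) -> int:
    """Toy accounting of KV-cache entries touched across n autoregressive steps.

    At step t the cache holds t keys and t values, each of dimension d,
    so the cumulative count is 2*d*(1 + 2 + ... + n) = d*n*(n+1),
    i.e. Theta(n^2 * d). (Hypothetical helper for illustration.)
    """
    total = 0
    for t in range(1, n + 1):
        total += 2 * t * d  # t keys + t values, each a d-dimensional vector
    return total

# Closed form check: 2*d*n*(n+1)/2 = d*n*(n+1)
assert kv_cache_footprint(100, 64) == 64 * 100 * 101
```

The per-step peak cache is only $O(nd)$; the quadratic term arises from the aggregate storage that must persist and be re-read across the full generation sequence, which is the regime the lower bound addresses.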