Transformers operate as horizontal token-by-token scanners: at each generation step, they attend to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes, rather than arithmetic operations, dominate inference time. We propose Parallel Hierarchical Operation for TOp-down Networks (PHOTON), a hierarchical autoregressive model that replaces horizontal scanning with vertical, multi-resolution context scanning. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations in parallel. We further introduce recursive generation, which updates only the coarsest latent stream and eliminates bottom-up re-encoding. Experimental results show that PHOTON achieves a better throughput-quality trade-off than competitive Transformer-based language models, with particular advantages in long-context and multi-query tasks. Recursive generation, in particular, reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit of memory.
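To make the vertical, multi-resolution access pattern concrete, the following is a minimal sketch of the bottom-up compression and parallel top-down reconstruction described above; it is not the PHOTON implementation. The module names (`BottomUpEncoder`, `TopDownDecoder`), the block size, and the use of simple linear projections are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' architecture): a bottom-up
# encoder compresses blocks of token states into a coarse, low-rate latent
# stream, and a lightweight top-down decoder expands the coarse stream back
# to token-rate representations for all blocks in parallel.
import torch
import torch.nn as nn


class BottomUpEncoder(nn.Module):
    """Compress `block` consecutive token states into one coarse latent state."""

    def __init__(self, d_model: int, block: int):
        super().__init__()
        self.block = block
        self.proj = nn.Linear(block * d_model, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) with seq_len divisible by `block`
        b, t, d = tokens.shape
        coarse = tokens.reshape(b, t // self.block, self.block * d)
        return self.proj(coarse)  # (batch, seq_len // block, d_model)


class TopDownDecoder(nn.Module):
    """Reconstruct token-rate representations from the coarse latent stream."""

    def __init__(self, d_model: int, block: int):
        super().__init__()
        self.block = block
        self.expand = nn.Linear(d_model, block * d_model)

    def forward(self, coarse: torch.Tensor) -> torch.Tensor:
        # coarse: (batch, n_blocks, d_model); all blocks expand in parallel
        b, n, d = coarse.shape
        return self.expand(coarse).reshape(b, n * self.block, d)


if __name__ == "__main__":
    d_model, block = 64, 8
    enc, dec = BottomUpEncoder(d_model, block), TopDownDecoder(d_model, block)
    x = torch.randn(2, 64, d_model)   # 64 token-level states
    z = enc(x)                        # 8 coarse latent states (low-rate stream)
    y = dec(z)                        # back to 64 token-rate representations
    print(z.shape, y.shape)           # (2, 8, 64), (2, 64, 64)
```

Under this sketch, decode-time state grows with the number of coarse blocks rather than the number of tokens, which is the intuition behind the reduced KV-cache traffic claimed above.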