Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference, and has been widely adopted in diffusion language models and video generation. In long-context settings, however, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of block-external attention. Our analysis shows that the attention output contributed by tokens outside the current block remains largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses the stable block-external attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined with it as a complementary residual-reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
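The decomposition behind the cached block-external attention can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: attention over the full context is split into a block-external partial (over the frozen KV cache) and a block-internal partial (over the tokens being denoised), which are merged with a standard online-softmax combination. With fixed queries the merge is exact; FlashBlock's approximation is to keep reusing the cached external partial across diffusion steps, since the block's queries drift only slightly. All function and variable names here are illustrative.

```python
import numpy as np

def attn_partial(q, k, v):
    """Unnormalized attention partial: (weighted sum, softmax denominator, max logit)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    m = logits.max(axis=-1, keepdims=True)
    p = np.exp(logits - m)
    return p @ v, p.sum(axis=-1, keepdims=True), m

def merge(o1, s1, m1, o2, s2, m2):
    """Combine two softmax partials (standard online-softmax merge)."""
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    return (a1 * o1 + a2 * o2) / (a1 * s1 + a2 * s2)

def attn(q, k, v):
    """Reference: full softmax attention over the whole context."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(4, d))                                         # queries of the current block
k_ext, v_ext = rng.normal(size=(32, d)), rng.normal(size=(32, d))   # frozen KV cache (block-external)
o_ext, s_ext, m_ext = attn_partial(q, k_ext, v_ext)                 # computed once, then reused

for step in range(3):  # diffusion steps: only block-internal K/V change
    k_blk, v_blk = rng.normal(size=(4, d)), rng.normal(size=(4, d))
    o_blk, s_blk, m_blk = attn_partial(q, k_blk, v_blk)
    out = merge(o_ext, s_ext, m_ext, o_blk, s_blk, m_blk)           # cached external + fresh internal
    ref = attn(q, np.concatenate([k_ext, k_blk]), np.concatenate([v_ext, v_blk]))
    assert np.allclose(out, ref)                                     # exact when queries are unchanged
```

Each step thus touches only the small block-internal KV pair instead of the full growing cache; the external term is a cheap merge of precomputed tensors, which is where the reported attention-time savings come from.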