Transformer-based diffusion models offer superior scalability and performance, but they incur high computational overhead because denoising is iterative and the cost of self-attention grows quadratically at high resolutions. In this paper, we propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator. At the software level, DiSC introduces two algorithms: Cached Token Reuse (CTR) and Softmax Thresholding with Sparsity Mask Reuse (ST). CTR translates spatial variations in the input latent difference across steps into token-level reuse decisions, eliminating redundant token computation. ST induces sparsity in attention operations by reusing a previously generated sparsity pattern, leveraging temporal similarity to bypass costly prediction overhead. Together, these algorithms provide resolution-scalable computational benefits and yield a moderately sparse, hybrid dense-sparse workload. To exploit this workload efficiently, we design a specialized hardware architecture and a unified dataflow. The architecture avoids dedicated sparsity-handling components; instead, a hash-based distribution over on-chip memory banks lets DiSC reuse its existing compute engines for sparse operations, exploiting the induced sparsity with minimal hardware overhead. Evaluated on DiT and PixArt-Sigma, DiSC achieves speedups of 3.47-4.74x over the NVIDIA A100 GPU and 2.48-3.50x over the NVIDIA H100 GPU, with energy savings ranging from 46.4% to 68.1%.
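To make the two software-level ideas concrete, the following is a minimal sketch in plain Python, not the DiSC implementation. It assumes that CTR scores each token by the mean absolute change of its latent between consecutive steps and compares that score with a fixed threshold, and that ST keeps attention entries whose softmax probability meets a threshold and renormalizes over the kept entries. The function names, the change metric, the thresholds, and the renormalization are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def ctr_reuse_mask(latent_prev, latent_curr, threshold):
    """Per-token reuse decision (True = reuse the cached result).

    Assumption: the change metric is the mean absolute difference of a
    token's latent between two steps; the actual CTR rule may differ.
    """
    mask = []
    for prev_tok, curr_tok in zip(latent_prev, latent_curr):
        diff = sum(abs(c - p) for p, c in zip(prev_tok, curr_tok)) / len(curr_tok)
        mask.append(diff < threshold)
    return mask


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_threshold_mask(scores, threshold):
    """Sparsity mask: keep entries whose softmax probability >= threshold."""
    return [[p >= threshold for p in softmax(row)] for row in scores]


def masked_attention_probs(scores, mask):
    """Attention probabilities restricted to entries kept by `mask`.

    Assumption: kept entries are renormalized to sum to 1 and pruned
    entries become 0. A row with no kept entries is returned as zeros.
    """
    out = []
    for row, keep in zip(scores, mask):
        kept = [s for s, k in zip(row, keep) if k]
        if not kept:
            out.append([0.0] * len(row))
            continue
        peak = max(kept)
        exps = [math.exp(s - peak) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Two tokens: the first barely changes between steps, the second changes a lot.
reuse = ctr_reuse_mask([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.01], [2.0, 1.0]], 0.1)

# Build the mask once, then apply it to the scores of a later step.
step_mask = softmax_threshold_mask([[0.0, 0.0, 10.0]], 0.01)
later_probs = masked_attention_probs([[0.1, 0.0, 9.5]], step_mask)
```

In this sketch the mask from `softmax_threshold_mask` is computed at one step and passed unchanged to `masked_attention_probs` at a later step, which is the sense in which a mask is reused rather than predicted again.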