Mamba is an emerging, complex workload whose mix of short-range and long-range dependencies, nonlinearities, and elementwise computations prevents it from running at near-peak speeds on modern hardware. In particular, Mamba's complex dependency graph makes fusion across its full operator cascade difficult, leaving substantial off-chip inter-operator memory traffic unaddressed. To address these challenges, we propose Mambalaya, a novel reconfigurable accelerator that leverages fusion to overcome these limitations. We use the recently proposed cascade-of-Einsums abstraction to characterize Mamba's full computational structure, then apply the extended Einsum framework to systematically explore inter-Einsum fusion opportunities. This principled approach yields a series of fusion mappings that reduce off-chip inter-Einsum traffic, and the underlying Mambalaya architecture is designed to support them. Mambalaya achieves a per-layer speedup of 4.9$\times$ for prefill and 1.9$\times$ for generation over MARCA. In prefill-dominated scenarios, it achieves up to 1.5$\times$ over a recent fine-grained, memory-aware fusion accelerator for Mamba.