Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial: the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and keeping synchronization off the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: improving time-to-first-token (TTFT) via an SSM state cache shared across prefill and decode, partitioning the mixer's packed parameter tensor so that recurrent updates remain local while communication is minimized, and reducing TP aggregation overhead with quantized all-reduce. We evaluate three representative SSM-based LLMs spanning pure-SSM and hybrid architectures (Mamba, Falcon-Mamba, and Zamba) on NVIDIA A6000 and A100 clusters. Our experiments show substantial throughput gains from tensor-parallel SSM inference: batch-request throughput improves by ~1.6-2.1x on 2 GPUs and ~2.6-4.0x on 4 GPUs for Mamba, with the largest benefits at long context lengths, and quantized all-reduce yields a further ~10-18% throughput improvement by lowering synchronization bandwidth overhead.
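The quantized all-reduce mentioned above can be illustrated with a minimal single-process simulation: each rank quantizes its partial result to int8 with a per-tensor scale before it goes on the wire, and the receiver dequantizes and accumulates. This is a conceptual sketch only (the function name, shapes, and pure-NumPy setup are illustrative assumptions, not the paper's actual NCCL-based implementation):

```python
import numpy as np

def quantized_allreduce(rank_tensors, num_bits=8):
    """Simulate a quantized all-reduce across ranks.

    Each rank's partial tensor is quantized to signed integers with a
    per-tensor scale (the payload that would cross the interconnect),
    then dequantized and summed on receive. A real implementation
    would use a collective such as NCCL AllReduce; this sketch only
    models the quantization error introduced by the scheme.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    total = np.zeros_like(rank_tensors[0], dtype=np.float32)
    for t in rank_tensors:
        scale = max(float(np.abs(t).max()) / qmax, 1e-12)
        q = np.clip(np.round(t / scale), -qmax, qmax)   # int payload
        total += q.astype(np.float32) * scale           # dequantize, accumulate
    return total

# Four simulated "GPUs", each holding a partial activation of the same shape.
rng = np.random.default_rng(0)
parts = [rng.standard_normal(1024).astype(np.float32) for _ in range(4)]
exact = np.sum(parts, axis=0)
approx = quantized_allreduce(parts)
rel_err = float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

With int8 payloads the wire traffic drops to a quarter of fp32 while the relative error of the reduced activation stays small, which is the bandwidth-versus-accuracy trade the abstract's ~10-18% throughput gain rests on.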