Characterizing State Space Model and Hybrid Language Model Performance with Long Context

Emerging applications such as augmented reality (AR) are driving demand for machine intelligence capable of processing continuous and/or long-context inputs on local devices. However, the currently dominant models based on the Transformer architecture suffer from quadratic computational and memory overhead, which hinders applications that must process long contexts. This has spurred a paradigm shift towards new architectures such as State Space Models (SSMs) and SSM-Transformer hybrid models, which provide near-linear scaling. In recent studies, this near-linear scaling has enabled efficient handling of millions of tokens while delivering high performance. Although such works present promising results, their workload characteristics in terms of computational performance and hardware resource requirements have not yet been thoroughly explored, which limits our understanding of their implications for system-level optimization. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformers, SSMs, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis shows that SSMs are well suited to on-device AI on consumer and embedded GPUs for long-context inference. While Transformers are up to 1.9x faster at short sequences (<8K tokens), SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens), thanks to their linear computational complexity and ~64% smaller memory footprint. Our operator-level analysis reveals that custom SSM kernels such as selective scan, despite being hardware-aware to minimize memory I/O, dominate the inference runtime on edge platforms, accounting for over 55% of latency due to their sequential, element-wise nature. To foster further research, we will open-source our characterization framework.

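To make the scaling contrast in the abstract concrete, the following is a minimal sketch, not the benchmarked kernels. It assumes a naive attention implementation that materializes the full score matrix for one head, and a simplified scalar-state recurrence of the form h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t as a stand-in for selective scan. The function names (`attention_score_entries`, `ssm_scan_steps`, `diagonal_ssm_scan`) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's full score matrix under naive attention.

    Assumption: the seq_len x seq_len matrix is materialized; fused
    attention kernels avoid storing it, but the work stays quadratic.
    """
    return seq_len * seq_len


def ssm_scan_steps(seq_len: int) -> int:
    """Recurrence steps for a scan over the sequence: one per token."""
    return seq_len


def diagonal_ssm_scan(
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    x: Sequence[float],
) -> List[float]:
    """Simplified scalar-state scan (illustrative, not a fused GPU kernel).

    Each step depends on the previous hidden state, so the loop is
    inherently sequential, and every operation is element-wise.
    The state h has a fixed size regardless of sequence length.
    """
    h = 0.0
    ys: List[float] = []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t  # state update uses h from the prior step
        ys.append(c_t * h)       # per-token readout
    return ys


if __name__ == "__main__":
    # Doubling the context quadruples attention entries but only
    # doubles the number of scan steps.
    for n in (8_000, 16_000, 32_000):
        print(n, attention_score_entries(n), ssm_scan_steps(n))
    print(diagonal_ssm_scan([0.5] * 3, [1.0] * 3, [1.0] * 3, [1.0] * 3))
```

The loop in `diagonal_ssm_scan` also hints at why such kernels can dominate latency: the work per token is small and element-wise, but each step waits on the previous one, which limits how much parallel hardware can help.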