Emerging applications such as AR are driving demand for machine intelligence capable of processing continuous and/or long-context inputs on local devices. However, the currently dominant models based on the Transformer architecture suffer from quadratic computational and memory overhead, which hinders applications that must process long contexts. This has spurred a paradigm shift towards new architectures such as State Space Models (SSMs) and SSM-Transformer hybrid models, which provide near-linear scaling. In recent studies, this near-linear scaling has enabled efficient handling of millions of tokens while delivering high performance. Although such works present promising results, their workload characteristics in terms of computational performance and hardware resource requirements have not yet been thoroughly explored, which limits our understanding of their implications for system-level optimization. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformers, SSMs, and hybrid models, specifically for long-context inference on consumer and embedded GPUs. Our analysis shows that SSMs are well suited to on-device AI on consumer and embedded GPUs for long-context inference. While Transformers are up to 1.9x faster at short sequences (<8K tokens), SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens), thanks to their linear computational complexity and ~64% smaller memory footprint. Our operator-level analysis reveals that custom SSM kernels such as selective scan, despite being hardware-aware to minimize memory I/O, dominate the inference runtime on edge platforms, accounting for over 55% of latency due to their sequential, element-wise nature. SSM-Scope is open-sourced at https://github.com/sapmitra/ssm-scope
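The scaling contrast and the "sequential, element-wise" character of the scan can be illustrated with a minimal sketch. It is not the benchmarked kernel or any library's API: the function names, the diagonal-recurrence form, and the input layout are assumptions chosen for illustration, and the attention count applies only to a naive implementation that materializes the full score matrix.

```python
def sequential_scan(a, bx, c):
    """Reference recurrence h_t = a_t * h_{t-1} + bx_t, y_t = <c_t, h_t>.

    a, bx, c are lists of length seq_len, each entry a list of state_dim
    floats. This layout is a hypothetical simplification; real selective-scan
    kernels fuse this loop on the GPU and add batch and channel dimensions.
    """
    state_dim = len(a[0])
    h = [0.0] * state_dim                  # state size does not grow with seq_len
    ys = []
    for a_t, bx_t, c_t in zip(a, bx, c):   # step t needs the state from step t-1
        h = [a_i * h_i + b_i for a_i, h_i, b_i in zip(a_t, h, bx_t)]  # element-wise
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c_t, h)))
    return ys


def attention_score_elements(seq_len):
    """Entries in a fully materialized self-attention score matrix (one head)."""
    return seq_len * seq_len


def scan_state_elements(state_dim):
    """Entries in the recurrent state carried between scan steps."""
    return state_dim
```

Doubling the sequence length quadruples `attention_score_elements` but leaves `scan_state_elements` unchanged, which is the memory argument in the abstract. The loop-carried dependency on `h` is why the scan cannot be trivially parallelized across time steps.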