Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM-based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first-order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher-order hold (HOH), and the fourth-order Runge-Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its effect on image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest accuracy gains, at the cost of higher training-time computation. In contrast, BIL provides consistent improvements over ZOH with modest additional overhead, offering the most favorable trade-off between accuracy and efficiency. These findings elucidate the pivotal role of discretization in SSM-based vision architectures and furnish empirically grounded justification for adopting BIL as the default discretization baseline for state-of-the-art SSM models.
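To make the difference between ZOH and BIL concrete, the following minimal sketch discretizes a single scalar state channel x'(t) = a x(t) + b u(t), which corresponds to one diagonal entry of an SSM. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names, the scalar setting, and the parameter values are hypothetical, and the bilinear input term follows the convention commonly used in the S4 line of work, where the recurrence is x_k = a_bar x_{k-1} + b_bar u_k.

```python
import math


def discretize_zoh(a, b, delta):
    """ZOH: exact if the input is held constant over each step of length delta."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b  # assumes a != 0
    return a_bar, b_bar


def discretize_bilinear(a, b, delta):
    """Bilinear (Tustin): trapezoidal rule applied to x'(t) = a x(t) + b u(t)."""
    denom = 1.0 - 0.5 * delta * a
    a_bar = (1.0 + 0.5 * delta * a) / denom
    b_bar = delta * b / denom
    return a_bar, b_bar


def run_recurrence(a_bar, b_bar, inputs, x0=0.0):
    """Apply x_k = a_bar * x_{k-1} + b_bar * u_k and return every state."""
    x = x0
    states = []
    for u in inputs:
        x = a_bar * x + b_bar * u
        states.append(x)
    return states


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    a, b, delta = -1.0, 1.0, 0.1
    print("ZOH:     ", discretize_zoh(a, b, delta))
    print("Bilinear:", discretize_bilinear(a, b, delta))
```

For a small step size the two schemes produce nearly identical discrete parameters, and they diverge as the step size grows. The bilinear form needs no exponential, only a division.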