We propose a modular framework for hybrid image restoration that integrates transformer and state-space model (SSM) blocks, with a focus on improving runtime efficiency on edge hardware. While transformers provide strong global modeling through self-attention, their attention kernels incur substantial latency on mobile devices, especially for high-resolution inputs. In contrast, SSMs such as Mamba offer linear-time sequence modeling with lower runtime overhead but may underperform on fine-grained restoration tasks. To balance accuracy and efficiency, we train lightweight SSM blocks as feature-distilled surrogates of transformer blocks and use them to construct hybrid U-Net-style architectures. To automatically discover effective block combinations, we introduce Efficient Network Search (ENS), a multi-objective search strategy that selects task-specific hybrid configurations from pre-aligned components. ENS optimizes restoration quality while penalizing transformer usage; this penalty serves as a lightweight proxy for latency and enables architecture discovery without repeated hardware profiling. On a Snapdragon 8 Elite CPU, the Restormer baseline requires 10119.52 ms for inference. In contrast, ENS-discovered hybrids substantially reduce runtime: ENS-Deblurring runs in 2973 ms (3.4x faster), ENS-Deraining in 5816 ms (1.74x faster), and ENS-Denoising in 8666 ms (1.17x faster), while maintaining competitive restoration quality.
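To make the idea of penalizing transformer usage concrete, the following is a minimal sketch and not the ENS procedure itself. It assumes a weighted-sum scalarization (quality minus a penalty proportional to the fraction of transformer blocks) and an exhaustive search over a four-slot network; the abstract specifies neither. The names `penalized_score`, `search`, `toy_quality`, and `lam`, along with the per-slot gains, are hypothetical and chosen only for illustration. The `speedup` helper simply reproduces the ratios reported above from the stated latencies.

```python
from itertools import product


def transformer_fraction(config):
    """Fraction of block slots assigned to a transformer ('T') block."""
    return sum(1 for block in config if block == "T") / len(config)


def penalized_score(quality, config, lam):
    """Assumed scalarization: quality minus a transformer-usage penalty."""
    return quality - lam * transformer_fraction(config)


def search(num_slots, quality_fn, lam):
    """Score every T/S assignment exhaustively and return the best one."""
    best = max(
        product("TS", repeat=num_slots),
        key=lambda cfg: penalized_score(quality_fn(cfg), cfg, lam),
    )
    return "".join(best)


def toy_quality(config):
    """Hypothetical stand-in for a validation metric such as PSNR in dB.

    Each slot gains a made-up amount when it keeps its transformer block
    instead of the distilled SSM surrogate.
    """
    gains = [0.05, 0.10, 0.40, 0.10]
    return 30.0 + sum(g for g, block in zip(gains, config) if block == "T")


def speedup(baseline_ms, model_ms):
    """Latency ratio of the baseline to a candidate model."""
    return baseline_ms / model_ms


if __name__ == "__main__":
    for lam in (0.0, 1.0, 4.0):
        print(lam, search(4, toy_quality, lam))
    for name, ms in [("Deblurring", 2973), ("Deraining", 5816), ("Denoising", 8666)]:
        print(name, round(speedup(10119.52, ms), 2))
```

In this toy setting, a larger `lam` pushes the selected configuration toward SSM-only blocks, while `lam = 0` keeps every transformer block; a transformer survives only where its assumed quality gain outweighs its share of the penalty.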