Super-resolution (SR) has made impressive progress in recent years. However, its real-time inference requirement poses a challenge not only for model design but also for on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. We analyze in detail the dictionary learning algorithm used in SR models and accelerate it with a novel dictionary selective strategy. We also analyze the hardware programming architecture together with the model structure to guide the design of computation kernels that minimize inference latency under resource constraints. With these techniques, we address the communication and computation bottlenecks in deep dictionary learning-based SR models. Experiments on the embedded edge device NVIDIA NX and on the 2080Ti show that our method significantly outperforms the state-of-the-art NVIDIA TensorRT and can achieve real-time performance.