Many perception systems in mobile computing, autonomous navigation, and AR/VR face strict compute constraints, which are particularly hard to meet with high-resolution input images. Previous works propose nonuniform downsamplers that "learn to zoom" on salient image regions, reducing compute while retaining task-relevant image information. However, for tasks with spatial labels (such as 2D/3D object detection and semantic segmentation), the distortions introduced by such downsampling may harm performance. In this work (LZU), we "learn to zoom" in on the input image, compute spatial features, and then "unzoom" to revert any deformations. To make unzooming efficient and differentiable, we approximate the zooming warp with an invertible piecewise bilinear mapping. LZU can be applied to any task with 2D spatial input and any model with 2D spatial features. We demonstrate this versatility on a variety of tasks and datasets: object detection on Argoverse-HD, semantic segmentation on Cityscapes, and monocular 3D object detection on nuScenes. Interestingly, we observe performance gains even when high-resolution sensor data is unavailable, implying that LZU can also be used to "learn to upsample".
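The idea of an invertible zooming warp can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a separable warp (one monotone piecewise linear map per axis, a simple special case of a piecewise bilinear mapping) and hand-picked saliency weights rather than learned ones. The helper names `knots_from_saliency`, `zoom_points`, and `unzoom_points` are hypothetical.

```python
import numpy as np


def knots_from_saliency(weights):
    """Build matching knots for one axis from positive per-interval weights.

    Assumption for illustration: intervals with larger weights receive a
    larger share of the zoomed axis. Coordinates are normalized to [0, 1].
    """
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or w.size == 0 or np.any(w <= 0):
        raise ValueError("weights must be a non-empty 1D array of positive values")
    src = np.linspace(0.0, 1.0, w.size + 1)        # uniform knots in the original image
    cum = np.cumsum(w)
    dst = np.concatenate(([0.0], cum / cum[-1]))   # strictly increasing knots in the zoomed image
    return src, dst


def zoom_points(xy, knots_x, knots_y):
    """Map original-image points (x, y) to zoomed coordinates, one axis at a time."""
    xy = np.asarray(xy, dtype=float)
    u = np.interp(xy[..., 0], knots_x[0], knots_x[1])
    v = np.interp(xy[..., 1], knots_y[0], knots_y[1])
    return np.stack([u, v], axis=-1)


def unzoom_points(uv, knots_x, knots_y):
    """Invert zoom_points by swapping the roles of the two knot sequences."""
    uv = np.asarray(uv, dtype=float)
    x = np.interp(uv[..., 0], knots_x[1], knots_x[0])
    y = np.interp(uv[..., 1], knots_y[1], knots_y[0])
    return np.stack([x, y], axis=-1)
```

Because both knot sequences are strictly increasing, each per-axis map is a monotone piecewise linear function, and its inverse is obtained simply by exchanging the knot arrays. In a full pipeline the same maps would be used to resample the image before the backbone and the feature map after it; that resampling step is omitted here.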