The primary aim of this manuscript is to underscore a significant limitation in current deep learning models, particularly vision models. Unlike human vision, which efficiently selects only the essential visual areas for further processing, leading to high speed and low energy consumption, deep vision models process the entire image. In this work, we examine this issue from a broader perspective and propose two solutions that could pave the way for the next generation of more efficient vision models. In the first solution, convolution and pooling operations are selectively applied to altered regions, with a change map sent to subsequent layers. This map indicates which computations need to be repeated. In the second solution, only the modified regions are processed by a semantic segmentation model, and the resulting segments are inserted into the corresponding areas of the previous output map. The code is available at https://github.com/aliborji/spatial_attention.
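The first solution can be illustrated with a minimal sketch: a single-channel "valid" convolution that recomputes outputs only where a change map touches the receptive field, and emits a new change map for the next layer. The function names, the NumPy implementation, and the threshold parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def conv2d(img, kernel):
    """Dense 'valid' 2D convolution (reference implementation)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def selective_conv2d(img, prev_img, prev_out, kernel, thresh=0.0):
    """Recompute the convolution only where the input changed.

    Returns the updated output map plus a change map at the output
    resolution, which a subsequent layer could consume the same way.
    (Illustrative sketch, not the authors' implementation.)
    """
    kh, kw = kernel.shape
    changed = np.abs(img - prev_img) > thresh   # input-level change map
    out = prev_out.copy()                       # reuse cached activations
    out_change = np.zeros(out.shape, dtype=bool)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Recompute only if the receptive field overlaps a change.
            if changed[i:i+kh, j:j+kw].any():
                out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
                out_change[i, j] = True
    return out, out_change
```

For a single changed pixel and a 3×3 kernel, at most nine output positions fall inside the affected receptive fields, so only those are recomputed; the rest are copied from the cached previous output.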