Enhancing Small Object Encoding in Deep Neural Networks: Introducing Fast&Focused-Net with Volume-wise Dot Product Layer

We introduce Fast&Focused-Net, a deep neural network architecture designed to encode small objects efficiently into fixed-length feature vectors. Unlike conventional Convolutional Neural Networks (CNNs), Fast&Focused-Net is built from a stack of our newly proposed Volume-wise Dot Product (VDP) layers, which are designed to address three inherent limitations of CNNs. First, the effective receptive field of a CNN is often smaller than its theoretical receptive field, which limits how much of the input each feature can actually see. Second, the initial layers of a CNN produce low-dimensional feature vectors, creating a bottleneck for subsequent learning. Third, CNNs incur a high computational cost, particularly because they rely on parameter sharing to capture diverse image regions. The VDP layer, at the heart of Fast&Focused-Net, aims to remedy these issues by covering the information in the entire image patch at reduced computational cost. Experiments show that Fast&Focused-Net performs well across a variety of applications. On small object classification, our network outperformed state-of-the-art methods on CIFAR-10, CIFAR-100, STL-10, SVHN-Cropped, and Fashion-MNIST. On larger image classification, Fast&Focused-Net combined with a transformer encoder (ViT) produced competitive results on OpenImages V6, ImageNet-1K, and Places365. The same combination also achieved leading performance on text recognition across the SVT, IC15, SVTP, and HOST datasets. This paper presents the architecture, its underlying motivation, and extensive empirical evidence suggesting that Fast&Focused-Net is a promising direction for efficient and focused deep learning.
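To make the receptive-field limitation concrete, the following minimal sketch computes the theoretical receptive field of a stack of convolutions using the standard recurrence (each layer adds `(kernel_size - 1)` times the current stride product). It illustrates only the theoretical upper bound that the abstract refers to; it does not measure the effective receptive field and is not the VDP layer. The function names, the 3x3 kernels, and the 32-pixel patch (the side length of a CIFAR image) are assumptions chosen for illustration.

```python
def theoretical_receptive_field(layers):
    """Return the theoretical receptive field, in input pixels, of a layer stack.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    rf = 1    # with no layers, each position sees exactly one input pixel
    jump = 1  # spacing, in input pixels, between adjacent positions of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


def layers_needed_to_cover(patch_size, kernel_size=3, stride=1):
    """Count identical conv layers needed before the receptive field spans the patch."""
    layers = []
    while theoretical_receptive_field(layers) < patch_size:
        layers.append((kernel_size, stride))
    return len(layers)


if __name__ == "__main__":
    # Four stacked 3x3, stride-1 convolutions.
    print(theoretical_receptive_field([(3, 1)] * 4))
    # Number of 3x3, stride-1 layers before one output can see a 32-pixel-wide patch.
    print(layers_needed_to_cover(32))
```

Under these assumptions, a stride-1 stack of 3x3 convolutions needs many layers before a single output position can, even in theory, see a whole 32-pixel patch, and the effective receptive field is reported to be smaller still. That gap is the motivation the abstract gives for a layer that covers the entire patch directly.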
