Despite the rapid advancement of object detection algorithms, processing high-resolution images on embedded devices remains a significant challenge. In theory, the fully convolutional architecture used by current real-time object detectors can handle any input resolution. In practice, however, the computational cost of processing high-resolution images makes them impractical for real-time use. To address this, real-time object detection models typically downsample the input image before inference, which sacrifices detail and reduces accuracy. In response, we developed Octave-YOLO, designed to process high-resolution images in real time within the constraints of embedded systems. We achieve this through the cross frequency partial network (CFPNet), which divides the input feature map into a low-resolution low-frequency section and a high-resolution high-frequency section. This design allows expensive operations such as convolution bottlenecks and self-attention to run exclusively on the low-resolution feature maps while the high-resolution maps preserve fine detail. Notably, this not only dramatically reduces the computational cost of convolution but also makes it possible to integrate attention modules, which are typically too expensive for real-time applications, at minimal additional cost. We further incorporate depthwise separable convolution into the core building blocks and downsampling layers to reduce latency. Experiments show that Octave-YOLO matches the performance of YOLOv8 while significantly reducing computational demands. For example, at 1080x1080 resolution, Octave-YOLO-N is 1.56 times faster than YOLOv8, achieving nearly the same accuracy on the COCO dataset with approximately 40 percent fewer parameters and FLOPs.
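The core idea behind CFPNet can be illustrated with a minimal NumPy sketch: split the channels of a feature map into a low-frequency branch, which is downsampled before any expensive processing, and a high-frequency branch, which is passed through at full resolution. Since the heavy operation sees only a quarter of the pixels, its cost drops roughly fourfold on that branch. Note that the function names (`cfp_block`, `avg_pool2x`, `upsample2x`), the split ratio `alpha`, and the pooling/upsampling choices here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def avg_pool2x(x):
    # 2x2 average pooling on a (C, H, W) tensor; H and W assumed even
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    # nearest-neighbor 2x upsampling back to the original resolution
    return x.repeat(2, axis=1).repeat(2, axis=2)

def cfp_block(x, alpha=0.5, heavy_op=lambda t: t):
    # Hypothetical sketch of a cross-frequency split:
    # the first alpha*C channels form the low-frequency branch
    # (processed at half resolution), the rest form the
    # high-frequency branch (kept at full resolution).
    c = x.shape[0]
    c_lo = int(c * alpha)
    lo, hi = x[:c_lo], x[c_lo:]
    lo = heavy_op(avg_pool2x(lo))   # expensive ops touch 1/4 of the pixels
    lo = upsample2x(lo)             # restore resolution before merging
    return np.concatenate([lo, hi], axis=0)

x = np.random.rand(8, 16, 16)
y = cfp_block(x)
assert y.shape == x.shape  # output shape matches the input
```

In a real network the `heavy_op` would be a convolution bottleneck or self-attention module, and the two branches would also exchange information, but the sketch captures why the approach saves computation: resolution-dependent cost is paid only on the downsampled branch.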