Satellite video object detection (SVOD) for oriented and fine-grained objects plays an important role in satellite applications. Most existing SVOD methods focus on only one or a few coarse-grained categories of moving objects and represent them with horizontal bounding boxes, so they struggle to extract complete, accurate, and consistent object information across an entire satellite video. In this paper, we propose a satellite video object detection framework based on Temporal Consistency Learning (TCL). TCL detects oriented and fine-grained objects by exploiting the rich temporal context in satellite videos. The framework integrates three key modules: temporal and fine-grained feature aggregation (TFA), structure encoding (SE), and temporal consistency constraint (TCC). The TFA and TCC modules facilitate consistent representation learning across frames, while the SE module encodes both appearance and structural information for precise fine-grained recognition. Experimental results on the SAT-MTB benchmark dataset show that TCL achieves a new state-of-the-art oriented and fine-grained detection accuracy of 47.7% mAP, a 4.8% improvement over the baseline. Furthermore, the TCL framework readily accommodates existing image-based detectors and improves their detection accuracy.
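The limitation of horizontal bounding boxes mentioned above can be illustrated with a small geometric sketch. It is not part of the TCL framework: the `(cx, cy, w, h, angle)` parameterization and the function names `obb_corners`, `enclosing_hbb`, and `fill_ratio` are assumptions chosen for illustration. The sketch computes how much of the tightest horizontal box is actually covered by a rotated, elongated object.

```python
import math


def obb_corners(cx, cy, w, h, angle_rad):
    """Corners of an oriented box centred at (cx, cy), rotated by angle_rad.

    Hypothetical parameterization used only for this illustration.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    half_extents = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset, then translate it to the box centre.
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half_extents]


def enclosing_hbb(corners):
    """Tightest axis-aligned (horizontal) box around a set of points."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


def fill_ratio(w, h, angle_rad):
    """Fraction of the enclosing horizontal box covered by the oriented box."""
    x0, y0, x1, y1 = enclosing_hbb(obb_corners(0.0, 0.0, w, h, angle_rad))
    return (w * h) / ((x1 - x0) * (y1 - y0))


if __name__ == "__main__":
    for degrees in (0, 15, 45):
        ratio = fill_ratio(100.0, 20.0, math.radians(degrees))
        print(f"{degrees:>2} deg: oriented box fills {ratio:.2f} of its horizontal box")
```

Under these assumptions, a 100 x 20 box rotated by 45 degrees covers only about 28% of its enclosing horizontal box; the rest is background, which is one reason an oriented representation can describe such objects more precisely.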