This report presents our team's solution for Track 1 of the 2024 ECCV ROAD++ Challenge. Track 1 addresses spatiotemporal agent detection, which aims to construct an "agent tube" for road agents across consecutive video frames. Our solution targets four challenges in this task: extreme-size objects, low-light scenarios, class imbalance, and fine-grained classification. First, we introduce extreme-size object detection heads to improve the detection of large and small objects. Second, we design a dual-stream detection model with a low-light enhancement stream to improve spatiotemporal agent detection in low-light scenes, together with a feature fusion module that integrates features from the different branches. Third, we develop a multi-branch detection framework to mitigate class imbalance and the difficulty of fine-grained classification, and we design a pre-training and fine-tuning approach to optimize this framework. In addition, we apply common data augmentation techniques and improve the loss function and the upsampling operation. Our solution ranks first on the Track 1 test set of the ROAD++ Challenge 2024, achieving an average video-mAP of 30.82%.