Understanding vehicles in images is important for applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the task-specific characteristics of vehicle perception, which may lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates two kinds of structural information for effective masked vehicle appearance reconstruction: the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions. Specifically, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. We further distill more comprehensive knowledge from the large CLIP model, based on the similarity between paired and unpaired vehicle image-text samples, to help achieve a better understanding of vehicles. To pre-train our model, we build a large-scale dataset termed Autobot1M, which contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks validate the effectiveness of VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE.
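To make the notion of masked appearance reconstruction concrete, the following is a minimal sketch of generic MAE-style random patch masking with a reconstruction loss computed only on the masked patches. It is not the VehicleMAE implementation: the function names `random_patch_mask` and `masked_mse`, the 196-patch grid, and the 0.75 mask ratio are illustrative assumptions, and the sketch-line guidance and CLIP distillation terms described above are omitted.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into (visible, masked) lists.

    Hypothetical helper for illustration; mask_ratio is the fraction
    of patches hidden from the encoder.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(ids[:num_masked])
    visible = sorted(ids[num_masked:])
    return visible, masked


def masked_mse(pred, target, masked_ids):
    """Mean squared error over the pixels of masked patches only.

    pred and target are lists of patches; each patch is a flat list
    of pixel values of equal length.
    """
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Assumed setup: a 14 x 14 patch grid (196 patches), 75% masked.
    visible, masked = random_patch_mask(196, 0.75, seed=0)
    print(len(visible), len(masked))
```

In a full pipeline the encoder would see only the visible patches and a decoder would predict the masked ones; restricting the loss to masked patches is the usual choice in masked autoencoding, since the visible patches are already given as input.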