In the era of big data and large models, automatic annotation of multi-modal data is of great significance for real-world AI-driven applications such as autonomous driving and embodied AI. Unlike traditional closed-set annotation, open-vocabulary annotation is essential for achieving human-level cognition. However, few open-vocabulary auto-labeling systems exist for multi-modal 3D data. In this paper, we introduce OpenAnnotate3D, an open-source, open-vocabulary auto-labeling system that automatically generates 2D masks, 3D masks, and 3D bounding box annotations for vision and point cloud data. Our system integrates the chain-of-thought capabilities of large language models (LLMs) with the cross-modality capabilities of vision-language models (VLMs). To the best of our knowledge, OpenAnnotate3D is one of the pioneering works on open-vocabulary multi-modal 3D auto-labeling. We conduct comprehensive evaluations on both public and in-house real-world datasets. The results show that the system significantly improves annotation efficiency over manual annotation while producing accurate open-vocabulary annotations.