3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily due to the labor-intensive process of collecting and annotating 3D scenes. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels, which offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and a low collection cost. Leveraging the rich annotations in SynVL3D, we pre-train a simple and unified Transformer that aligns 3D scenes with language through multi-grained pre-training tasks. Moreover, we propose a synthetic-to-real domain adaptation method applied during downstream fine-tuning to address the domain shift. Extensive experiments verify the effectiveness of our model design, achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.
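To make the idea of multi-grained 3D-text alignment concrete, the following is a minimal sketch of what an InfoNCE-style contrastive objective applied at the object, view, and room levels could look like. This is a hypothetical illustration, not the paper's actual pre-training implementation; the tensor interfaces (`feats_3d`, `feats_text`, the `batch` keys) and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(feats_3d, feats_text, temperature=0.07):
    """Symmetric InfoNCE loss between matched 3D and text embeddings.

    feats_3d, feats_text: (N, D) tensors where row i of each is a
    matched 3D-region/description pair at one granularity
    (object, view, or room level). Hypothetical interface.
    """
    feats_3d = F.normalize(feats_3d, dim=-1)
    feats_text = F.normalize(feats_text, dim=-1)
    logits = feats_3d @ feats_text.t() / temperature  # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; contrast both directions.
    loss_3d_to_text = F.cross_entropy(logits, targets)
    loss_text_to_3d = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_3d_to_text + loss_text_to_3d)

def multi_grained_loss(batch):
    # Sum the alignment loss over the three annotation granularities
    # that a corpus like SynVL3D provides (batch keys are assumptions).
    total = 0.0
    for level in ("object", "view", "room"):
        total = total + contrastive_alignment_loss(
            batch[f"{level}_3d"], batch[f"{level}_text"])
    return total
```

Under this sketch, each granularity contributes its own contrastive term, so the encoder is pushed to align language with 3D content at multiple spatial scales rather than at the scene level alone.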