The whole is greater than the sum of its parts, even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjectNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.
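The core augmentation step, composing individual shapes into a multi-object scene with explicit spatial relations and a matching caption, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the placement rule (equal spacing along one axis), the normalization, and the caption template are all assumptions made for the sketch.

```python
import numpy as np

def compose_scene(clouds, names, spacing=2.0):
    """Compose point clouds into one scene and build a spatial caption.

    Hypothetical sketch of compositional augmentation: each shape is
    normalized, offset along the x-axis (an assumed placement rule),
    and the relative positions are verbalized with an assumed template.
    """
    placed, phrases = [], []
    for i, (pts, name) in enumerate(zip(clouds, names)):
        # Center each shape and scale it into a unit sphere.
        pts = pts - pts.mean(axis=0)
        pts = pts / (np.linalg.norm(pts, axis=1).max() + 1e-8)
        # Explicit spatial relation: shift shape i along the x-axis.
        pts = pts + np.array([i * spacing, 0.0, 0.0])
        placed.append(pts)
        if i > 0:
            phrases.append(f"a {name} to the right of a {names[i - 1]}")
    caption = ("a scene with " + ", ".join(phrases)) if phrases else f"a {names[0]}"
    return np.concatenate(placed, axis=0), caption
```

In the full method, such composed scene-caption pairs would be mixed into contrastive training batches (the abstract studies the optimal mixing proportion), and the template caption would be refined by a large language model into a fluent multi-object description.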