Visual spatial description (VSD) aims to generate text that describes the spatial relations between given objects in an image. Existing VSD work models only 2D geometric visual features and is therefore prone to a skewed spatial understanding of the target objects. In this work, we investigate incorporating 3D scene features into VSD. Using an external 3D scene extractor, we obtain 3D object and scene features for the input images, from which we construct a target object-centered 3D spatial scene graph (Go3D-S2G) that models the spatial semantics of the target objects within the holistic 3D scene. We further propose a scene subgraph selection mechanism that samples topologically diverse subgraphs from Go3D-S2G; the diverse local structure features then guide the generation of spatially diversified text. Experimental results on two VSD datasets show that our framework significantly outperforms the baselines, especially on cases with complex visual spatial relations. Our method also produces more spatially diversified generations. Code is available at https://github.com/zhaoyucs/VSD.
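To make the two ideas in the abstract concrete, the following is a minimal sketch, not the implementation in the linked repository. It assumes each detected object is reduced to a 3D center point, links every object to the two target objects and to any other object within a hypothetical distance threshold `radius`, and then samples subgraphs that keep both targets plus different sets of context objects. The function names, the `radius` rule, and the "distinct node set" notion of diversity are all illustrative assumptions; the paper's Go3D-S2G construction and subgraph selection may differ.

```python
import itertools
import math
import random


def build_target_centered_graph(centers, targets, radius):
    """Build an undirected adjacency dict over object ids.

    centers: dict mapping object id -> (x, y, z) center (assumed input format).
    targets: pair of object ids whose relation is to be described.
    radius:  hypothetical distance threshold for linking non-target objects.
    """
    adj = {obj: set() for obj in centers}
    for a, b in itertools.combinations(centers, 2):
        near = math.dist(centers[a], centers[b]) <= radius
        touches_target = a in targets or b in targets
        # Every object is linked to the targets; other pairs only if close in 3D.
        if near or touches_target:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def sample_diverse_subgraphs(adj, targets, k, size, seed=0):
    """Sample up to k subgraphs, each with both targets and `size` context nodes.

    Diversity is approximated here by requiring distinct context node sets.
    """
    rng = random.Random(seed)
    context = sorted(n for n in adj if n not in targets)
    size = min(size, len(context))
    seen, subgraphs = set(), []
    attempts = 0
    while len(subgraphs) < k and attempts < 100 * k:
        attempts += 1
        chosen = frozenset(rng.sample(context, size))
        if chosen in seen:
            continue  # skip a context set we have already used
        seen.add(chosen)
        nodes = set(targets) | chosen
        # Induced subgraph: keep only edges between the selected nodes.
        subgraphs.append({n: adj[n] & nodes for n in nodes})
    return subgraphs


# Toy scene: objects 0 and 1 are the targets, 2-4 are context objects.
centers = {0: (0, 0, 0), 1: (1, 0, 0), 2: (5, 0, 0), 3: (5, 1, 0), 4: (-6, 0, 2)}
graph = build_target_centered_graph(centers, targets=(0, 1), radius=1.5)
subgraphs = sample_diverse_subgraphs(graph, targets=(0, 1), k=3, size=2)
```

In a full pipeline, each sampled subgraph would be encoded and passed to the text generator, so that different subgraphs can lead to different spatial descriptions of the same target pair.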