Visual spatial description (VSD) aims to generate text that describes the spatial relations between given objects in an image. Existing VSD work models only 2D geometric visual features and is therefore prone to a skewed spatial understanding of the target objects. In this work, we investigate incorporating 3D scene features into VSD. Using an external 3D scene extractor, we obtain 3D object and scene features for the input images, and from them we construct a target object-centered 3D spatial scene graph (Go3D-S2G) that models the spatial semantics of the target objects within the holistic 3D scene. We further propose a scene subgraph selection mechanism that samples topologically diverse subgraphs from Go3D-S2G; their diverse local structure features guide the generation of spatially diversified text. Experimental results on two VSD datasets show that our framework significantly outperforms the baselines, with particular gains on cases involving complex visual spatial relations, and that it produces more spatially diverse descriptions. Code is available at https://github.com/zhaoyucs/VSD.
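The two ideas in the abstract, an object-centered scene graph and the selection of topologically diverse subgraphs, can be illustrated with a small sketch. This is not the authors' implementation: the actual Go3D-S2G is built from features produced by a 3D scene extractor, whereas the version below assumes only hypothetical 3D object centers, a hypothetical distance threshold `radius`, and a greedy Jaccard-distance heuristic standing in for the paper's subgraph selection mechanism. All function names are illustrative.

```python
import math
from itertools import combinations


def build_graph(centers, targets, radius):
    """Build an undirected graph centered on the two target objects.

    Assumption: an object is linked to a target if their 3D centers are
    within `radius` of each other. The two targets are always linked.
    """
    edges = {frozenset(targets)}
    for name, pos in centers.items():
        if name in targets:
            continue
        for t in targets:
            if math.dist(pos, centers[t]) <= radius:
                edges.add(frozenset((t, name)))
    return edges


def candidate_subgraphs(edges, targets, size):
    """Enumerate subgraphs with `size` context edges plus the target edge."""
    core = frozenset(targets)
    context = sorted((e for e in edges if e != core), key=sorted)
    return [frozenset({core, *combo}) for combo in combinations(context, size)]


def jaccard_distance(g1, g2):
    """Dissimilarity of two subgraphs, measured on their edge sets."""
    return 1.0 - len(g1 & g2) / len(g1 | g2)


def select_diverse(candidates, k):
    """Greedily pick up to k subgraphs that are far apart from each other."""
    if not candidates:
        return []
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        # Take the candidate whose nearest already-chosen subgraph is farthest.
        best = max(remaining,
                   key=lambda c: min(jaccard_distance(c, s) for s in chosen))
        chosen.append(best)
    return chosen


# Hypothetical 3D centers (x, y, z) for objects detected in one image.
centers = {
    "person": (0.0, 0.0, 2.0),
    "bench": (1.0, 0.0, 2.0),
    "tree": (1.5, 0.0, 3.0),
    "dog": (0.5, 0.0, 1.5),
    "car": (8.0, 0.0, 10.0),
}
targets = ("person", "bench")

edges = build_graph(centers, targets, radius=2.0)
subgraphs = select_diverse(candidate_subgraphs(edges, targets, size=2), k=2)
for sg in subgraphs:
    print(sorted(sorted(e) for e in sg))
```

In this toy setting, each selected subgraph would act as a different structural context for the same target pair, which is the intuition behind generating several spatially distinct descriptions from one scene.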