We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), in which objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.