Text-to-image generation has made remarkable progress with the emergence of diffusion models. However, generating street-view images from text remains difficult, mainly because street scenes combine complex road topology, diverse traffic conditions, and varied weather, which conventional text-to-image models struggle to handle. To address these challenges, we propose a novel controllable text-to-image framework named \textbf{Text2Street}. In this framework, we first introduce a lane-aware road topology generator which, equipped with a counting adapter, performs text-to-map generation with accurate road structure and lane lines, enabling controllable road topology generation. Next, we propose a position-based object layout generator that performs text-to-layout generation through an object-level bounding box diffusion strategy, enabling controllable traffic object layout generation. Finally, we design a multiple-control image generator that integrates the road topology, object layout, and weather description to produce controllable street-view images. Extensive experiments show that the proposed approach achieves controllable street-view text-to-image generation and validate the effectiveness of the Text2Street framework.
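The three-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `StreetPrompt`, `StreetViewPipeline`, and the three callable fields are hypothetical, the split of the prompt into road, traffic, and weather parts is assumed, and the layout stage is shown as conditioned on text only because the abstract does not specify further inputs.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical box format (x_min, y_min, x_max, y_max); not specified by the abstract.
Box = Tuple[float, float, float, float]


@dataclass
class StreetPrompt:
    """Assumed decomposition of the input text into three descriptions."""
    road: str      # e.g. road structure and lane count
    traffic: str   # e.g. which traffic objects appear and where
    weather: str   # e.g. "rainy", "sunny"


@dataclass
class StreetViewPipeline:
    """Chains three stand-in stages in the order the abstract describes them."""
    topology_generator: Callable[[str], Any]                # text -> road map
    layout_generator: Callable[[str], List[Box]]            # text -> object boxes
    image_generator: Callable[[Any, List[Box], str], Any]   # controls -> image

    def generate(self, prompt: StreetPrompt) -> Any:
        # Stage 1: text-to-map (road topology with lane lines).
        road_map = self.topology_generator(prompt.road)
        # Stage 2: text-to-layout (bounding boxes for traffic objects).
        boxes = self.layout_generator(prompt.traffic)
        # Stage 3: combine road map, layout, and weather text into an image.
        return self.image_generator(road_map, boxes, prompt.weather)
```

Each stage is passed in as a plain callable, so trained models can be swapped for simple stand-ins when checking the control flow.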