In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the severely under-constrained nature of single-view NVS: the process offers no means of explicit user control and often produces implausible generations. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes a text-to-image Stable Diffusion model pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments show that TOSS outperforms Zero-1-to-3, producing more plausible, controllable, and multiview-consistent NVS results. Thorough ablations further support these results and underscore the effectiveness and potential of the introduced semantic guidance and architecture design.