We present Farm3D, a method for learning category-specific 3D reconstructors for articulated objects, relying solely on "free" virtual supervision from a pre-trained 2D diffusion-based image generator. Recent approaches can learn a monocular network that predicts the 3D shape, albedo, illumination, and viewpoint of any object instance, given a collection of single-view images of an object category. However, these approaches rely heavily on manually curated, clean training data, which are expensive to obtain. We propose a framework that uses an image generator, such as Stable Diffusion, to generate synthetic training data that are sufficiently clean to require no further manual curation, enabling such a reconstruction network to be learned from scratch. Additionally, we incorporate the diffusion model as a score function to enhance the learning process. The idea is to randomize certain aspects of the reconstruction, such as viewpoint and illumination, render virtual views of the reconstructed 3D object, and let the 2D network assess the quality of the resulting images, thereby providing feedback to the reconstructor. Unlike distillation-based work, which produces a single 3D asset per textual prompt, our approach yields a monocular reconstruction network that outputs a controllable 3D asset from any given image, whether real or generated, in a single forward pass taking only seconds. The network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games.
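To make the data-generation step concrete, the following is a minimal sketch of sampling synthetic single-view training images from an off-the-shelf Stable Diffusion pipeline via the Hugging Face `diffusers` library. The model identifier, prompt, and image count are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: generating synthetic single-view training images of
# one object category with a pre-trained Stable Diffusion pipeline.
# The model ID and prompt below are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # hypothetical choice of checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photograph of a cow standing in a field"
for i in range(8):
    image = pipe(prompt).images[0]  # a PIL.Image
    image.save(f"synthetic_cow_{i:03d}.png")
```

In a full system, such generated images would be filtered only by the generator itself (no manual curation) before being used to train the reconstruction network from scratch.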
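The virtual-view feedback loop described above can likewise be sketched in code. The snippet below is a toy, self-contained approximation of a score-distillation-style update, assuming PyTorch: `Reconstructor`, `render`, and `denoiser` are hypothetical placeholders for the monocular network, a differentiable renderer, and the frozen 2D diffusion model, and the scalar timestep blend is a simplification of a real diffusion noising schedule.

```python
# Toy sketch of the virtual-view feedback loop (not the paper's code):
# predict 3D parameters, render a virtual view under randomized
# viewpoint/illumination, and use a frozen 2D model's noise prediction
# as a score-based gradient for the reconstructor.
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    """Hypothetical monocular reconstructor: image -> 3D parameters."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
    def forward(self, image):
        return self.net(image)

def render(shape_params, viewpoint, illumination):
    """Placeholder differentiable renderer. A real system would use a
    differentiable rasterizer; here we only keep the gradient path."""
    b = shape_params.shape[0]
    img = shape_params.mean(dim=1, keepdim=True) + viewpoint + illumination
    return img.view(b, 1, 1, 1).expand(b, 3, 32, 32)

reconstructor = Reconstructor()
# Frozen stand-in for the diffusion model's denoising network.
denoiser = nn.Conv2d(3, 3, 3, padding=1).requires_grad_(False)
opt = torch.optim.Adam(reconstructor.parameters(), lr=1e-4)

images = torch.rand(4, 3, 32, 32)        # input images (real or generated)
shape_params = reconstructor(images)     # predicted 3D parameters

# Randomize viewpoint and illumination, then render a virtual view.
viewpoint = torch.rand(4, 1)
illumination = torch.rand(4, 1)
virtual_view = render(shape_params, viewpoint, illumination)

# SDS-style feedback: noise the virtual view, let the frozen 2D model
# predict the noise, and treat the residual as a gradient signal.
noise = torch.randn_like(virtual_view)
t = 0.5                                  # simplified scalar timestep blend
noisy = (1 - t) * virtual_view + t * noise
pred_noise = denoiser(noisy)
grad = (pred_noise - noise).detach()     # no backprop through the 2D model
loss = (virtual_view * grad).sum()       # surrogate loss: d(loss)/d(view) = grad
opt.zero_grad()
loss.backward()
opt.step()
```

Note the design choice shared with score distillation: the diffusion model is never updated or backpropagated through; its noise-prediction residual is detached and injected as a gradient on the rendered view, so all learning accrues to the reconstructor.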