Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to text- or image-conditioned generation for synthesizing an entire image, transferring texture, or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method is shown to generalize better to novel objects and to perform surprisingly well on out-of-distribution, in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation. Project page: https://judyye.github.io/affordiffusion-www