Recent successes in image synthesis are powered by large-scale diffusion models. However, most current methods are limited to text- or image-conditioned generation for synthesizing an entire image, transferring texture, or inserting objects into a user-specified region. In contrast, this work focuses on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method generalizes better to novel objects and performs surprisingly well on out-of-distribution, in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation. Project page: https://judyye.github.io/affordiffusion-www
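The two-step structure described in the abstract (sample a layout first, then synthesize content conditioned on it) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Layout` fields, the function names `sample_layout`, `synthesize_content`, and `generate`, and the toy list-of-lists image type are all assumptions made for illustration, and both stages are simple random or rule-based stand-ins rather than diffusion models.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy grayscale image: rows of pixel values


@dataclass
class Layout:
    # Hypothetical parameterization; the abstract only states that the
    # layout is articulation-agnostic, not which fields it contains.
    x: float      # horizontal center of the hand region
    y: float      # vertical center of the hand region
    size: float   # side length of the hand region
    angle: float  # approaching orientation in degrees


def sample_layout(obj: Image, rng: random.Random) -> Layout:
    """Stand-in for LayoutNet: draws a random layout inside the image."""
    h, w = len(obj), len(obj[0])
    return Layout(
        x=rng.uniform(0, w),
        y=rng.uniform(0, h),
        size=rng.uniform(0.1, 0.5) * min(h, w),
        angle=rng.uniform(0, 360),
    )


def synthesize_content(obj: Image, layout: Layout, rng: random.Random) -> Image:
    """Stand-in for ContentNet: marks the layout region on a copy of the input.

    The rng argument is unused by this toy version; a real sampler
    would draw noise from it.
    """
    out = [row[:] for row in obj]  # copy so the input image is not modified
    half = layout.size / 2
    for r in range(len(out)):
        for c in range(len(out[0])):
            if abs(c - layout.x) <= half and abs(r - layout.y) <= half:
                out[r][c] = 1.0
    return out


def generate(
    obj: Image,
    seed: int = 0,
    layout_net: Callable[[Image, random.Random], Layout] = sample_layout,
    content_net: Callable[[Image, Layout, random.Random], Image] = synthesize_content,
) -> Tuple[Layout, Image]:
    """Two-step generation: first sample a layout, then content given that layout."""
    rng = random.Random(seed)
    layout = layout_net(obj, rng)
    image = content_net(obj, layout, rng)
    return layout, image
```

Keeping the two stages as separate callables mirrors the decomposition in the abstract: the layout can be inspected or replaced with a user-specified one before any pixels are synthesized.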