An emerging line of work has sought to generate plausible imagery from touch. Existing approaches, however, tackle only narrow aspects of the visuo-tactile synthesis problem, and lag significantly behind the quality of cross-modal synthesis methods in other domains. We draw on recent advances in latent diffusion to create a model for synthesizing images from tactile signals (and vice versa) and apply it to a number of visuo-tactile synthesis tasks. Using this model, we significantly outperform prior work on the tactile-driven stylization problem, i.e., manipulating an image to match a touch signal, and we are the first to successfully generate images from touch without additional sources of information about the scene. We also successfully use our model to address two novel synthesis problems: generating images that do not contain the touch sensor or the hand holding it, and estimating an image's shading from its reflectance and touch.