If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have made progress in generating visuals from text inputs, the translation of sound into images remains largely unexplored. Sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user, and they provide a means to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, which we call robot synesthesia. For general sound, we encode the simulated paintings and the input sounds into the same latent space. For speech, we decouple the input into its transcribed text and its tone: we use the text to control the content of the painting, and we estimate emotions from the tone to guide its mood. Our approach has been fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting at more than twice the rate expected from random chance. We discuss our results on sound-guided image manipulation and music-guided paintings qualitatively.