Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet artificial intelligence has advanced primarily in digital modalities such as text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. By connecting AI to the human senses and to a rich spectrum of signals, from physiological and tactile cues on the body to physical and social signals in homes, cities, and the environment, this new set of technologies can change how humans and AI experience and interact with one another. We outline how this field must advance through three interrelated themes: sensing, science, and synergy. First, research in sensing should extend how AI captures the world, in richer ways and beyond the digital medium. Second, we must develop a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges in learning synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper is a series of projects, resources, and demos of the latest advances from the Multisensory Intelligence group at the MIT Media Lab; see https://mit-mi.github.io/.