Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas, with practical benefits such as supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. By synthesizing a range of theoretical frameworks and application domains, this thesis aims to advance the machine learning foundations of multisensory AI. In the first part, we present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task. These interactions are the basic building blocks of all multimodal problems, and quantifying them enables users to understand their multimodal datasets, design principled approaches to learn these interactions, and analyze whether their model has succeeded in learning them. In the second part, we study the design of practical multimodal foundation models that generalize over many modalities and tasks, which represents a step toward grounding large language models in real-world sensory modalities. We introduce MultiBench, a unified large-scale benchmark across a wide range of modalities, tasks, and research areas, followed by the cross-modal attention and multimodal transformer architectures that now underpin many of today's multimodal foundation models. Scaling these architectures on MultiBench enables the creation of general-purpose multisensory AI systems, and we discuss our collaborative efforts in applying these models for real-world impact in affective computing, mental health, cancer prognosis, and robotics. Finally, we conclude this thesis by discussing how future work can leverage these ideas toward more general, interactive, and safe multisensory AI.