Human perception inherently operates in a multimodal manner. Similarly, as machines interpret the empirical world, their learning processes ought to be multimodal. The recent, remarkable successes in empirical multimodal learning underscore the significance of understanding this paradigm. Yet, a solid theoretical foundation for multimodal learning has eluded the field for some time. While a recent study by Lu (2023) has shown the superior sample complexity of multimodal learning compared to its unimodal counterpart, another basic question remains: does multimodal learning also offer computational advantages over unimodal learning? This work initiates a study on the computational benefit of multimodal learning. We demonstrate that, under certain conditions, multimodal learning can outpace unimodal learning exponentially in terms of computation. Specifically, we present a learning task that is NP-hard for unimodal learning but is solvable in polynomial time by a multimodal algorithm. Our construction is based on a novel modification to the intersection of two half-spaces problem.
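For orientation, the classical intersection of two half-spaces concept that the construction starts from can be written as below. This is the standard textbook formulation, not the paper's modified task, and the symbols $w_1, w_2, \theta_1, \theta_2$ and the dimension $n$ are notation assumed for this sketch rather than taken from the paper.

```latex
% Standard definition (assumed notation): a point is labeled positive
% exactly when it lies in both half-spaces.
h_{w_1, w_2, \theta_1, \theta_2}(x) =
\begin{cases}
1 & \text{if } \langle w_1, x \rangle \ge \theta_1 \text{ and } \langle w_2, x \rangle \ge \theta_2, \\
0 & \text{otherwise,}
\end{cases}
\qquad x \in \mathbb{R}^n,\; w_1, w_2 \in \mathbb{R}^n,\; \theta_1, \theta_2 \in \mathbb{R}.
```

A learner sees labeled examples $(x, h(x))$ and must output a hypothesis that agrees with the labels. The abstract's claim concerns a modified version of this task, which is NP-hard in the unimodal setting and solvable in polynomial time by a multimodal algorithm.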