Perceiving multi-modal information and conversing with humans is a long-standing goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, because multi-modal dialogue data is scarce, research on multi-modal dialogue pre-training remains limited. A further challenge arises from the broad scope of multi-modal dialogue, which spans many modalities and tasks. Moreover, new forms of tasks may emerge at unpredictable points in the future. Multi-modal dialogue models must therefore be flexible enough to adapt to such scenarios. This paper proposes \textbf{PaCE}, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data together with extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which previously trained experts assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.