Converting different modalities into generalized text, which then serves as input prompts for large language models (LLMs), is a common approach to aligning multimodal models, particularly when paired data is limited. Text-centric alignment methods exploit the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation so that downstream models can effectively interpret inputs from various modalities. This study evaluates the quality and robustness of such multimodal representations under noise imperfections, dynamic input-order permutations, and missing modalities, revealing that current text-centric alignment methods can compromise downstream robustness. To address this issue, we propose a new text-centric adversarial training approach that significantly enhances robustness compared with traditional robust-training methods and pre-trained multimodal foundation models. Our findings underscore the potential of this approach to strengthen the robustness and adaptability of multimodal representations, offering a promising solution for dynamic, real-world applications.
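To make the three robustness conditions concrete, the sketch below shows one plausible way to train against them: each modality is first serialized to text, and the combined prompt is then perturbed with token-level noise, a random modality ordering, and random modality dropout before being fed to the downstream model. All function names and the example data are illustrative assumptions, not the paper's implementation; a full adversarial variant would additionally search for worst-case perturbations rather than sampling them at random.

```python
import random

def add_text_noise(text: str, p: float = 0.1) -> str:
    """Randomly drop tokens to simulate noisy text serialization
    (hypothetical stand-in for the noise-imperfection condition)."""
    tokens = text.split()
    kept = [t for t in tokens if random.random() > p]
    return " ".join(kept) if kept else text

def perturb_prompt(modal_texts: dict, drop_prob: float = 0.2) -> str:
    """Build one training prompt from per-modality text descriptions,
    applying the three perturbation types discussed above."""
    names = list(modal_texts)
    random.shuffle(names)                 # dynamic input-order permutation
    parts = []
    for name in names:
        if random.random() < drop_prob:   # missing modality
            continue
        parts.append(f"{name}: {add_text_noise(modal_texts[name])}")
    return "\n".join(parts)

# Usage: at every training step the downstream model sees a freshly
# perturbed prompt, encouraging representations that remain stable
# under noise, reordering, and absent modalities.
example = {
    "image":   "a chest X-ray showing mild opacity in the left lung",
    "tabular": "age 54, blood pressure 130/85, non-smoker",
    "audio":   "patient reports a persistent dry cough",
}
print(perturb_prompt(example))
```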