Multi-modal large language models (MLLMs) are built on large language models (LLMs) and extend them with the ability to comprehend multi-modal inputs and generate textual responses. While they excel at multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested. In this study, we look beyond the usual multi-modal evaluations and unveil an intriguing characteristic of MLLMs: our preliminary results suggest that visual instruction tuning, a prevailing strategy for turning LLMs into MLLMs, unexpectedly helps models attain both improved truthfulness and improved ethical alignment in the pure NLP context. For example, a visual-instruction-tuned LLaMA2 7B model outperforms the LLaMA2-chat 7B model, which is fine-tuned with over one million human annotations, on the TruthfulQA-mc and Ethics benchmarks. Further analysis reveals that the improved alignment can be attributed to the superior instruction quality inherent to visual-text data. By releasing our code at github.com/UCSC-VLAA/Sight-Beyond-Text, we hope to foster further exploration into the intrinsic value of visual-text synergies and, more broadly, of multi-modal interactions in alignment research.