Recent research has shown that multi-task fine-tuning of multi-modal Large Language Models (LLMs) on an assortment of annotated downstream vision-language datasets significantly enhances their performance. This process, however, has a side effect, which we term the "multi-modal alignment tax": because raw annotations are overly succinct and unformatted, the model's ability to format responses appropriately (its "politeness", for instance) degrades, which reduces human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we obtain the PF-1M dataset and validate its value by fine-tuning a multi-modal LLM on it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates advantages in both multi-modal understanding and response politeness according to automated and human evaluations.