Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when they undergo post-pretraining or supervised fine-tuning (SFT) on domain-specific data. Moreover, Multimodal Large Language Models (MLLMs), which combine an LLM base with a visual projector (e.g., LLaVA), show a significant decline in performance on language benchmarks compared to their single-modality counterparts. To address these challenges, we introduce Tree Generation (TG), a novel model-agnostic self-decompression method that decompresses the knowledge within an LLM into a training corpus. This paper focuses on TG-SFT, which synthetically generates SFT data for the instruction-tuning stage. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.