Visual instruction tuning is a key training stage of large multimodal models (LMMs). However, the common practice of indiscriminately mixing instruction-following data from various tasks may yield suboptimal overall performance, because instruction formats and knowledge domains differ across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm for multi-task visual instruction tuning of LMMs. To our knowledge, this is the first work to explore multi-task optimization in visual instruction tuning. Specifically, we consider two key dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon in which learning one task may improve performance on other tasks owing to overlapping knowledge domains, and (2) Intra-Task Difficulty, the learning difficulty within a single task. We quantify both dimensions with performance-based metrics and balance tasks by assigning greater weight to tasks that contribute substantially to others, receive little contribution from others, and have high intra-task difficulty. Experiments show that CoTBal leads to superior overall performance in multi-task visual instruction tuning.
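The weighting principle stated above can be illustrated with a minimal sketch. It is not the CoTBal formulation itself, which the abstract does not specify. The function name `balance_task_weights`, the additive score (contribution given, minus contribution received, plus difficulty), the softmax normalization, the `temperature` parameter, and all numeric values are assumptions introduced only for illustration.

```python
import math

def balance_task_weights(contribution, difficulty, temperature=1.0):
    """Hypothetical task-weighting rule (an assumption, not the paper's formula).

    contribution[i][j]: assumed performance-based estimate of how much
                        learning task i helps task j.
    difficulty[i]:      assumed performance-based estimate of how hard
                        task i is to learn on its own.
    Returns one weight per task; the weights sum to the number of tasks.
    """
    n = len(difficulty)
    scores = []
    for i in range(n):
        given = sum(contribution[i][j] for j in range(n) if j != i)
        received = sum(contribution[j][i] for j in range(n) if j != i)
        # Higher score: helps others more, is helped less, and is harder.
        scores.append(given - received + difficulty[i])

    # Softmax over scores, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [n * e / total for e in exps]

# Made-up values for three tasks, for illustration only.
contribution = [
    [0.0, 0.3, 0.1],
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
]
difficulty = [0.5, 0.2, 0.4]
weights = balance_task_weights(contribution, difficulty)
```

In this toy setting the first task contributes the most to the others and is the hardest, so it receives the largest weight. The second task receives the most contribution from the others and is the easiest, so it receives the smallest. In training, such weights would typically scale each task's loss before summation, which is again an assumption about usage rather than a detail given here.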