Parameter-efficient fine-tuning of multimodal large language models (MLLMs) presents significant challenges, including reliance on high-level visual features, which limits fine-grained detail comprehension, and data conflicts that arise from task complexity. To address these issues, we propose an efficient fine-tuning framework with two novel approaches: Vision Cue Enhancement (VCE) and Dual Low-Rank Adaptation (Dual-LoRA). VCE enhances the vision projector by integrating multi-level visual cues, improving the model's ability to capture fine-grained visual features. Dual-LoRA introduces a dual low-rank structure for instruction tuning, decoupling learning into skill and task spaces to enable precise control and efficient adaptation across diverse tasks. Our method simplifies implementation, enhances visual comprehension, and improves adaptability. Experiments on both downstream tasks and general benchmarks demonstrate the effectiveness of the proposed approach.
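To make the dual low-rank idea concrete, the following is a minimal PyTorch sketch of a linear layer augmented with two low-rank branches, one for a shared "skill" space and one for a "task" space, as described above. The branch names (skill_A, task_A), ranks, scaling, and the simple additive combination are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class DualLoRALinear(nn.Module):
    """Frozen base linear layer with two low-rank adaptation branches.

    Hypothetical sketch: one branch models a shared skill space and the
    other a task-specific space; here they are simply summed, which is an
    assumption, since the abstract does not specify how the two spaces are
    combined or gated.
    """

    def __init__(self, base: nn.Linear, r_skill: int = 8, r_task: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pretrained weights frozen

        d_in, d_out = base.in_features, base.out_features
        # Skill-space low-rank factors (assumed shared across tasks)
        self.skill_A = nn.Parameter(torch.randn(r_skill, d_in) * 0.01)
        self.skill_B = nn.Parameter(torch.zeros(d_out, r_skill))
        # Task-space low-rank factors (assumed specific to the current task)
        self.task_A = nn.Parameter(torch.randn(r_task, d_in) * 0.01)
        self.task_B = nn.Parameter(torch.zeros(d_out, r_task))
        self.scale = alpha / max(r_skill, r_task)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        skill_delta = (x @ self.skill_A.t()) @ self.skill_B.t()
        task_delta = (x @ self.task_A.t()) @ self.task_B.t()
        # Zero-initialized B matrices make the adapters a no-op at the start of training.
        return out + self.scale * (skill_delta + task_delta)


if __name__ == "__main__":
    layer = DualLoRALinear(nn.Linear(512, 512))
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

In practice such a wrapper would replace selected projection layers of the language model during instruction tuning, with only the low-rank factors trained; the choice of which layers to wrap and how the two spaces are regularized is left to the full method description.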