Injecting world knowledge into pretrained multimodal large language models (MLLMs) is essential for domain-specific applications. Task-specific fine-tuning does this by adapting an MLLM to high-quality in-domain data, but it scales poorly as datasets grow, forcing a trade-off between performance and computational cost. Existing data selection methods rely on either additional scoring models or heuristic clustering, and neither accounts for data importance and diversity at the same time. Both approaches also overlook the interplay among training samples. To address these limitations, we propose CLIPPER, a training-free data selection pipeline that separates parametric knowledge from world knowledge and uses in-context learning to probe the model's responses to different demonstration-query combinations. CLIPPER identifies coresets that mirror the perplexity distribution of the original dataset, preserving critical samples while maintaining diversity. Experiments on two MLLMs and three datasets show that CLIPPER matches full fine-tuning performance at significantly lower cost: Qwen2.5-VL-7B attains 47% data efficiency on VRSBench, and Llama-3.2-11B-Vision-Instruct reduces ScienceQA training time by 37%.
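To make the idea of a coreset that mirrors the original dataset's perplexity distribution concrete, the following is a minimal sketch and not the CLIPPER pipeline itself. It assumes a per-sample perplexity has already been computed by some means (for example, from token log-probabilities returned by the model under an in-context prompt) and then draws one sample from each equal-population stratum of the sorted perplexities. The function names `perplexity` and `select_coreset`, the `budget` and `seed` parameters, and the stratified-sampling rule are all illustrative assumptions, not details taken from the abstract.

```python
import math
import random
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_coreset(perplexities: Sequence[float], budget: int, seed: int = 0) -> List[int]:
    """Return `budget` sample indices whose perplexities span the same
    quantiles as the full dataset (hypothetical selection rule).

    The samples are sorted by perplexity and cut into `budget` consecutive
    strata of near-equal size; one index is drawn at random from each stratum.
    """
    n = len(perplexities)
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the dataset size")
    rng = random.Random(seed)
    # Indices of the samples, ordered from lowest to highest perplexity.
    order = sorted(range(n), key=lambda i: perplexities[i])
    selected = []
    for j in range(budget):
        lo = j * n // budget          # first rank in stratum j
        hi = (j + 1) * n // budget    # one past the last rank (hi > lo since n >= budget)
        selected.append(order[rng.randrange(lo, hi)])
    return selected


if __name__ == "__main__":
    # Synthetic perplexities stand in for real model measurements.
    gen = random.Random(42)
    fake_ppl = [math.exp(gen.gauss(1.0, 0.5)) for _ in range(1000)]
    coreset = select_coreset(fake_ppl, budget=100)
    print(len(coreset), "samples selected out of", len(fake_ppl))
```

Because every stratum contributes exactly one sample, both low-perplexity and high-perplexity regions of the data stay represented in proportion to their share of the full set. How CLIPPER actually weighs importance against diversity, and how it uses the demonstration-query combinations, is not specified in the abstract and is not modeled here.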