Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the instruction-tuning data budget often causes regressions, because heuristic selection strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction-data curation.
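The three-stage pipeline the abstract outlines (unsupervised capability discovery from gradient trajectories, influence-based data attribution, and balanced selection under a small budget) can be sketched as follows. This is a minimal illustrative toy, not the paper's actual method: the synthetic "gradient features", the plain k-means for capability discovery, and the centroid-distance influence proxy are all assumptions standing in for the real gradient projections and influence estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example gradient features (n x d). In practice these
# would be low-dimensional projections of each example's training
# gradient; here we simulate 4 latent capabilities as Gaussian clusters.
true_centers = rng.normal(scale=5.0, size=(4, 16))
grad_feats = np.concatenate([c + rng.normal(size=(150, 16)) for c in true_centers])

def kmeans(x, k, iters=50, seed=0):
    """Plain k-means as a stand-in for unsupervised capability discovery."""
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        centers = np.stack([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers, labels

# Stage 1: discover capabilities as clusters in gradient space.
k = 4
centers, assignment = kmeans(grad_feats, k)

# Stage 2: attribute each example to capabilities with an influence
# proxy (negative distance to each capability centroid).
influence = -np.linalg.norm(grad_feats[:, None] - centers[None], axis=-1)

# Stage 3: balanced selection -- an equal budget per capability, taking
# the most influential examples first (32 of 600 examples, ~5%).
budget_per_cap = 8
selected = []
for c in range(k):
    idx = np.where(assignment == c)[0]
    top = idx[np.argsort(-influence[idx, c])][:budget_per_cap]
    selected.extend(top.tolist())

print(len(selected), "examples selected out of", len(grad_feats))
```

The staged-sequencing step of the curriculum is omitted here; one simple realization would be to order the selected examples by capability cluster, training on one capability's data per stage.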