Evian: Towards Explainable Visual Instruction-tuning Data Auditing

The efficacy of Large Vision-Language Models (LVLMs) depends critically on the quality of their training data, which must balance visual fidelity with instruction-following capability. Existing datasets, however, are of inconsistent quality, and current data filtering methods rely on coarse-grained scores that cannot identify nuanced semantic flaws such as logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address it, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects, providing a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into three constituent cognitive components (visual description, subjective inference, and factual claim), enabling targeted analysis of each. Third, we instantiate this paradigm in EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpasses models trained on orders-of-magnitude larger datasets. We also find that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.
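
To make the "Decomposition-then-Evaluation" idea concrete, the following is a minimal Python sketch, not the EVIAN implementation. It assumes a response has already been split into typed segments and that each axis has a scoring function returning a value in [0, 1]. All names here (`Segment`, `AuditReport`, `audit`, `AXIS_FOR`) are hypothetical, and both the one-to-one pairing of component types with axes and the minimum-score aggregation are illustrative assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Component(Enum):
    VISUAL_DESCRIPTION = "visual_description"
    SUBJECTIVE_INFERENCE = "subjective_inference"
    FACTUAL_CLAIM = "factual_claim"


class Axis(Enum):
    IMAGE_TEXT_CONSISTENCY = "image_text_consistency"
    LOGICAL_COHERENCE = "logical_coherence"
    FACTUAL_ACCURACY = "factual_accuracy"


# Assumption: each component type is checked on exactly one axis.
AXIS_FOR: Dict[Component, Axis] = {
    Component.VISUAL_DESCRIPTION: Axis.IMAGE_TEXT_CONSISTENCY,
    Component.SUBJECTIVE_INFERENCE: Axis.LOGICAL_COHERENCE,
    Component.FACTUAL_CLAIM: Axis.FACTUAL_ACCURACY,
}


@dataclass
class Segment:
    kind: Component
    text: str


@dataclass
class AuditReport:
    scores: Dict[Axis, float]   # one aggregated score per axis
    flagged: List[Segment]      # segments that fell below the threshold

    def passes(self, threshold: float) -> bool:
        return all(score >= threshold for score in self.scores.values())


# A scorer maps one segment to a score in [0, 1]; in practice it would
# wrap a model call, which is stubbed out in this sketch.
Scorer = Callable[[Segment], float]


def audit(segments: List[Segment],
          scorers: Dict[Axis, Scorer],
          threshold: float = 0.5) -> AuditReport:
    """Score each segment on its axis and collect an explainable report."""
    per_axis: Dict[Axis, List[float]] = {axis: [] for axis in Axis}
    flagged: List[Segment] = []
    for seg in segments:
        axis = AXIS_FOR[seg.kind]
        score = scorers[axis](seg)
        per_axis[axis].append(score)
        if score < threshold:
            flagged.append(seg)
    # Assumption: an axis is only as good as its weakest segment;
    # an axis with no segments is treated as unproblematic (1.0).
    scores = {axis: min(vals) if vals else 1.0
              for axis, vals in per_axis.items()}
    return AuditReport(scores=scores, flagged=flagged)
```

Because the report keeps the flagged segments alongside the per-axis scores, a rejected sample can be traced back to the specific sentence that failed, which is the sense in which such an audit is explainable rather than a single opaque score.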
