Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, existing methods focus primarily on unimodal language models, and the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper studies visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework, comprising a dataset and several evaluation criteria, to assess the quality of decomposed sub-questions; our evaluation reveals that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a dedicated finetuning dataset, DecoVQA+, for enhancing models' question decomposition capability. To enable models to decide when decomposition is appropriate, we further propose an efficient finetuning pipeline that combines our dataset with a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements both in the quality of their sub-questions and in their selective decomposition policy, and they also achieve higher accuracy with selective decomposition on VQA benchmark datasets.