Visual Question Answering (VQA) systems are notoriously brittle under distribution shift and data scarcity. Previous solutions, such as ensemble methods and data augmentation, can improve performance in a single setting but fail to generalise across in-distribution (IID), out-of-distribution (OOD), and low-data settings simultaneously. We argue that this limitation stems from suboptimal training strategies: treating all training samples uniformly, without accounting for question difficulty or semantic structure, leaves models vulnerable to dataset biases, so they struggle to generalise beyond the training distribution. To address this issue, we introduce Task-Progressive Curriculum Learning (TPCL), a simple, model-agnostic framework that trains VQA models progressively using a curriculum built by jointly considering question type and difficulty. TPCL first groups questions by semantic type (e.g., yes/no, counting) and then orders them using a novel Optimal Transport-based difficulty measure. Without relying on data augmentation or explicit debiasing, TPCL improves generalisation across IID, OOD, and low-data regimes and achieves state-of-the-art performance on VQA-CP v2, VQA-CP v1, and VQA v2. It outperforms the most competitive robust VQA baselines by over 5% and 7% on VQA-CP v2 and VQA-CP v1, respectively, and boosts backbone performance by up to 28.5%.
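The two-step curriculum described in the abstract (group questions by semantic type, then order by an Optimal Transport-based difficulty) can be illustrated with a minimal sketch. The abstract does not specify the actual difficulty measure, so everything below is an assumption made for illustration: the sample keys `qtype` and `loss`, the choice of a 1-D Wasserstein-1 distance between a group's per-sample losses and a reference loss sample, the ordering of whole groups rather than individual questions, and the cumulative stage schedule. It should not be read as the TPCL implementation.

```python
from bisect import bisect_right
from collections import defaultdict


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Computed as the integral of |F_a(x) - F_b(x)| over x, where F_a and
    F_b are the empirical CDFs. Sample sizes may differ.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def order_question_types(samples, reference):
    """Return question types sorted from easiest to hardest.

    Assumption: a type is "harder" the further its loss distribution
    lies from `reference`, a hypothetical sample of "easy" losses.
    """
    losses_by_type = defaultdict(list)
    for s in samples:
        losses_by_type[s["qtype"]].append(s["loss"])
    return sorted(
        losses_by_type,
        key=lambda t: wasserstein_1d(losses_by_type[t], reference),
    )


def curriculum_stages(samples, reference):
    """Build cumulative training stages, adding one question type per stage.

    Assumption: the progressive schedule keeps earlier types in later stages.
    """
    stages, seen = [], []
    for qtype in order_question_types(samples, reference):
        seen.extend(s for s in samples if s["qtype"] == qtype)
        stages.append(list(seen))
    return stages


# Toy data: losses would come from a hypothetical warm-up model.
samples = [
    {"qtype": "yes/no", "loss": 0.1},
    {"qtype": "yes/no", "loss": 0.2},
    {"qtype": "counting", "loss": 0.9},
    {"qtype": "counting", "loss": 1.1},
    {"qtype": "other", "loss": 0.5},
]
reference = [0.0]  # point mass at zero loss, chosen only for illustration

order = order_question_types(samples, reference)
stages = curriculum_stages(samples, reference)
print(order)                     # ['yes/no', 'other', 'counting']
print([len(s) for s in stages])  # [2, 3, 5]
```

With a point-mass reference at zero and non-negative losses, the distance reduces to the group's mean loss. A richer reference sample, or a multi-dimensional transport cost, would produce a different and possibly more informative ordering.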