In this report, we present our winning solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge. Unlike common VQA and visual grounding (VG) tasks, this challenge involves a more complex scenario: inferring and locating the object that is implicitly specified by a given interrogative question. For this task, we leverage ViT-Adapter, a pre-training-free adapter network, to adapt the multi-modal pre-trained Uni-Perceiver for better cross-modal localization. Our method ranks first on the leaderboard, achieving 77.5 and 76.347 IoU on the public and private test sets, respectively. This result shows that ViT-Adapter is also an effective paradigm for adapting the unified perception model to vision-language downstream tasks. Code and models will be released at https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023.
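For readers unfamiliar with the reported metric, the following is a minimal sketch of how intersection over union (IoU) between a predicted and a ground-truth box can be computed and averaged over a test set. It assumes boxes are given as `(left, top, right, bottom)` pixel coordinates and that the leaderboard score is the mean IoU expressed as a percentage; the function names `box_iou` and `mean_iou` are illustrative, and the challenge's official evaluation script may differ in detail.

```python
from typing import Sequence, Tuple

# Assumed box format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])

    # Clamp to zero so that disjoint boxes have no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter

    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over paired predictions and ground truths, as a percentage."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    return 100.0 * sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Under these assumptions, a score such as 77.5 would correspond to an average overlap of 0.775 between predicted and annotated boxes.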