In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the successful application of multimodal pre-trained models to various downstream tasks (e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task: given an image and a question as input, the model is guided to answer the question and express the answer as a bounding box on the image. We designed a three-stage solution for this task, using the vision-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model on it to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage, and continued fine-tuning the model on the competition dataset, transferring the semantic knowledge learned in the first stage to the competition task. Finally, we designed a bounding-box matching and replacing post-processing strategy to correct the model's predictions. Our team achieved a score of 76.342 on the final leaderboard, ranking second.
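The abstract does not specify how the matching-and-replacing post-processing works; one plausible sketch, assuming candidate boxes come from some auxiliary source (e.g., an object detector) and matching is done by intersection-over-union with a hypothetical threshold of 0.5, is:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_and_replace(pred_box, candidate_boxes, threshold=0.5):
    """Illustrative post-processing: snap the model's predicted box to the
    best-overlapping candidate box when the overlap exceeds the threshold;
    otherwise keep the original prediction. The candidate source and
    threshold here are assumptions, not the authors' stated method."""
    best_box, best_iou = pred_box, 0.0
    for cand in candidate_boxes:
        score = iou(pred_box, cand)
        if score > best_iou:
            best_box, best_iou = cand, score
    return best_box if best_iou >= threshold else pred_box
```

The intuition is that a generative grounding model may localize the right object but draw a slightly loose box, so replacing it with a tightly fitting detector box can raise the IoU-based leaderboard score.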