This work explores the zero-shot capabilities of foundation models on Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, that overcomes the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study evaluates the system without fine-tuning it on specific VQA datasets, which makes it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, suggesting directions for future research.