This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. First, GAN-based approaches generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Second, autoencoder-based techniques learn optimal embeddings for questions and images, achieving results comparable to the GAN-based approach owing to better performance on complex questions. Finally, attention mechanisms incorporating Multimodal Compact Bilinear pooling (MCB) address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attention mechanisms.
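To make the MCB operation used in the attention-based variant concrete, the following is a minimal NumPy sketch, not the implementation studied here: MCB approximates the outer (bilinear) product of an image feature and a question feature by Count Sketch projections combined via circular convolution in the frequency domain. The projection dimension `d=1024`, the fixed random seed, and the function names are illustrative assumptions.

```python
import numpy as np

def count_sketch(v, h, s, d):
    # Project vector v into d dimensions via Count Sketch:
    # entry v[i] is multiplied by sign s[i] and added to bucket h[i].
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

def mcb(x, q, d=1024, seed=0):
    # Multimodal Compact Bilinear pooling: approximate the outer
    # product of x (image feature) and q (question feature) by the
    # circular convolution of their sketches, computed via FFT.
    rng = np.random.default_rng(seed)
    hx = rng.integers(0, d, x.size)
    sx = rng.choice([-1.0, 1.0], x.size)
    hq = rng.integers(0, d, q.size)
    sq = rng.choice([-1.0, 1.0], q.size)
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fq = np.fft.fft(count_sketch(q, hq, sq, d))
    return np.fft.ifft(fx * fq).real
```

Because Count Sketch and the FFT are linear in each input, the pooled vector is bilinear in (x, q), which is what lets MCB stand in for a full (and much larger) outer-product feature.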