This study explores methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. First, GAN-based approaches generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Second, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving results comparable to the GAN approach owing to their better handling of complex questions. Last, attention mechanisms incorporating Multimodal Compact Bilinear pooling (MCB) address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attention mechanisms.
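To make the MCB component concrete: MCB approximates the outer product of the image and question feature vectors by projecting each into a Count Sketch and convolving the sketches via FFT, which keeps the fused dimension small. The following is a minimal NumPy sketch of that idea only; the function names, feature dimensions, and per-call random hashes are illustrative (in a real model the hash indices and signs are sampled once and fixed), and this is not the authors' implementation.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into a d-dim Count Sketch using hash indices h and signs s."""
    y = np.zeros(d)
    for i, v in enumerate(x):
        y[h[i]] += s[i] * v
    return y

def mcb_pool(img_feat, q_feat, d, rng):
    """Fuse two feature vectors with Multimodal Compact Bilinear pooling.

    Convolution of the two Count Sketches (done via FFT) approximates the
    Count Sketch of their outer product, i.e. bilinear pooling in d dims.
    Hashes are drawn here per call for brevity; in practice they are fixed.
    """
    h_i = rng.integers(0, d, size=img_feat.shape[0])
    s_i = rng.choice([-1, 1], size=img_feat.shape[0])
    h_q = rng.integers(0, d, size=q_feat.shape[0])
    s_q = rng.choice([-1, 1], size=q_feat.shape[0])
    sk_i = count_sketch(img_feat, h_i, s_i, d)
    sk_q = count_sketch(q_feat, h_q, s_q, d)
    # elementwise product in the frequency domain = circular convolution
    return np.real(np.fft.ifft(np.fft.fft(sk_i) * np.fft.fft(sk_q)))
```

The trade-off mentioned above is visible here: the sketch dimension `d` controls both the quality of the bilinear approximation and the compute cost of the fusion.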