Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advances in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches by retrieval method, fusion technique, and answer generation strategy, and we analyze benchmark datasets, evaluation protocols, and performance tradeoffs. We also highlight key challenges, including cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust, context-aware QA systems that leverage multimedia data.