Audio question answering (AQA) is the task of producing a natural language answer given an audio signal and a natural language question about it. In this paper, we propose neural network architectures based on self-attention and cross-attention for the AQA task. The self-attention layers extract audio and textual representations, and the cross-attention layers select the audio features that are relevant to the textual features in order to produce an answer. All our models are trained on the recently proposed Clotho-AQA dataset, for both binary yes/no questions and single-word answer questions. Our models outperform the reference method reported in the original paper on both tasks. On the binary yes/no classification task, our proposed model achieves an accuracy of 68.3%, compared to 62.7% for the reference model. On the single-word answer task, treated as multiclass classification, our model achieves top-1 and top-5 accuracies of 57.9% and 99.8%, compared to 54.2% and 93.7%, respectively, for the reference model. We further discuss some of the challenges in the Clotho-AQA dataset, such as the same answer word appearing in multiple tenses or in both singular and plural forms, and the presence of both specific and generic answers to the same question. We address these issues and present a revised version of the dataset.
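As an illustration of the cross-attention step described above, the following is a minimal sketch of scaled dot-product cross-attention in plain Python. It assumes that textual features act as queries and audio features act as keys and values. This assignment, the function names `softmax` and `cross_attention`, and the toy feature values are all assumptions made for illustration; they are not taken from the proposed architecture, which may use learned projections, multiple heads, and a different query/key assignment.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, audio_keys, audio_values):
    """Scaled dot-product cross-attention (illustrative only).

    text_queries: list of question-token feature vectors (assumed queries)
    audio_keys:   list of audio-frame feature vectors (assumed keys)
    audio_values: list of audio-frame feature vectors (assumed values)
    Returns one attended audio vector per query, plus the attention weights.
    """
    dim = len(audio_keys[0])
    out_dim = len(audio_values[0])
    attended, all_weights = [], []
    for query in text_queries:
        # Similarity of this text token to every audio frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in audio_keys
        ]
        weights = softmax(scores)
        # Weighted sum of audio values: frames relevant to the token dominate.
        vector = [
            sum(w * value[c] for w, value in zip(weights, audio_values))
            for c in range(out_dim)
        ]
        attended.append(vector)
        all_weights.append(weights)
    return attended, all_weights


# Toy example: two text tokens, three audio frames, feature dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, weights = cross_attention(text, audio, audio)
```

In this toy example, each text token places most of its attention weight on the audio frame whose features align with it, which is the behaviour the cross-attention layers are meant to exploit when selecting audio evidence for a question.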