The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. The task was inspired by Visual Question Answering (VQA). Building on the previously introduced CLEAR dataset, we propose CLEAR2, a new benchmark for AQA that emphasizes the challenges specific to acoustic inputs. These include handling scenes of variable duration and scenes built from elementary sounds that differ between the training and test sets. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. Using 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and reduces model complexity. We show that time coordinate maps improve temporal localization, which raises the network's performance by ~17 percentage points, whereas frequency coordinate maps have little influence on this task. NAAQA achieves 79.5% accuracy on the AQA task with ~4 times fewer parameters than the previously explored VQA model. We evaluate the performance of NAAQA on an independent dataset reconstructed from DAQA, and we test the addition of a MALiMo module to our model on both CLEAR2 and DAQA. We provide a detailed analysis of the results by question type. We release the code to produce CLEAR2, as well as NAAQA, to foster research in this newly emerging machine learning task.
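To make the idea of coordinate maps concrete, the following is a minimal sketch of how a time (and optionally frequency) coordinate channel could be stacked onto a 2D spectrogram before it is fed to a convolutional network. It is an illustration only: the function name `add_coordinate_maps`, the normalization of coordinates to [-1, 1], and the 64 x 200 input size are assumptions made here, not details taken from the NAAQA implementation.

```python
import numpy as np


def add_coordinate_maps(spec, add_time=True, add_freq=False):
    """Stack coordinate channels onto a (freq, time) spectrogram.

    Hypothetical helper: returns an array of shape (channels, freq, time),
    where channel 0 is the original spectrogram.
    """
    n_freq, n_time = spec.shape
    channels = [spec]
    if add_time:
        # Every row repeats the normalized time position of each frame.
        # The [-1, 1] range is an assumption for this sketch.
        t = np.linspace(-1.0, 1.0, n_time)
        channels.append(np.broadcast_to(t, (n_freq, n_time)))
    if add_freq:
        # Every column repeats the normalized position of each frequency bin.
        f = np.linspace(-1.0, 1.0, n_freq)
        channels.append(np.broadcast_to(f[:, None], (n_freq, n_time)))
    return np.stack(channels, axis=0)


# Hypothetical input: 64 frequency bins by 200 time frames.
spec = np.random.default_rng(0).random((64, 200))
x = add_coordinate_maps(spec, add_time=True, add_freq=True)
print(x.shape)
```

With the extra time channel, a convolutional filter sees where along the scene it is being applied, which is one plausible way for position-dependent questions (for example, about the first or last sound) to become easier to answer.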