Multi-modal contrastive learning in the audio-text domain has quickly become a highly active area of research. Most works are evaluated on standard audio retrieval and classification benchmarks, under the assumptions that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet truly understand natural language, in particular contextual concepts such as the sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are not sufficient to assess these models' ability to match complex contexts across the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in the text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse data to allow future research to fully leverage natural language for audio-text modeling.