Underwater acoustic target recognition is a difficult task because of complex acoustic source characteristics and sound propagation patterns. Limited by insufficient data and a narrow information perspective, deep learning-based recognition models remain far from satisfactory in practical underwater scenarios. Although underwater acoustic signals are strongly influenced by distance, channel depth, and other factors, annotations of such information are often non-uniform, incomplete, and hard to use. In this work, we propose Underwater Acoustic Recognition based on Templates (hereinafter "UART"), where the templates are built from rich relevant information. We design the templates to integrate relevant information from different perspectives into descriptive natural language. UART adopts an audio-spectrogram-text tri-modal contrastive learning framework, which allows descriptive natural language to guide the learning of acoustic representations. Our experiments show that UART achieves better recognition capability and generalization performance than traditional paradigms. Furthermore, the pre-trained UART model can provide superior prior knowledge for recognition models in scenarios without any auxiliary annotation.
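To make the template idea concrete, the following is a minimal sketch of how incomplete annotations might be turned into a descriptive sentence. It is an illustration only: the function name `build_description`, the field names (`distance_m`, `depth_m`, `location`), and the wording of each phrase are assumptions, not the templates actually used by UART.

```python
from typing import Mapping, Optional


def build_description(category: str, annotations: Mapping[str, Optional[object]]) -> str:
    """Render whatever annotations are available into one descriptive sentence."""
    # Hypothetical field names and phrasings, chosen for illustration only.
    phrases = {
        "distance_m": "recorded at a distance of about {} meters",
        "depth_m": "in a channel about {} meters deep",
        "location": "near {}",
    }
    # Keep only the fields that are present and not None, so missing or
    # incomplete annotations simply shorten the sentence.
    parts = [
        phrase.format(annotations[key])
        for key, phrase in phrases.items()
        if annotations.get(key) is not None
    ]
    sentence = f"The sound of a {category}"
    if parts:
        sentence += " " + ", ".join(parts)
    return sentence + "."


if __name__ == "__main__":
    print(build_description("cargo ship", {"distance_m": 500, "depth_m": None}))
    print(build_description("tug", {}))
```

Because absent fields are skipped rather than filled with placeholders, the same function can handle recordings with full, partial, or no auxiliary annotation. In a setup like the one described, such sentences would then be passed to a text encoder.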