Underwater acoustic target recognition is a challenging task owing to complex acoustic source characteristics and sound propagation patterns. Constrained by insufficient data and a narrow information perspective, recognition models based on deep learning remain far from satisfactory in practical underwater scenarios. Although underwater acoustic signals are strongly influenced by distance, channel depth, and other factors, annotations of such information are often non-uniform, incomplete, and hard to use. In this work, we propose Underwater Acoustic Recognition based on Templates made up of rich relevant information (hereinafter "UART"). We design templates that integrate relevant information from different perspectives into descriptive natural language. UART adopts an audio-spectrogram-text tri-modal contrastive learning framework, which allows descriptive natural language to guide the learning of acoustic representations. Our experiments show that UART achieves better recognition capability and generalization performance than traditional paradigms. Furthermore, the pre-trained UART model can provide superior prior knowledge for recognition models in scenarios without any auxiliary annotation.
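To make the two ideas in the abstract concrete (templates that turn partial annotations into descriptive text, and contrastive alignment across audio, spectrogram, and text), the following is a minimal illustrative sketch in plain Python. It is not the UART implementation: the template wording, the field names `distance_m` and `depth_m`, the temperature value, and the choice to average three pairwise symmetric InfoNCE losses are all assumptions made for illustration, and the embeddings are assumed to come from encoders that are not shown.

```python
import math


def build_description(label, distance_m=None, depth_m=None):
    """Fill a hypothetical template, skipping fields with no annotation."""
    parts = [f"the sound of a {label}"]
    if distance_m is not None:
        parts.append(f"recorded at a distance of {distance_m} m")
    if depth_m is not None:
        parts.append(f"in a channel {depth_m} m deep")
    return ", ".join(parts)


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    # Cosine similarity of every a[i] with every b[j], scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def tri_modal_loss(audio, spec, text, temperature=0.07):
    """Average of the three pairwise losses (an assumed combination rule)."""
    return (contrastive_loss(audio, text, temperature)
            + contrastive_loss(spec, text, temperature)
            + contrastive_loss(audio, spec, temperature)) / 3.0
```

In this sketch, a recording with only a distance annotation still yields a usable description, for example `build_description("cargo ship", distance_m=500)`, which is one simple way to cope with incomplete annotations. The loss is small when the i-th audio, spectrogram, and text embeddings point in the same direction and grows when the pairs are mismatched.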