Existing research on audio classification struggles to recognize vessel attributes in passive underwater scenarios, and well-annotated datasets are scarce because of data privacy concerns. In this study, we introduce CLAPP (Contrastive Language-Audio Pre-training in Passive Underwater Vessel Classification), a novel model that trains a neural network on a wide range of paired vessel audio and vessel state text obtained from an oceanship dataset. CLAPP learns directly from raw vessel audio and, when available, from carefully curated labels, which improves recognition of vessel attributes in passive underwater scenarios. The model's zero-shot capability allows it to predict the most relevant vessel state description for a given vessel audio clip without being directly optimized for that task. Our approach addresses two challenges: vessel audio-text classification and passive underwater vessel audio attribute recognition. The proposed method achieves new state-of-the-art results on both the DeepShip and ShipsEar public datasets, improving accuracy on the zero-shot task by about 7% to 13% over prior methods.
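The zero-shot prediction described in the abstract can be illustrated with a minimal sketch. It assumes, in the usual style of contrastive language-audio models, that an audio encoder and a text encoder map their inputs into a shared embedding space and that the prediction is the description with the highest cosine similarity to the audio embedding. The function names, the temperature value of 0.07, the toy three-dimensional embeddings, and the example descriptions are all hypothetical and stand in for real encoder outputs; this is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_predict(audio_emb, text_embs, temperature=0.07):
    """Pick the text description most similar to one audio embedding.

    audio_emb: shape (d,), output of a (hypothetical) audio encoder.
    text_embs: shape (n, d), one row per candidate description.
    temperature: assumed softmax temperature; not taken from the paper.
    Returns the index of the best description and the softmax probabilities.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_embs, dtype=np.float64))
    logits = (t @ a) / temperature      # scaled cosine similarities, shape (n,)
    logits = logits - logits.max()      # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    probs = exp_logits / exp_logits.sum()
    return int(np.argmax(probs)), probs


# Hypothetical candidate descriptions (not from the paper's datasets).
descriptions = [
    "the sound of a cargo ship",
    "the sound of a tugboat",
    "the sound of a passenger ferry",
]

# Toy embeddings standing in for real text and audio encoder outputs.
text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
])
audio_emb = np.array([0.1, 0.9, 0.3])

best, probs = zero_shot_predict(audio_emb, text_embs)
print(descriptions[best])
```

No classifier is trained for the candidate set here: changing the list of descriptions changes the label space, which is what makes the prediction zero-shot.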