Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, without any labelled audio. This matching breaks down under acoustic noise: on standard benchmarks, accuracy and mAP fall by 12 to 30 percentage points at 0 dB SNR. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the reported metric in every test condition, by +2.60 to +5.75 accuracy points on UrbanSound8K and by +1.50 to +1.74 mAP points on FSD50K.
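The scoring rule described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the drift direction of a class is taken to be the normalized difference between the mean of its noise-conditioned prompt embeddings and its clean prompt embedding, and the bonus is a weight `lam` times the inner product of the audio embedding with that cached direction. The names `drift_directions`, `das_scores`, and `lam` are hypothetical, and the toy 3-D vectors stand in for real CLAP embeddings.

```python
import math


def normalize(v):
    """Return v scaled to unit L2 norm (unchanged if it is the zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def drift_directions(clean_text, noisy_text):
    """Assumed form of the text-only drift direction, one per class.

    clean_text[c]: embedding of the clean prompt for class c.
    noisy_text[c]: list of embeddings of noise-conditioned prompts for class c.
    The result depends on text alone, so it can be computed once and cached.
    """
    dirs = []
    for clean, noisy_list in zip(clean_text, noisy_text):
        clean_u = normalize(clean)
        noisy_u = [normalize(e) for e in noisy_list]
        mean_noisy = [sum(col) / len(noisy_u) for col in zip(*noisy_u)]
        dirs.append(normalize([m - c for m, c in zip(mean_noisy, clean_u)]))
    return dirs


def das_scores(audio, clean_text, dirs, lam=0.1):
    """Cosine score plus an assumed drift bonus: one extra inner product per class."""
    a = normalize(audio)
    return [
        dot(a, normalize(t)) + lam * dot(a, d)
        for t, d in zip(clean_text, dirs)
    ]


# Toy example: two classes whose clean prompts tie on cosine similarity.
clean = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
noisy = [[[1.0, 0.0, 1.0]], [[0.0, 1.0, -1.0]]]
audio = [1.0, 1.0, 0.5]

dirs = drift_directions(clean, noisy)        # cached, text-only
print(das_scores(audio, clean, dirs, lam=0.0))  # plain cosine scores
print(das_scores(audio, clean, dirs, lam=0.1))  # cosine plus drift bonus
```

In this toy setup the two classes are indistinguishable by cosine score alone, and the bonus breaks the tie in favour of the class whose predicted drift direction agrees with the audio embedding. With a real backbone, the embeddings would come from the CLAP text and audio encoders, and the weight `lam` would need to be chosen by whatever procedure the paper specifies.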