The dominant paradigm for Audio-Text Retrieval (ATR) relies on dual-encoder architectures optimized via mini-batch contrastive learning. However, restricting optimization to in-batch samples creates a fundamental limitation that we term the Gradient Locality Bottleneck (GLB): it prevents the model from resolving acoustic ambiguities and hinders the learning of rare long-tail concepts. External knowledge injection can break this bottleneck, but it often triggers a second problem that we term Representation-Drift Mismatch (RDM), in which a static knowledge base becomes misaligned with the evolving encoders and its guidance degrades into noise. To address these intertwined challenges, we propose the Adaptive Self-improving Knowledge (ASK) framework. ASK breaks the GLB via multi-grained knowledge injection and mitigates RDM through a dynamic refinement strategy that keeps the knowledge base synchronized with the model. In addition, ASK employs an adaptive reliability weighting scheme that filters retrieval noise based on cross-modal consistency. Extensive experiments on multiple benchmarks demonstrate that ASK consistently achieves new state-of-the-art performance with various backbones.
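To make the mini-batch setting concrete, the following is a minimal sketch of a symmetric in-batch contrastive loss of the kind commonly used to train dual encoders. It is an illustration under stated assumptions, not the ASK implementation: the function names, the temperature value of 0.07, and the use of NumPy arrays in place of encoder outputs are all hypothetical. The point it shows is that every negative in the loss comes from the current batch, which is the locality the abstract refers to.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def in_batch_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over one mini-batch (illustrative only).

    audio_emb, text_emb: arrays of shape (batch, dim); row i of each is a
    matched audio-text pair. The temperature is an assumed value.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix: the only negatives available
    # are the other items of this same batch.
    logits = a @ t.T / temperature
    idx = np.arange(len(a))

    # Audio-to-text: each row should put its mass on the diagonal entry.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: the same along columns.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

Because the similarity matrix has shape (batch, batch), a sample outside the batch contributes nothing to the gradient of that step, however similar it is to an in-batch item.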