Text-to-audio retrieval has advanced considerably with shared embedding models such as CLAP and Pengi, yet these models often struggle with fine-grained semantic alignment because of the inherent modality gap between text and audio. In this work, we propose FORTE, a unified framework that integrates structured logical reasoning with parameter-efficient cross-modal alignment to improve retrieval precision. Our approach first transforms queries into first-order logic and refines them via a constrained search that keeps the query's semantics invariant while introducing discriminative attributes. The refined representation is then aligned with audio embeddings using a lightweight projection module, followed by a predicate-aware re-ranking step that enforces logical consistency at inference. Experiments on AudioCaps and Clotho show consistent improvements over strong baselines, particularly in challenging fine-grained scenarios. These results highlight the effectiveness of combining symbolic reasoning with representation learning for cross-modal retrieval.
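To make the predicate-aware re-ranking idea concrete, the following is a minimal sketch and not the FORTE implementation. It assumes that each retrieved clip already carries a set of predicate tags and that the final score is a convex combination of embedding similarity and the fraction of query predicates the clip satisfies. The names `Candidate`, `predicate_coverage`, `rerank`, the weight `alpha`, and the predicate strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    audio_id: str
    similarity: float      # first-stage embedding similarity (assumed given)
    predicates: frozenset  # hypothetical predicate tags attached to the clip


def predicate_coverage(query_predicates, candidate_predicates):
    """Fraction of query predicates that the candidate satisfies."""
    if not query_predicates:
        return 1.0  # an empty query imposes no logical constraint
    matched = query_predicates & candidate_predicates
    return len(matched) / len(query_predicates)


def rerank(query_predicates, candidates, alpha=0.5):
    """Sort candidates by a blend of similarity and predicate coverage.

    alpha is an assumed mixing weight: 1.0 ignores predicates entirely,
    0.0 ranks by predicate coverage alone.
    """
    def score(c):
        coverage = predicate_coverage(query_predicates, c.predicates)
        return alpha * c.similarity + (1 - alpha) * coverage

    return sorted(candidates, key=score, reverse=True)


# Toy query: "a dog barks while it rains", written as illustrative predicates.
query = frozenset({"dog(x)", "bark(x)", "rain(y)"})
candidates = [
    Candidate("clip_a", 0.82, frozenset({"dog(x)", "bark(x)"})),
    Candidate("clip_b", 0.78, frozenset({"dog(x)", "bark(x)", "rain(y)"})),
    Candidate("clip_c", 0.85, frozenset({"car(x)"})),
]
ranked = rerank(query, candidates)
print([c.audio_id for c in ranked])  # ['clip_b', 'clip_a', 'clip_c']
```

In this toy example, `clip_c` has the highest raw similarity but satisfies none of the query predicates, so it drops to last place, while `clip_b` satisfies all three and moves to the top. How predicates are extracted from audio and how the two terms are actually combined are left open here.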