While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.
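To make the idea of combining a contrastive similarity with an LALM judgment concrete, the following is a minimal sketch and not the CAF-Score formulation itself, which the abstract does not specify. It assumes that audio and caption embeddings are already available as plain float vectors, that the LALM judgment has been reduced to a scalar in [0, 1], and that the two signals are merged by a simple convex combination. The names `cosine_similarity`, `blended_score`, and `weight` are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def blended_score(
    audio_emb: Sequence[float],
    text_emb: Sequence[float],
    lalm_score: float,
    weight: float = 0.5,
) -> float:
    """Hypothetical blend of a contrastive similarity and an LALM score.

    ASSUMPTION: a convex combination is used purely for illustration;
    the source does not state how the two signals are calibrated.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if not 0.0 <= lalm_score <= 1.0:
        raise ValueError("lalm_score must lie in [0, 1]")
    # Map cosine similarity from [-1, 1] to [0, 1] so both terms share a scale.
    similarity = (cosine_similarity(audio_emb, text_emb) + 1.0) / 2.0
    return weight * similarity + (1.0 - weight) * lalm_score
```

Under these assumptions, a caption whose embedding aligns well with the audio but which the LALM rates poorly (for example, because of a hallucinated detail) receives a lower blended score than the contrastive similarity alone would give, which is the kind of correction the abstract describes.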