In this paper, we present our submission to the x-to-audio alignment (XACLE) challenge, whose goal is to predict the semantic alignment of a given pair of general audio and text. The proposed system is based on a large audio language model (LALM) architecture and is trained with a three-stage pipeline: pretraining on automated audio captioning, pretraining with CLAP pseudo-labels, and fine-tuning on the XACLE dataset. Our experiments show that pretraining with CLAP pseudo-labels is the primary driver of performance. On the XACLE test set, our system reaches a Spearman rank correlation coefficient (SRCC) of 0.632, substantially outperforming the baseline system (0.334) and securing third place in the challenge's team ranking. Code and models are available at https://github.com/shiotalab-tmu/tmu-xacle2026.