Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody because specialized corpora are scarce. We present TW-Sound580K, a Taiwanese audio-text instruction dataset built with a Verify-Generate-Critique (VGC) protocol. The pipeline uses Dual-ASR validation to filter 522K raw clips, then expands them into 580K high-fidelity instruction pairs using a teacher model. We demonstrate the dataset's utility with Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and applies a dynamic Dual-ASR Arbitration strategy to select the transcription used at inference time. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, a 6.5 percentage-point absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). These results indicate that combining regional corpora with rigorous curation and dynamic arbitration substantially improves LALM performance on localized speech.
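The abstract does not specify how Dual-ASR validation decides which clips to keep. As a minimal sketch of one plausible reading, the Python below keeps a clip only when two independent ASR transcripts agree closely, measured by a normalized character-level edit distance. The names `Clip`, `disagreement`, and `filter_clips`, and the 0.2 threshold, are hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement(t1: str, t2: str) -> float:
    """Edit distance normalized by the longer transcript, in [0, 1]."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        # Assumption: two empty transcripts are treated as unusable.
        return 1.0
    return edit_distance(t1, t2) / longest


@dataclass
class Clip:
    clip_id: str
    asr_a: str  # transcript from the first ASR system (hypothetical field)
    asr_b: str  # transcript from the second ASR system (hypothetical field)


def filter_clips(clips, max_disagreement=0.2):
    """Keep clips whose two transcripts disagree by at most the threshold.

    The 0.2 default is an illustrative assumption, not a value from the paper.
    """
    return [c for c in clips
            if disagreement(c.asr_a, c.asr_b) <= max_disagreement]


if __name__ == "__main__":
    clips = [
        Clip("a", "今天天氣很好", "今天天氣很好"),
        Clip("b", "今天天氣很好", "完全不同的句子"),
    ]
    print([c.clip_id for c in filter_clips(clips)])
```

A character-level distance is used here because Mandarin and Taiwanese Hokkien transcripts have no whitespace word boundaries; a real pipeline might instead compare at the syllable or romanization level.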