Audio large language models (LLMs) enable unified speech understanding and generation, yet their adaptation to linguistically complex, dialect-rich settings remains underexplored. This paper presents the first systematic study of multi-task instruction tuning for an Arabic-centric audio LLM, covering a hierarchy of generative tasks (ASR, speech summarization) and discriminative tasks (dialect and emotion identification). To support this study, we introduce AraMega-SSum, a novel dataset for Arabic speech summarization. We fine-tune Qwen2.5-Omni (7B) and propose a Task-Progressive Curriculum (TPC) along with Aligner-Based Diverse Sampling (ADS), a strategy that constructs information-dense batches by selecting task- and label-balanced examples. Our results reveal a critical efficiency-robustness trade-off: while ADS accelerates initial convergence and boosts paralinguistic F1-scores, its inherent gradient volatility can destabilize generative decoding under prolonged training. Furthermore, while TPC stabilizes core acoustic mapping, it often induces negative transfer in downstream tasks. We demonstrate that a hybrid TPC+ADS strategy provides an optimal training ``recipe'', first establishing a robust representational foundation before employing diversity-aware refinement to capture fine-grained nuances. These findings offer practical guidance for the efficient adaptation of Omni-models in complex, low-resource multimodal environments.