Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialectal varieties remain underrepresented due to limited data and high linguistic variability. In this work, we adapt a pre-trained LLM to improve its dialectal performance. Specifically, we combine Low-Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, adapter merging, and dialect-aware Minimum Bayes Risk (MBR) decoding to improve dialect-faithful generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that adapter merging and MBR decoding improve dialectal fidelity while preserving semantic accuracy. Together, these components provide a compact and effective framework for robust dialectal Arabic generation.
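To make the pipeline concrete, below is a minimal sketch of the two inference-time components, adapter merging and MBR-based candidate selection, using Hugging Face transformers and peft. The checkpoint paths, adapter names, mixing weights, and sampling settings are illustrative assumptions, and chrF is used as a stand-in utility; the dialect-aware utility described in the abstract is not reproduced here.

```python
# Minimal sketch: merge two LoRA adapters and select outputs with MBR decoding.
# All paths, adapter names, and hyperparameters are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from sacrebleu import sentence_chrf

BASE = "base-llm"                      # hypothetical base checkpoint
MONO_ADAPTER = "lora-dialect-mono"     # LoRA trained on monolingual dialect data
PARA_ADAPTER = "lora-en-dialect-para"  # LoRA trained on English-dialect parallel data

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)

# Load both adapters and merge them into a single adapter
# via a linear combination of the LoRA weights.
model = PeftModel.from_pretrained(model, MONO_ADAPTER, adapter_name="mono")
model.load_adapter(PARA_ADAPTER, adapter_name="para")
model.add_weighted_adapter(
    adapters=["mono", "para"], weights=[0.5, 0.5],
    adapter_name="merged", combination_type="linear",
)
model.set_adapter("merged")

def mbr_decode(prompt: str, num_candidates: int = 8, max_new_tokens: int = 128) -> str:
    """Sample candidates and return the one with the highest average
    pairwise utility, with chrF as a stand-in utility function."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, do_sample=True, top_p=0.9, temperature=0.8,
        num_return_sequences=num_candidates, max_new_tokens=max_new_tokens,
    )
    candidates = [
        tokenizer.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        for o in outputs
    ]

    # MBR: score each candidate against the others acting as pseudo-references.
    def expected_utility(hyp: str) -> float:
        others = [c for c in candidates if c is not hyp]
        return sum(sentence_chrf(hyp, [ref]).score for ref in others) / max(len(others), 1)

    return max(candidates, key=expected_utility)
```

In this sketch the merged adapter is a simple uniform mixture; in practice the weights, the combination type, and the MBR utility would be tuned per dialect.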