Developing culturally grounded multilingual AI systems remains challenging, particularly for low-resource languages. While synthetic data offers promise, its effectiveness in multilingual and multicultural contexts is underexplored. We investigate bottom-up synthetic data generation using large open-source LLMs (>= 235B parameters) grounded in language-specific Wikipedia content, complementing the dominant top-down approach of translating from English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages and English, spanning diverse reasoning and generative tasks. A comprehensive evaluation using automated metrics and 10K human assessments confirms the high quality of the data. Downstream evaluations, in which we fine-tune models on various datasets and assess performance across 13 diverse multilingual benchmarks alongside comparative model evaluations, demonstrate that models trained on Updesh consistently achieve significant improvements on both NLU and NLG tasks. Finally, through ablation studies and cultural evaluations, we show that context-aware, culturally grounded data generation is essential for effective multilingual AI development.