As large language models (LLMs) see increasing adoption across the globe, it is imperative that they be representative of the world's linguistic diversity. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench, the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks such as cross-lingual summarization, machine translation, and cross-lingual question answering. It extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA, on IndicGenBench in a variety of settings. The largest PaLM-2 models perform the best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed to develop more inclusive multilingual language models. IndicGenBench is released at www.github.com/google-research-datasets/indic-gen-bench