Large language models (LLMs) are commonly used for long-form question answering, which requires them to generate paragraph-length answers to complex questions. While long-form QA has been well-studied in English via many different datasets and evaluation metrics, this research has not been extended to cover most other languages. To bridge this gap, we introduce CaLMQA, a collection of 2.6K complex questions spanning 23 languages, including under-resourced, rarely-studied languages such as Fijian and Kirundi. Our dataset includes both naturally occurring questions collected from community web forums and questions written by native speakers whom we hire for this purpose. Our process yields diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers. We conduct automatic evaluation across a suite of open- and closed-source models using our novel metric CaLMScore, which detects incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. We perform human evaluation on a subset of models and see that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in LLM multilingual capabilities and non-English LFQA evaluation.
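The abstract describes CaLMScore only at a high level: it flags answers that are not in the expected language and answers with degenerate token repetition. The sketch below illustrates one plausible way to implement those two checks; the scoring rule, the repetition statistic, the threshold, and the use of the langdetect library are illustrative assumptions, not the paper's actual metric.

```python
# Minimal sketch of the two checks the abstract attributes to CaLMScore:
# (1) wrong-language detection and (2) token repetition. All specifics
# (n-gram size, threshold, binary scoring) are assumptions for illustration.
from collections import Counter

from langdetect import detect  # off-the-shelf language identifier (assumption)


def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-gram occurrences that are repeats (0.0 = no repetition)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def calmscore_sketch(answer: str, expected_lang: str,
                     rep_threshold: float = 0.2) -> float:
    """Return 1.0 if the answer passes both checks, else 0.0 (assumed scoring)."""
    try:
        lang_ok = detect(answer) == expected_lang
    except Exception:  # detection can fail on empty or very short text
        lang_ok = False
    rep_ok = repetition_ratio(answer) < rep_threshold
    return float(lang_ok and rep_ok)


# Example: an answer written in English to a Fijian ("fj") question
# would fail the language check and receive a score of 0.0.
print(calmscore_sketch("This answer is in the wrong language entirely.", "fj"))
```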