Large language models (LLMs) are used for long-form question answering (LFQA), which requires them to generate paragraph-length answers to complex questions. While LFQA has been well studied in English, this research has not been extended to other languages. To bridge this gap, we introduce CaLMQA, a collection of 1.5K complex culturally specific questions spanning 23 languages and 51 culturally agnostic questions translated from English into 22 other languages. We define culturally specific questions as those uniquely or more likely to be asked by people from cultures associated with the question's language. We collect naturally occurring questions from community web forums and hire native speakers to write questions to cover under-resourced, rarely studied languages such as Fijian and Kirundi. Our dataset contains diverse, complex questions that reflect cultural topics (e.g., traditions, laws, news) and the language usage of native speakers. We automatically evaluate a suite of open- and closed-source models on CaLMQA by detecting incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. Finally, human evaluation on a subset of models and languages reveals that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in non-English LFQA and provide an evaluation framework for this setting.
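The abstract names two automatic checks, token repetition and incorrect answer language, without specifying how they are implemented. The Python sketch below illustrates one plausible form of each check under stated assumptions; it is not the authors' method. The function names, the 4-gram window, the repetition threshold, and the script-based language heuristic are all hypothetical choices made for illustration. In particular, a Unicode-script check can only flag answers written in the wrong writing system; it cannot separate languages that share a script (for example Fijian and English), which would need a proper language identifier.

```python
from collections import Counter
import unicodedata


def max_ngram_repeats(text: str, n: int = 4) -> int:
    """Return the highest count of any whitespace-token n-gram in text.

    Whitespace tokenization is an assumption; it is a poor fit for
    languages written without spaces.
    """
    tokens = text.split()
    if len(tokens) < n:
        return 0
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(ngrams.values())


def has_token_repetition(text: str, n: int = 4, threshold: int = 3) -> bool:
    """Flag an answer if some n-gram occurs at least `threshold` times.

    Both `n` and `threshold` are illustrative defaults, not values
    taken from the paper.
    """
    return max_ngram_repeats(text, n) >= threshold


def script_share(text: str, script_prefix: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with
    `script_prefix` (e.g. "LATIN", "CYRILLIC", "CJK", "HANGUL")."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


def looks_like_wrong_script(text: str, expected_script: str,
                            min_share: float = 0.5) -> bool:
    """Crude stand-in for language detection: flag an answer when fewer
    than `min_share` of its letters belong to the expected script."""
    return script_share(text, expected_script) < min_share
```

A caller would run both checks on each generated answer, for instance `has_token_repetition(answer)` and `looks_like_wrong_script(answer, "CYRILLIC")` for a Russian question, and report the fraction of flagged answers per language and model.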