The Bias Benchmark for Question Answering (BBQ) dataset enables evaluation of the social biases that language models (LMs) exhibit in downstream tasks. However, adapting BBQ to languages other than English is challenging because social biases are culturally dependent. In this paper, we devise a process for constructing a non-English bias benchmark dataset by leveraging the English BBQ dataset in a culturally adaptive way, and we present the KoBBQ dataset for evaluating biases in Question Answering (QA) tasks in Korean. We categorize the samples from BBQ into three classes: Simply-Translated (can be used directly after cultural translation), Target-Modified (requires localization of target groups), and Sample-Removed (does not fit Korean culture). We further enhance the relevance to Korean culture by adding four new bias categories specific to Korean culture and by creating new samples based on Korean literature. KoBBQ consists of 246 templates and 4,740 samples across 12 categories of social bias. Using KoBBQ, we measure the accuracy and bias scores of several state-of-the-art multilingual LMs. We demonstrate differences in LM bias between Korean and English, underscoring the need for hand-crafted data that accounts for cultural differences.