Productive interactions between diverse users and language technologies require the latter's outputs to be culturally relevant and sensitive. Prior work has evaluated models' knowledge of cultural norms, values, and artifacts without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks: open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary as nationality varies and feature culturally relevant words, we also find only weak correlations between the text similarity of outputs for different countries and the cultural values of those countries. Finally, we discuss important considerations in designing comprehensive evaluations of cultural competence in user-facing tasks.