Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior work has evaluated models' knowledge of cultural norms, values, and artifacts without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks: open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary across nationalities and feature culturally relevant words, we also find only weak correlations between the textual similarity of outputs for different countries and the cultural values of those countries. Finally, we discuss important considerations in designing comprehensive evaluations of cultural competence in user-facing tasks.