To enhance language models' cultural awareness, we design a generalizable pipeline for constructing cultural knowledge bases from different online communities at massive scale. Using this pipeline, we construct CultureBank, a knowledge base built upon users' self-narratives, with 12K cultural descriptors sourced from TikTok and 11K from Reddit. Unlike previous cultural knowledge resources, CultureBank contains diverse views on cultural descriptors to allow flexible interpretation of cultural knowledge, and contextualized cultural scenarios to support grounded evaluation. With CultureBank, we evaluate different LLMs' cultural awareness and identify areas for improvement. We also fine-tune a language model on CultureBank: experiments show that it achieves better performance on two downstream cultural tasks in a zero-shot setting. Finally, we offer recommendations, based on our findings, for future culturally aware language technologies. The project page is https://culturebank.github.io . The code and model are at https://github.com/SALT-NLP/CultureBank . The released CultureBank dataset is at https://huggingface.co/datasets/SALT-NLP/CultureBank .