Large language models (LLMs) have rapidly become the foundation of a wide range of natural language processing (NLP) applications. Despite their broad use, their understanding of culturally related concepts and their cultural reasoning remain limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially for underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally related instruction tuning datasets from vast unstructured corpora. We employ a self-instruction generation pipeline to identify cultural concepts and to trigger instruction generation. By combining the extracted data with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby strengthening its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvements of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.
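To make the described pipeline concrete, the following is a minimal, purely illustrative sketch of a self-instruct-style extraction loop: it assumes a generic LLM completion helper `generate(prompt)` and a list of corpus documents, and all function names and prompts are hypothetical rather than the paper's actual prompts, filtering, or integration procedure.

```python
# Hypothetical sketch of a self-instruct-style cultural instruction extraction pipeline.
# `generate(prompt)` is a placeholder for any LLM completion call; it is not from the paper.
from typing import Dict, List


def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g., an API or a local model)."""
    raise NotImplementedError


def extract_cultural_concepts(document: str, region: str) -> List[str]:
    """Ask the model to list culture-specific concepts mentioned in a document."""
    prompt = (
        f"List the {region}-specific cultural concepts (foods, customs, slang, "
        f"institutions) mentioned in the text below, one per line.\n\n{document}"
    )
    return [line.strip() for line in generate(prompt).splitlines() if line.strip()]


def build_instruction_pair(concept: str, region: str) -> Dict[str, str]:
    """Trigger instruction generation: turn a cultural concept into an instruction/response pair."""
    instruction = generate(
        f"Write a question that tests understanding of '{concept}' in a {region} context."
    )
    response = generate(f"Answer the following question accurately:\n{instruction}")
    return {"instruction": instruction, "output": response}


def build_dataset(corpus: List[str], region: str) -> List[Dict[str, str]]:
    """End-to-end: corpus documents -> cultural concepts -> instruction tuning examples."""
    dataset = []
    for doc in corpus:
        for concept in extract_cultural_concepts(doc, region):
            dataset.append(build_instruction_pair(concept, region))
    return dataset
```

In practice, the resulting culturally grounded examples would be mixed with a general-purpose instruction tuning dataset before fine-tuning, as the abstract describes; quality filtering and deduplication steps are omitted here.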