Large language models (LLMs) have rapidly become the foundation of many natural language processing (NLP) applications. Despite their wide range of use cases, their understanding of culturally related concepts and their ability to reason about them remain limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially for underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally related instruction tuning datasets from vast unstructured corpora. We use a self-instruction generation pipeline to identify cultural concepts and trigger instruction generation. When this data is combined with a general-purpose instruction tuning dataset, the resulting model shows improved recognition and understanding of regional cultural nuances, which in turn strengthens its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvements of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future work in the field.