This study introduces a benchmark for evaluating how well large language models (LLMs) understand and process cultural knowledge, using Hakka culture as a case study. Drawing on Bloom's Taxonomy, it develops a multi-dimensional framework that assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Unlike traditional single-dimensional evaluations, the benchmark examines how LLMs handle culturally specific content at every level, from basic recall of facts to higher-order tasks such as creative synthesis. The study also integrates Retrieval-Augmented Generation (RAG) to address the challenges of representing minority cultural knowledge in LLMs, and demonstrates how RAG improves model performance by dynamically incorporating relevant external information. The results show that RAG improves accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal limitations of RAG in creative tasks, underscoring the need for further optimization. The benchmark provides a tool for evaluating and comparing LLMs in culturally diverse contexts and offers insights for future research on AI-driven cultural knowledge preservation and dissemination.