Classical Chinese is a gateway to the rich heritage and wisdom of ancient China, yet its complexities pose formidable comprehension barriers for most modern readers without specialized training. While Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), they struggle with Classical Chinese Understanding (CCU), especially in data-demanding and knowledge-intensive tasks. To address this dilemma, we propose \textbf{TongGu} (meaning ``understanding ancient and modern''), the first CCU-specific LLM, underpinned by three core contributions. First, we construct ACCN-INS, a two-stage instruction-tuning dataset derived from rich classical Chinese corpora, aiming to unlock the full CCU potential of LLMs. Second, we propose Redundancy-Aware Tuning (RAT) to prevent catastrophic forgetting, enabling TongGu to acquire new capabilities while preserving its foundational knowledge. Third, we present a CCU-specific Retrieval-Augmented Generation technique (CCU-RAG) to reduce hallucination through knowledge grounding. Extensive experiments across 24 diverse CCU tasks validate TongGu's superior capabilities, underscoring the effectiveness of RAT and CCU-RAG. The model and dataset are available at \url{https://github.com/SCUT-DLVCLab/TongGu-LLM}.
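To make the retrieval-augmented setup concrete, the following is a minimal illustrative sketch of a generic retrieve-then-generate pipeline in the spirit of CCU-RAG. It is not the authors' implementation: the character-overlap retriever, the prompt layout, and the \texttt{generate} stub are assumptions introduced here purely for illustration.

\begin{verbatim}
# Minimal retrieve-then-generate sketch (illustrative only, NOT the
# authors' CCU-RAG implementation). The toy character-overlap retriever,
# the prompt format, and the `generate` stub are all assumptions.
from typing import Callable

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus passages by character overlap with the query
    (a deliberately simple stand-in for a real retriever)."""
    def overlap(passage: str) -> int:
        return len(set(query) & set(passage))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Prepend retrieved passages so the model's answer is grounded
    in source text rather than parametric memory."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Reference passages:\n{context}\n\nQuestion: {query}\nAnswer:"

def answer(query: str, corpus: list[str],
           generate: Callable[[str], str]) -> str:
    """`generate` stands in for any instruction-tuned LLM call."""
    return generate(build_prompt(query, retrieve(query, corpus)))
\end{verbatim}

Grounding the generation step in retrieved passages in this way is the standard rationale for reducing hallucination in knowledge-intensive tasks, which is the role CCU-RAG plays for TongGu.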