Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge, resulting in factually incorrect outputs or hallucinations. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach. In contrast to costly human-annotated resources or unreliable synthetic alternatives, our approach systematically produces high-quality, diverse, and legally accurate question-answer pairs directly from authoritative German statutes. Using rigorous automated filtering methods and parameter-efficient fine-tuning techniques, we demonstrate that LLMs adapted with our synthetic dataset significantly outperform their baseline counterparts on German legal question answering tasks. Our results highlight the feasibility of using carefully designed synthetic data as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains.