In recent years, Large Language Models (LLMs) have shown great potential across a wide range of legal tasks. Despite these advances, mitigating hallucination remains a significant challenge, with state-of-the-art LLMs still frequently generating incorrect legal references. In this paper, we focus on the problem of legal citation prediction within the Australian law context, where correctly identifying and citing relevant legislation or precedents is critical. We compare several approaches: prompting general-purpose and law-specialised LLMs, retrieval-only pipelines with both generic and domain-specific embeddings, task-specific instruction tuning of LLMs, and hybrid strategies that combine LLMs with retrieval augmentation, query expansion, or voting ensembles. Our findings indicate that domain-specific pre-training alone is insufficient for achieving satisfactory citation accuracy, even in law-specialised models. In contrast, instruction tuning on our task-specific dataset dramatically boosts performance, achieving the best results across all settings. We also find that database granularity and the type of embeddings play a critical role in the performance of retrieval systems. Among retrieval-based approaches, hybrid methods consistently outperform retrieval-only setups, and among these, ensemble voting delivers the best results by combining the predictive quality of instruction-tuned LLMs with the retrieval system.