Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.