ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations

Question answering (QA) over regulatory documents is challenging because it requires multihop reasoning across legally interdependent texts. The difficulty is especially pronounced in healthcare, where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, particularly for low-resource languages such as Vietnamese, because benchmark datasets that explicitly support multihop reasoning over healthcare regulations are lacking. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of question-answer pairs that require reasoning across multiple regulations and that capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline: semantic clustering and graph-inspired data mining select related regulations, and a large language model then generates the pairs together with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion, so that answers are legally valid and coherent. Experimental results show that ViHERMES is a challenging benchmark for multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at https://github.com/ura-hcmut/ViHERMES.
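
To make the idea of graph-aware context expansion concrete, the following is a minimal sketch and not the ViHERMES implementation. It assumes legal units are graph nodes linked by typed relations. The unit identifiers, the relation names (`amends`, `refers_to`, `part_of`), and the functions `build_adjacency` and `expand_context` are all hypothetical and chosen only for illustration. Starting from units returned by a retriever, it follows a chosen set of relation types for a bounded number of hops.

```python
from collections import deque

# Hypothetical toy graph: each edge is (source_unit, relation, target_unit).
# Identifiers and relation names are illustrative, not taken from the dataset.
EDGES = [
    ("Circular-B/Art.2", "amends", "Circular-A/Art.5"),
    ("Circular-A/Art.5", "refers_to", "Decree-C/Art.1"),
    ("Circular-A/Art.5", "part_of", "Circular-A"),
]


def build_adjacency(edges):
    """Index every edge in both directions; inverse relations get a '^-1' suffix."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))
        adj.setdefault(dst, []).append((rel + "^-1", src))
    return adj


def expand_context(seeds, adj, allowed, max_hops=1):
    """Breadth-first expansion from seed units over the allowed relation types."""
    seen = dict.fromkeys(seeds, 0)  # unit -> hop distance, in discovery order
    queue = deque(seeds)
    while queue:
        unit = queue.popleft()
        if seen[unit] >= max_hops:
            continue  # hop budget used up for this unit
        for rel, neighbor in adj.get(unit, []):
            if rel in allowed and neighbor not in seen:
                seen[neighbor] = seen[unit] + 1
                queue.append(neighbor)
    return list(seen)


adjacency = build_adjacency(EDGES)
# Pull in units that amend the seed article and units that it refers to.
context = expand_context(
    ["Circular-A/Art.5"], adjacency, allowed={"amends^-1", "refers_to"}
)
print(context)
```

Restricting expansion to named relation types, rather than taking all neighbors, is one way to keep the added context tied to a stated legal dependency. In this toy graph the amending article and the referenced article are added, while the parent document is not, because `part_of` is not in the allowed set.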
