Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but making them reliable enough for expert use remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). To adapt the system to the policy domain, we construct synthetic queries and collect pairwise preferences. In experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question-answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for builders of policy-focused RAG systems: improvements to individual components do not necessarily translate into more reliable answers. Our findings offer practical insights for designing grounded question-answering systems over dynamic regulatory corpora.