In high-stakes domains such as legal question answering, the accuracy and trustworthiness of generative AI systems are of paramount importance. This work presents a comprehensive benchmark of methods for assessing the groundedness of AI-generated responses, with the aim of enhancing their reliability. Our experiments include similarity-based metrics and natural language inference models to evaluate whether responses are well-founded in the given contexts. We also explore different prompting strategies for large language models to improve the detection of ungrounded responses. We validate the effectiveness of these methods on a newly created grounding classification corpus, designed specifically for legal queries and the corresponding responses from retrieval-augmented prompting, focusing on their alignment with source material. Our results show that groundedness classification of generated responses is promising, with the best method achieving a macro-F1 score of 0.8. In addition, we evaluate the methods in terms of latency to determine their suitability for real-world applications, since this check typically runs after the generation step. Reliable detection is essential for workflows that may trigger additional manual verification or automated regeneration of responses. In summary, this study demonstrates the potential of these detection methods to improve the trustworthiness of generative AI in legal settings.
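To make the reported evaluation metric concrete, the macro-F1 score used above is the unweighted mean of per-class F1 scores over the grounded/ungrounded labels. The following minimal sketch computes it from scratch; the labels and predictions are illustrative examples, not data from the paper's corpus:

```python
def macro_f1(y_true, y_pred, labels=("grounded", "ungrounded")):
    """Macro-F1: the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        # Per-class counts, treating `label` as the positive class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical classifier output on five query-response pairs
y_true = ["grounded", "grounded", "ungrounded", "ungrounded", "grounded"]
y_pred = ["grounded", "ungrounded", "ungrounded", "ungrounded", "grounded"]
print(round(macro_f1(y_true, y_pred), 3))
```

In practice a library implementation such as scikit-learn's `f1_score(..., average="macro")` would typically be used; the explicit version shows how each class contributes equally regardless of class imbalance.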