Retrieval-augmented generation (RAG) improves large language model reliability by grounding generated responses in external evidence. However, RAG performance depends on the relevance of retrieved passages, the quality of evidence ranking, and the ability to verify whether generated claims are supported by source documents. This study presents a hybrid retrieval and reranking framework for citation-aware RAG in biomedical and healthcare-related document question answering. The framework uses Amazon Bedrock Knowledge Bases for document ingestion, parsing, chunking, embedding generation, and evidence retrieval. Source PDF documents are stored in Amazon S3, embedded using Amazon Titan Text Embeddings V2, and indexed with Amazon OpenSearch Serverless. Hybrid retrieval first retrieves candidate evidence chunks, and Cohere reranking then prioritizes the most relevant passages before answer generation. The answer-generation stage uses top-ranked evidence chunks to produce controlled, evidence-grounded responses, while a separate judge model evaluates each generated factual claim against the retrieved evidence. The framework was evaluated using 25 biomedical NLP and healthcare transformer queries as a pilot-scale proof-of-concept study. Across the evaluation set, the system retrieved and reranked 500 evidence chunks and generated answers from top-ranked evidence. Claim-level grounding evaluation extracted 200 factual claims, all of which were judged to be supported by retrieved evidence, resulting in 100.0% grounding accuracy. The results suggest that hybrid retrieval, reranking, conservative prompting, and claim-level evaluation can support reliable evidence-grounded RAG responses when sufficient source evidence is available.
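The claim-level grounding metric reported above can be illustrated with a minimal sketch. It assumes grounding accuracy is the percentage of extracted factual claims that the judge model marks as supported, which is consistent with 200 supported claims out of 200 giving 100.0%. The `ClaimVerdict` record and the `grounding_accuracy` function are hypothetical names introduced for illustration, not the study's implementation, and the verdicts below are placeholders that only mirror the reported counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimVerdict:
    """One judged claim (hypothetical record layout, not from the study)."""
    query_id: int
    claim: str
    supported: bool  # True if the judge found support in the retrieved evidence


def grounding_accuracy(verdicts):
    """Return the percentage of claims judged as supported by evidence."""
    if not verdicts:
        raise ValueError("no claims to evaluate")
    supported = sum(1 for v in verdicts if v.supported)
    return 100.0 * supported / len(verdicts)


# Placeholder verdicts mirroring the reported totals: 200 claims spread
# over 25 queries, all judged supported. The claim text is synthetic.
verdicts = [
    ClaimVerdict(query_id=i % 25, claim=f"claim {i}", supported=True)
    for i in range(200)
]

print(f"{grounding_accuracy(verdicts):.1f}%")  # prints 100.0%
```

Under the same assumption, a single unsupported claim among the 200 would lower the score to 99.5%, so the metric is sensitive to individual judge decisions at this pilot scale.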