In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts RAG performance. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to state-of-the-art query rewriting models while substantially reducing deployment costs. We also present ChatRAG Bench, which encompasses ten datasets covering comprehensive evaluations of RAG, table-related QA, arithmetic calculation, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, slightly outperforms GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on ChatRAG Bench, without relying on any synthetic data from OpenAI GPT models. Notably, Llama3-ChatQA-1.5-70B surpasses the accuracy of GPT-4-Turbo-2024-04-09, achieving a 4.4% improvement. To advance research in this field, we open-source the model weights, instruction tuning data, ChatRAG Bench, and retriever for the community: https://chatqa-project.github.io/.