In this work, we introduce ChatQA, a family of conversational question answering (QA) models that attain GPT-4-level accuracy. Specifically, we propose a two-stage instruction tuning method that significantly improves the zero-shot conversational QA results of large language models (LLMs). To handle retrieval in conversational QA, we fine-tune a dense retriever on a multi-turn QA dataset; it achieves results comparable to the state-of-the-art query rewriting model while substantially reducing deployment cost. Notably, ChatQA-70B outperforms GPT-4 in average score across 10 conversational QA datasets (54.14 vs. 53.90) without relying on any synthetic data from OpenAI GPT models.
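To make the contrast with query rewriting concrete, the following is a minimal sketch of retrieval over a multi-turn conversation without a separate rewriting step. It rests on assumptions not stated above: that the retriever input is the dialogue history concatenated with the current question, and that a toy bag-of-words encoder (`embed`, hypothetical) can stand in for the fine-tuned dense retriever. The helper names, the example dialogue, and the passages are all illustrative.

```python
import math
import re
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_query(turns):
    """Assumed input format: every turn, including the latest question, joined into one string."""
    return " ".join(f"{role}: {text}" for role, text in turns)


def retrieve(turns, passages, k=1):
    """Rank passages against the whole conversation rather than a rewritten question."""
    query_vec = embed(build_query(turns))
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    turns = [
        ("user", "Tell me about Acme Robotics."),
        ("assistant", "Acme Robotics builds warehouse robots."),
        ("user", "Where is it headquartered?"),
    ]
    passages = [
        "Acme Robotics is headquartered in Springfield.",
        "Globex Shipping is headquartered in Shelbyville.",
    ]
    print(retrieve(turns, passages, k=1))
```

In this toy setting the final question alone ("Where is it headquartered?") matches both passages equally well; the earlier turns are what disambiguate "it". A query rewriting pipeline would instead call a separate model to produce a standalone question before retrieval, which is the extra deployment step the single-retriever approach avoids.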