Question answering (QA) systems built on Large Language Models (LLMs) depend heavily on the retrieval component to supply domain-specific information and to reduce the risk of inaccurate responses or hallucinations. Although the evaluation of retrievers dates back to early research in Information Retrieval, assessing their performance within LLM-based chatbots remains a challenge. This study proposes a straightforward baseline for evaluating retrievers in Retrieval-Augmented Generation (RAG)-based chatbots. Our findings demonstrate that this evaluation framework gives a clearer picture of how the retriever performs and is better aligned with the overall performance of the QA system. Conventional metrics such as precision, recall, and F1 score may not fully capture LLMs' capabilities, since LLMs can produce accurate responses even with imperfect retrievers. In contrast, our method accounts for LLMs' ability to ignore irrelevant contexts, as well as for potential errors and hallucinations in their responses.
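As a point of reference, the sketch below shows how the conventional set-based retrieval metrics mentioned above (precision, recall, F1) are typically computed over retrieved versus ground-truth context IDs; it is illustrative only and not part of the paper's proposed framework, and the function name and document IDs are hypothetical.

```python
# Illustrative sketch (not the paper's method): conventional set-based
# retrieval metrics over retrieved vs. ground-truth context IDs per query.

def precision_recall_f1(retrieved_ids, relevant_ids):
    """Compute precision, recall, and F1 for a single query."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    true_positives = len(retrieved & relevant)

    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return precision, recall, f1


# Example: the retriever returns three chunks, two of which are relevant.
p, r, f = precision_recall_f1(["doc3", "doc7", "doc9"], ["doc3", "doc7"])
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.67 recall=1.00 f1=0.80
```

As the abstract argues, such scores can understate a RAG system's end-to-end quality, since the LLM may still answer correctly despite the extra, irrelevant chunk.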