Large Language Models (LLMs) have become a popular approach for implementing Retrieval-Augmented Generation (RAG) systems, and significant effort has been spent on building good models and metrics. Despite increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond generating model output and automatically computing metrics. We present InspectorRAGet, an introspection platform for RAG evaluation. InspectorRAGet allows the user to analyze the aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics, as well as to assess annotator quality. InspectorRAGet is suitable for multiple use cases and is publicly available to the community. The demo video is available at https://youtu.be/MJhe8QIXcEc