We present the design and implementation of RAGPerf, a benchmarking framework for characterizing the system behavior of retrieval-augmented generation (RAG) pipelines. To facilitate detailed profiling and fine-grained performance analysis, RAGPerf decouples the RAG workflow into modular components: embedding, indexing, retrieval, reranking, and generation. RAGPerf offers users the flexibility to configure the core parameters of each component and examine their impact on end-to-end query performance and quality. Its workload generator models real-world scenarios by supporting diverse datasets (e.g., text, PDF, code, and audio), different retrieval and update ratios, and varying query distributions. RAGPerf also supports different embedding models; major vector databases such as LanceDB, Milvus, Qdrant, Chroma, and Elasticsearch; and different LLMs for content generation. It automates the collection of performance metrics (i.e., end-to-end query throughput, host/GPU memory footprint, and CPU/GPU utilization) and accuracy metrics (i.e., context recall, query accuracy, and factual consistency). We demonstrate the capabilities of RAGPerf through a comprehensive set of experiments and open-source its codebase on GitHub. Our evaluation shows that RAGPerf itself incurs negligible performance overhead.