Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult because of domain complexity, widely varying context scales, and heavy reliance on expert assessments that are costly, inconsistent, and hard to scale. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics that measure factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K to 32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as the context window grows. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors (logarithmic growth, early saturation, and cold-start dynamics) and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms the portability of the evaluation.