Retrieval-Augmented Generation (RAG) fine-tuning has shown substantial improvements over vanilla RAG, yet most studies target document question answering and often rely on standard NLP metrics that can obscure factual differences. We evaluate RAG fine-tuning for long-form text generation in electronic design automation, adapting a 7B model under five context augmentation strategies with varying retrieval conditions. We introduce TriFEX, a human-validated, triple-based evaluation pipeline that attributes generated claims to their origin (user query, context, or reference), and propose Parametric Knowledge Precision (PKP), which isolates internalized knowledge by filtering out claims leaked in the prompt. We show that ROUGE and BERTScore fail to detect factual differences that our triple-based evaluation reveals. Additionally, we demonstrate that an existing metric for knowledge internalization is retrieval-sensitive, with about 75% of its cross-condition variance driven by changes in the rate at which internal knowledge is expressed (PR), rather than by changes in its actual correctness (PKP). The fine-tuned 7B variants outperform a 72B baseline on most metrics and generalize across retrieval conditions and to a related benchmark. These results underscore the limitations of available metrics in RAG evaluation and show that smaller models can be adapted reasonably well to specialized tasks for cost-efficient, on-premises deployment.
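To make the idea of filtering out prompt-leaked claims concrete, the following is a minimal sketch of how a PKP-style score could be computed over extracted triples. It is not the TriFEX implementation: the `Triple` type, the function name `score_parametric_claims`, the use of exact set matching, and the reading of PR as the share of generated claims that do not appear in the prompt are all assumptions made for illustration. The abstract does not specify how triples are extracted or matched, so exact matching here is only a stand-in for that step.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """Hypothetical (subject, relation, object) claim representation."""
    subject: str
    relation: str
    obj: str


def score_parametric_claims(
    generated: Iterable[Triple],
    prompt: Iterable[Triple],
    reference: Iterable[Triple],
) -> Tuple[Optional[float], float]:
    """Return (pkp, parametric_rate) under the assumptions stated above.

    generated: triples extracted from the model output
    prompt:    triples found in the user query or the retrieved context
    reference: triples found in the reference answer
    """
    generated_set = set(generated)
    # Claims that also appear in the prompt are treated as leaked and
    # excluded; what remains is attributed to parametric knowledge.
    parametric = generated_set - set(prompt)

    # Assumed reading of PR: share of generated claims not copied from the prompt.
    rate = len(parametric) / len(generated_set) if generated_set else 0.0

    if not parametric:
        # Precision is undefined when no parametric claim was expressed.
        return None, rate

    # Assumed reading of PKP: share of parametric claims supported by the reference.
    supported = parametric & set(reference)
    return len(supported) / len(parametric), rate


if __name__ == "__main__":
    a = Triple("flip-flop", "is_a", "sequential element")
    b = Triple("setup time", "constrains", "data arrival")
    c = Triple("clock skew", "affects", "hold margin")
    d = Triple("netlist", "is_a", "layout format")  # unsupported claim

    pkp, rate = score_parametric_claims(
        generated=[a, b, c, d], prompt=[a], reference=[a, b, c]
    )
    print(pkp, rate)
```

Reporting the two values separately reflects the distinction drawn in the abstract: a score that mixes how often internal knowledge is expressed with how correct it is can shift across retrieval conditions even when correctness stays the same.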