This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.