Context graphs are essential for modern AI applications including question answering, pattern discovery, and data analysis. Building accurate context graphs from structured databases requires inferring join relationships between entities. Invalid joins introduce ambiguity and duplicate records, compromising graph quality. We present a scalable join inference approach combining statistical pruning with Large Language Model (LLM) reasoning. Unlike purely statistics-based methods, our hybrid approach mimics human semantic understanding while mitigating LLM hallucination through data-driven inference. We first identify primary key candidates and use LLMs for adjudication, then detect inclusion dependencies with the same two-stage process. This statistics-LLM combination scales to large schemas while maintaining accuracy and minimizing false positives. We further leverage database query history to refine join inferences over time as query workloads evolve. Our evaluation on TPC-DS, TPC-H, BIRD-Dev, and production workloads demonstrates that our approach achieves high precision (78-100%) on well-structured schemas, while highlighting the inherent difficulty of join discovery in poorly normalized settings.
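The statistical pruning stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy `customers`/`orders` tables and function names are hypothetical, and real systems would sample large columns rather than scan them fully. Columns whose values are unique become primary key candidates, and foreign-key/primary-key pairs satisfying an inclusion dependency become join candidates for LLM adjudication.

```python
# Minimal sketch of statistics-based join-candidate pruning.
# Tables are modeled as dicts mapping column name -> list of values;
# the example data below is illustrative, not from the paper's datasets.

def primary_key_candidates(table):
    """Return columns whose non-null values are all distinct."""
    candidates = []
    for col, values in table.items():
        non_null = [v for v in values if v is not None]
        if non_null and len(non_null) == len(set(non_null)):
            candidates.append(col)
    return candidates

def inclusion_dependencies(fk_table, pk_table, pk_cols):
    """Return (fk_col, pk_col) pairs where every non-null value in
    fk_col also appears in pk_col (an inclusion dependency)."""
    deps = []
    for fk_col, fk_values in fk_table.items():
        fk_set = {v for v in fk_values if v is not None}
        for pk_col in pk_cols:
            if fk_set and fk_set <= set(pk_table[pk_col]):
                deps.append((fk_col, pk_col))
    return deps

# Toy data: orders.cust_id should be found to reference customers.cust_id.
customers = {"cust_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]}
orders = {"order_id": [10, 11], "cust_id": [1, 3]}

pk_cols = primary_key_candidates(customers)
join_candidates = inclusion_dependencies(orders, customers, pk_cols)
# Each surviving (fk_col, pk_col) pair would then be passed to the LLM
# for semantic adjudication, as in the two-stage process above.
```

Note that purely statistical checks like these over-generate (e.g., any small integer column may be "included" in a larger one by chance), which is precisely why the approach pairs them with LLM adjudication.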