DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement

A tremendous number of critical database systems lack adequate documentation. Declared primary keys are absent, foreign key constraints have been dropped for performance, column names are cryptic abbreviations, and no entity-relationship diagrams exist. We present DBAutoDoc, a system that automates the discovery and documentation of undocumented relational database schemas by combining statistical data analysis with iterative large language model (LLM) refinement. DBAutoDoc's central insight is that schema understanding is fundamentally an iterative, graph-structured problem. Drawing structural inspiration from backpropagation in neural networks, DBAutoDoc propagates semantic corrections through schema dependency graphs across multiple refinement iterations until descriptions converge. This propagation is discrete and semantic rather than mathematical, but the structural analogy is precise: early iterations produce rough descriptions akin to random initialization, and successive passes sharpen the global picture as context flows through the graph. The system makes four concrete contributions detailed in the paper. On a suite of benchmark databases, DBAutoDoc achieved overall weighted scores of 96.1% across two model families (Google's Gemini and Anthropic's Claude) using a composite metric. Ablation analysis demonstrates that the deterministic pipeline contributes a 23-point F1 improvement over LLM-only FK detection, confirming that the system's contribution is substantial and independent of LLM pre-training knowledge. DBAutoDoc is released as open-source software with all evaluation configurations and prompt templates included for full reproducibility.
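The refinement loop described above can be pictured as a fixed-point iteration over the schema dependency graph. The following is a minimal sketch under stated assumptions, not DBAutoDoc's actual implementation: the names `refine_until_converged` and `toy_refine`, the graph encoding, and the three-table example are all hypothetical, and a deterministic stand-in replaces the LLM call that would rewrite a table's description given its neighbors' descriptions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding: each table maps to the tables it is linked to.
Graph = Dict[str, List[str]]
Refiner = Callable[[str, str, Dict[str, str]], str]


def refine_until_converged(
    graph: Graph,
    initial: Dict[str, str],
    refine: Refiner,
    max_iters: int = 10,
) -> Tuple[Dict[str, str], int]:
    """Re-describe every table from its neighbors until nothing changes."""
    descriptions = dict(initial)
    for iteration in range(1, max_iters + 1):
        # Snapshot so every update in this pass sees the previous pass only.
        previous = dict(descriptions)
        for table, neighbors in graph.items():
            context = {name: previous[name] for name in neighbors}
            descriptions[table] = refine(table, previous[table], context)
        if descriptions == previous:
            # A full pass produced no corrections: descriptions have converged.
            return descriptions, iteration
    return descriptions, max_iters


def toy_refine(table: str, current: str, neighbors: Dict[str, str]) -> str:
    """Stand-in for an LLM call (assumption): borrow context from a described neighbor."""
    if current != "unknown":
        return current
    known = sorted(name for name, text in neighbors.items() if text != "unknown")
    return f"rows linked to {known[0]}" if known else current


# Hypothetical three-table schema with cryptic names.
graph = {"cust": ["ord"], "ord": ["cust", "ord_ln"], "ord_ln": ["ord"]}
initial = {"cust": "customer master data", "ord": "unknown", "ord_ln": "unknown"}

final, iterations = refine_until_converged(graph, initial, toy_refine)
print(final, iterations)
```

In this toy run, context reaches `ord` in the first pass and `ord_ln` only in the second, which mirrors the idea that successive iterations carry information further through the graph. A real refiner would not be guaranteed to reach an exact fixed point, so the `max_iters` cap (also an assumption here) bounds the loop.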
