Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets. Our code and datasets are publicly available at: https://github.com/Mushtari-Sadia/SQUiD.
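To make the target output concrete, the following is a minimal sketch of what "generating a schema and populating its tables from raw text" produces. It is not SQUiD's pipeline: the abstract does not name the four stages, so none are modeled here. The example sentence, the `department` and `employee` tables, and the helper names are all hypothetical, and the extraction step that an LLM would perform is replaced by hand-written values.

```python
import sqlite3

# Hypothetical raw text; in the setting described above, an LLM would read this.
TEXT = (
    "Alice Chen works in the Research department. "
    "Bob Diaz works in the Sales department."
)

# Hypothetical schema that a schema-generation step might propose for TEXT.
SCHEMA = """
CREATE TABLE department (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES department(id)
);
"""

# Hand-written (person, department) pairs standing in for values
# that would be extracted from TEXT (assumption: no model is called here).
EXTRACTED = [("Alice Chen", "Research"), ("Bob Diaz", "Sales")]


def build_database(schema, pairs):
    """Create an in-memory database from the schema and fill both tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clause
    conn.executescript(schema)
    for person, dept in pairs:
        # Insert each department once, then look up its key for the employee row.
        conn.execute("INSERT OR IGNORE INTO department (name) VALUES (?)", (dept,))
        (dept_id,) = conn.execute(
            "SELECT id FROM department WHERE name = ?", (dept,)
        ).fetchone()
        conn.execute(
            "INSERT INTO employee (name, department_id) VALUES (?, ?)",
            (person, dept_id),
        )
    conn.commit()
    return conn


def employees_by_department(conn):
    """Join the two tables back together to check the populated data."""
    rows = conn.execute(
        "SELECT e.name, d.name FROM employee AS e "
        "JOIN department AS d ON d.id = e.department_id "
        "ORDER BY e.name"
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_database(SCHEMA, EXTRACTED)
    print(employees_by_department(conn))
```

The join at the end shows why a relational target is useful: once the text is stored as linked tables, questions about it become ordinary SQL queries.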