In many use cases, information is available in text but not in structured form. However, extracting data from natural language text so that it precisely fits a schema, and can therefore be queried, is a challenging task. With the rise of pre-trained Large Language Models (LLMs), there is now an effective way to store and use information extracted from massive corpora of text documents. We therefore envision using SQL queries to cover a broad range of data that is not captured by traditional databases, by tapping the information in LLMs. To ground this vision, we present Galois, a prototype based on a traditional database architecture, but with new physical operators for querying the underlying LLM. The main idea is to execute some operators of the query plan with prompts that retrieve data from the LLM. For a large class of SQL queries, querying LLMs returns well-structured relations, with encouraging qualitative results. Preliminary experimental results suggest that pre-trained LLMs are a promising addition to the field of database systems, introducing a new direction for hybrid query processing. However, we pinpoint several research challenges that must be addressed to build a DBMS that exploits LLMs. While some of these challenges require integrating concepts from the NLP literature, others offer novel research avenues for the DB community.
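The idea of executing some plan operators through prompts can be illustrated with a minimal sketch. It is not the Galois implementation: the operator names (`LLMScan`, `LLMFetchAttribute`, `Filter`), the prompt wording, and the canned `fake_llm` stand-in are all assumptions made for illustration. The plan below corresponds roughly to `SELECT name FROM country WHERE continent = 'Europe'`, where the scan and the attribute lookup are answered by the model and the selection is evaluated conventionally.

```python
from typing import Callable, Iterator

# Hypothetical stand-in for an LLM client: maps a prompt string to a text reply.
LLM = Callable[[str], str]


class LLMScan:
    """Leaf operator: asks the model to enumerate the key values of a relation."""

    def __init__(self, llm: LLM, relation: str, key: str):
        self.llm, self.relation, self.key = llm, relation, key

    def __iter__(self) -> Iterator[dict]:
        # Illustrative prompt wording, not taken from the paper.
        reply = self.llm(f"List the {self.key} of every {self.relation}, one per line.")
        for line in reply.splitlines():
            if line.strip():
                yield {self.key: line.strip()}


class LLMFetchAttribute:
    """Extends each input tuple with one attribute, using one prompt per tuple."""

    def __init__(self, llm: LLM, child, relation: str, key: str, attribute: str):
        self.llm, self.child = llm, child
        self.relation, self.key, self.attribute = relation, key, attribute

    def __iter__(self) -> Iterator[dict]:
        for row in self.child:
            prompt = (
                f"What is the {self.attribute} of the {self.relation} "
                f"{row[self.key]}? Answer with the value only."
            )
            yield {**row, self.attribute: self.llm(prompt).strip()}


class Filter:
    """Conventional operator: evaluated by the engine, not by the model."""

    def __init__(self, child, predicate: Callable[[dict], bool]):
        self.child, self.predicate = child, predicate

    def __iter__(self) -> Iterator[dict]:
        return (row for row in self.child if self.predicate(row))


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs offline; a real model may answer differently."""
    if prompt.startswith("List the name"):
        return "France\nItaly\nJapan\n"
    continents = {"France": "Europe", "Italy": "Europe", "Japan": "Asia"}
    for country, continent in continents.items():
        if country in prompt:
            return continent
    return "unknown"


def run_query(llm: LLM) -> list:
    # Plan: Filter(continent = 'Europe') over FetchAttribute(continent) over Scan(country).
    scan = LLMScan(llm, relation="country", key="name")
    with_continent = LLMFetchAttribute(llm, scan, "country", "name", "continent")
    selected = Filter(with_continent, lambda row: row["continent"] == "Europe")
    return [row["name"] for row in selected]


print(run_query(fake_llm))  # prints ['France', 'Italy']
```

Keeping the selection outside the model is only one possible design; a predicate could also be pushed into the prompt itself, at the cost of relying on the model to apply it correctly.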