Analytics on structured data is a mature field with many successful methods. However, most real-world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable analytics on unstructured data. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. The engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. It leverages the ability of LLMs to analyze unstructured data, while also exploiting advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs, and reviews, and across a range of useful query types, including conditional aggregation, semantic retrieval, and abstraction aggregation.