Unstructured data (e.g., images, videos, and PDF files) contains semantic information, such as a person's facial features or a vehicle's plate number. Semantic relationships may exist between data items without being explicitly represented; for example, the same person's face may appear in two otherwise unrelated photos. Much information is also represented as structured data (e.g., the person's name and age). End users want to query the semantic information in unstructured data together with structured data, based on the potential relationships among them. However, because there is no unified database system for structured and unstructured data, developers have to combine multiple systems and runtimes to answer these queries. In this work, we build an open-source graph database named PandaDB to manage and query structured and unstructured data consistently. We first introduce a graph data model to manage structured and unstructured data, then propose a new query language to understand the semantics of the unstructured data in the graph. Next, we develop a new cost model and related query optimization techniques to speed up the unstructured data processing pipeline. Finally, we optimize the unstructured data storage and provide an index to speed up query processing over unstructured data. PandaDB is widely used in industrial applications such as FinTech, knowledge graphs, and recommendation systems. The results show that PandaDB can support large-scale unstructured data query processing in a graph.