Collecting data, extracting value, and combining insights from relational and context-rich multi-modal sources in data processing pipelines is a challenge for traditional relational DBMSs. Relational operators allow declarative and optimizable query specification, but the data transformations they support are ill-suited to capturing or analyzing context. Representation learning models, on the other hand, can map context-rich data into embeddings, which enables automated context processing but requires integrating imperative data transformations with the analytical query. To bridge this dichotomy, we present a context-enhanced relational join and introduce an embedding operator composable with relational operators. This enables hybrid processing of relational and context-rich vector data, with algebraic equivalences compatible with relational algebra and corresponding logical and physical optimizations. We investigate model-operator interaction with vector data processing and study the characteristics of the E-join operator. Using string embeddings as an example, we demonstrate hybrid context-enhanced processing on relational join operators with vector embeddings. An order-of-magnitude improvement in execution time demonstrates the importance of holistic optimization, from the logical to the physical level.
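To make the idea of a join over string embeddings concrete, the following is a minimal sketch under stated assumptions, not the operator defined in this work. The toy character n-gram `embed` function stands in for a learned representation model, and the names `embed`, `cosine`, `embedding_join`, and the `threshold` parameter are hypothetical, introduced only for illustration. The join is a plain nested loop with a cosine-similarity predicate and applies none of the logical or physical optimizations discussed here.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding (assumption): sparse counts of character n-grams.

    A real pipeline would call a learned embedding model here.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embedding_join(left, right, left_key, right_key, threshold=0.5):
    """Nested-loop join: emit merged rows whose embedded keys are similar enough."""
    # Embed the right side once instead of once per left row.
    right_vecs = [(row, embed(row[right_key])) for row in right]
    joined = []
    for lrow in left:
        lvec = embed(lrow[left_key])
        for rrow, rvec in right_vecs:
            if cosine(lvec, rvec) >= threshold:
                joined.append({**lrow, **rrow})
    return joined


# Hypothetical example relations.
customers = [{"id": 1, "name": "Acme Corporation"}, {"id": 2, "name": "Globex"}]
vendors = [{"vendor": "ACME Corp", "city": "Lausanne"}]

for row in embedding_join(customers, vendors, "name", "vendor"):
    print(row)
```

An equality join on `name = vendor` would return nothing for these rows, while the similarity predicate matches the two spellings of the same company. Embedding one side up front is a small instance of the kind of model-operator interaction that a holistic optimizer could reason about.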