Modern data applications increasingly involve heterogeneous data managed in different models and stored across disparate database engines, often deployed as separate installations. Limited research has addressed cross-model query processing in federated environments. This paper takes a step toward bridging this gap by: (1) proposing a unified algebra that formally defines a class of cross-model join queries between a graph store and a relational store; (2) introducing one real-world benchmark and four semi-synthetic benchmarks to evaluate such queries; and (3) proposing a lightweight middleware, MICRO, for efficient query execution. At the core of MICRO is CMLero, a learning-to-rank-based query optimizer that selects efficient execution plans without requiring exact cost estimation. By avoiding the need to materialize or convert all data into a single model, which is often infeasible due to third-party data control or cost, MICRO enables native querying across heterogeneous systems. Experimental results on the benchmark workloads demonstrate that MICRO outperforms the state-of-the-art federated relational system XDB by up to 2.1x in total runtime across the full test set. On the 93 test queries of the real-world benchmark, 14 queries achieve over 100x speedup, including 4 queries with more than 100x speedup; however, 4 queries experience slowdowns of over 5 seconds, highlighting opportunities for future improvement of MICRO. Further comparisons show that CMLero consistently outperforms rule-based and regression-based optimizers, highlighting the advantage of learning-to-rank in complex cross-model optimization.
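To illustrate the idea of choosing a plan by ranking rather than by exact cost estimation, the following is a minimal sketch of a pairwise learning-to-rank plan selector. It is not CMLero's actual model: the linear scorer, the two plan features (normalized rows shipped between engines and normalized cross-engine round trips), the toy runtimes, and the function names `train_pairwise_ranker` and `select_plan` are all assumptions made for illustration.

```python
import math
import random


def score(weights, features):
    """Linear ranking score; a higher score means the plan is preferred."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise_ranker(queries, epochs=200, lr=0.1, seed=0):
    """Fit a linear scorer from pairwise preferences between plans.

    queries: one list per query of (features, observed_runtime) tuples.
    Only the runtime ORDER within a query is used, never its magnitude.
    """
    rng = random.Random(seed)
    dim = len(queries[0][0][0])
    weights = [0.0] * dim

    # Build (faster_plan, slower_plan) pairs within each query.
    pairs = []
    for plans in queries:
        for i, (fa, ta) in enumerate(plans):
            for fb, tb in plans[i + 1:]:
                if ta == tb:
                    continue
                pairs.append((fa, fb) if ta < tb else (fb, fa))

    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - w for b, w in zip(better, worse)]
            margin = score(weights, diff)
            # Gradient step on the pairwise logistic loss log(1 + exp(-margin)).
            step = 1.0 / (1.0 + math.exp(min(margin, 50.0)))
            weights = [w + lr * step * d for w, d in zip(weights, diff)]
    return weights


def select_plan(weights, candidate_plans):
    """Return the feature vector of the top-ranked candidate plan."""
    return max(candidate_plans, key=lambda f: score(weights, f))


# Hypothetical training data: [rows shipped, round trips], runtime in seconds.
training_queries = [
    [([0.9, 0.8], 12.0), ([0.2, 0.3], 1.5), ([0.5, 0.6], 4.0)],
    [([0.7, 0.2], 6.0), ([0.1, 0.1], 0.8)],
    [([0.4, 0.9], 7.0), ([0.3, 0.2], 2.0)],
]
weights = train_pairwise_ranker(training_queries)
best = select_plan(weights, [[0.8, 0.7], [0.3, 0.1], [0.6, 0.4]])
print(best)
```

The scorer never predicts a runtime; it only has to order the candidate plans of a query correctly, which is the property that distinguishes a ranking-based optimizer from a regression-based one.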