Stream computing processes large-scale streaming data in real time, making it well suited to applications that require immediate results, such as real-time recommendation, anomaly detection, and monitoring. In this context, the multi-way stream join operator is crucial: it joins multiple data streams within a single operator and provides deeper insight by integrating information from several sources. However, memory limitations become a problem when processing data streams with long-lived state, particularly in streaming SQL. In this paper, we propose a multi-way stream join method for streaming SQL that uses an LSM-Tree to address this issue. We first introduce a multi-way stream join operator called UMJoin, which employs an LSM-Tree state backend to leverage disk storage, so that the stored multi-way stream state can exceed what memory alone can accommodate. We then develop an execution plan conversion method for the UMJoin operator, referred to as TSC. TSC identifies binary join tree patterns and generates the corresponding multi-way stream join nodes, transforming execution plans built from binary joins into plans that contain UMJoin nodes. This transformation makes the UMJoin operator usable from streaming SQL. Experiments on the TPC-DS dataset show that the UMJoin operator can process data streams with long-lived state effectively, even with limited memory. In addition, execution plan conversion tests on multi-way stream join queries from the TPC-H benchmark confirm that TSC performs these conversions effectively.
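The plan conversion idea described above (recognizing a tree of binary joins and replacing it with one multi-way join node) can be illustrated with a minimal sketch. This is not the TSC method itself, whose details are not given here; the class names `Scan`, `BinaryJoin`, `MultiJoin` and the function `convert` are hypothetical, and the sketch assumes all joins are inner joins, so that flattening does not change the result.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Scan:
    """Leaf of the plan: one input stream (hypothetical node type)."""
    stream: str


@dataclass
class BinaryJoin:
    """Two-input inner join, as produced by a binary-join planner."""
    left: "Plan"
    right: "Plan"
    condition: str


@dataclass
class MultiJoin:
    """Single node that joins all inputs at once (stand-in for a multi-way join node)."""
    inputs: List["Plan"]
    conditions: List[str]


Plan = Union[Scan, BinaryJoin, MultiJoin]


def _collect(plan: Plan, inputs: List[Plan], conditions: List[str]) -> None:
    """Walk down through nested BinaryJoin nodes, gathering leaves and conditions."""
    if isinstance(plan, BinaryJoin):
        _collect(plan.left, inputs, conditions)
        _collect(plan.right, inputs, conditions)
        conditions.append(plan.condition)
    else:
        inputs.append(plan)


def convert(plan: Plan) -> Plan:
    """Replace a tree of three or more joined inputs with one MultiJoin node."""
    if not isinstance(plan, BinaryJoin):
        return plan
    inputs: List[Plan] = []
    conditions: List[str] = []
    _collect(plan, inputs, conditions)
    if len(inputs) < 3:
        # A lone binary join is left unchanged in this sketch.
        return plan
    return MultiJoin(inputs, conditions)


# Example: (orders JOIN customers) JOIN items
binary_plan = BinaryJoin(
    BinaryJoin(Scan("orders"), Scan("customers"), "o.cid = c.id"),
    Scan("items"),
    "o.id = i.oid",
)
multi_plan = convert(binary_plan)
```

In this sketch the three scans end up as direct inputs of one `MultiJoin` node, with both join conditions preserved. A real converter would also have to check join types, windowing, and key compatibility before merging nodes, which this example omits.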