Because spatial datasets have grown significantly in size, distributed parallel processing systems are essential for analyzing them efficiently. In this paper, we present the first study of learned spatial data partitioning, which uses machine learning techniques to assign groups of spatial objects to machines based on their locations. We formalize spatial data partitioning as a reinforcement learning problem and develop a novel deep reinforcement learning algorithm. Our learning algorithm leverages features of spatial data partitioning and prunes ineffective learning processes to find optimal partitions efficiently. Our experimental study, which uses Apache Sedona and real-world spatial data, demonstrates that our method efficiently finds partitions that accelerate distance join queries and reduces workload run time by up to 59.4%.
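To make the reinforcement learning framing more concrete, the toy sketch below treats the current set of rectangular partitions as the state, a median split of one partition as an action, and the drop in an estimated join cost as the reward. It is not the algorithm described in the abstract: a simple greedy search stands in for deep reinforcement learning, and the cost model, the threshold `eps`, and all names (`Rect`, `cost`, `split`, `greedy_partition`) are hypothetical choices made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def holds(self, p, eps=0.0):
        # Half-open box, optionally grown by eps on every side.
        return (self.xmin - eps <= p[0] < self.xmax + eps
                and self.ymin - eps <= p[1] < self.ymax + eps)


def cost(partitions, points, eps):
    # Hypothetical cost model: the busiest worker dominates run time, and
    # points within eps of a border are replicated to the neighbouring worker.
    return max(sum(r.holds(p, eps) for p in points) for r in partitions)


def split(rect, axis, points):
    # Action: cut rect at the median coordinate of the points it owns.
    owned = sorted(p[axis] for p in points if rect.holds(p))
    if len(owned) < 2:
        return None
    cut = owned[len(owned) // 2]
    lo, hi = (rect.xmin, rect.xmax) if axis == 0 else (rect.ymin, rect.ymax)
    if not lo < cut < hi:
        return None
    if axis == 0:
        return (Rect(rect.xmin, rect.ymin, cut, rect.ymax),
                Rect(cut, rect.ymin, rect.xmax, rect.ymax))
    return (Rect(rect.xmin, rect.ymin, rect.xmax, cut),
            Rect(rect.xmin, cut, rect.xmax, rect.ymax))


def greedy_partition(points, root, num_partitions, eps):
    partitions = [root]  # state: the current list of partitions
    while len(partitions) < num_partitions:
        before = cost(partitions, points, eps)
        best = None
        for i, rect in enumerate(partitions):
            for axis in (0, 1):
                halves = split(rect, axis, points)
                if halves is None:
                    continue
                candidate = partitions[:i] + list(halves) + partitions[i + 1:]
                # Reward: reduction in estimated cost after the split.
                reward = before - cost(candidate, points, eps)
                if best is None or reward > best[0]:
                    best = (reward, candidate)
        if best is None:  # no partition can be split any further
            break
        partitions = best[1]
    return partitions


random.seed(0)
points = [(random.random(), random.random()) for _ in range(500)]
root = Rect(0.0, 0.0, 1.0, 1.0)
parts = greedy_partition(points, root, num_partitions=8, eps=0.01)
print(len(parts), cost([root], points, 0.01), cost(parts, points, 0.01))
```

A learned approach would replace the exhaustive inner loop with a policy that proposes splits and would measure reward from actual query run time rather than from a hand-written estimate.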