Enhancing In-Memory Spatial Indexing with Learned Search

Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community to build systems and applications for efficient spatial data processing. In this study, we apply a recently developed machine-learned search technique for single-dimensional sorted data to spatial indexing. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. In line with recent research trends, we tune the partitioning techniques to be instance-optimized. By tuning each partitioning technique for optimal performance, we demonstrate that: (i) grid-based index structures outperform tree-based index structures (by 1.23$\times$ to 2.47$\times$), (ii) learning-enhanced variants of commonly used spatial index structures outperform their original counterparts (by 1.44$\times$ to 53.34$\times$), (iii) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension, (iv) the benefit of machine-learned search diminishes in the presence of other compute-intensive operations (e.g., scan costs in higher-selectivity queries, Haversine distance computation, and point-in-polygon tests), and (v) index lookup is the bottleneck for tree-based structures, a cost that could potentially be reduced by linearizing the indexed partitions.
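
The combination of partitioning and learned search described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the study: the abstract does not name the learned model, so the sketch assumes a simple linear interpolation over the sorted x-coordinates with an error bound measured at build time, followed by a binary search restricted to that error window. The uniform grid, the `cell_size` parameter, and the class names `LearnedPartition` and `LearnedGrid` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from collections import defaultdict
from math import ceil, floor


class LearnedPartition:
    """Points of one partition, sorted by x, with a linear position model.

    Assumption: a single linear interpolation stands in for the learned
    search model; it is not the model used in the study.
    """

    def __init__(self, points):
        self.points = sorted(points)  # sort by x, then y
        self.xs = [p[0] for p in self.points]
        n = len(self.xs)
        self.x0 = self.xs[0]
        span = self.xs[-1] - self.x0
        self.slope = (n - 1) / span if span > 0 else 0.0
        # Largest gap between predicted and actual position of any stored key.
        self.err = max(abs(self._predict(x) - i) for i, x in enumerate(self.xs))

    def _predict(self, x):
        return self.slope * (x - self.x0)

    def lower_bound(self, x):
        """Index of the first stored x-coordinate that is >= x."""
        n = len(self.xs)
        pred = min(max(self._predict(x), 0.0), n - 1)
        # The model is monotone, so the answer lies in [pred - err, pred + err + 1].
        lo = max(0, floor(pred - self.err))
        hi = max(lo, min(n, ceil(pred + self.err) + 1))
        return bisect_left(self.xs, x, lo, hi)

    def range_query(self, xmin, ymin, xmax, ymax):
        out = []
        i = self.lower_bound(xmin)  # learned search on one dimension
        while i < len(self.points) and self.xs[i] <= xmax:
            if ymin <= self.points[i][1] <= ymax:  # scan and filter on y
                out.append(self.points[i])
            i += 1
        return out


class LearnedGrid:
    """Uniform grid (hypothetical) whose cells are LearnedPartition objects."""

    def __init__(self, points, cell_size):
        self.cell = cell_size
        buckets = defaultdict(list)
        for p in points:
            buckets[self._cell_of(p[0], p[1])].append(p)
        self.parts = {c: LearnedPartition(ps) for c, ps in buckets.items()}

    def _cell_of(self, x, y):
        return (floor(x / self.cell), floor(y / self.cell))

    def range_query(self, xmin, ymin, xmax, ymax):
        cx0, cy0 = self._cell_of(xmin, ymin)
        cx1, cy1 = self._cell_of(xmax, ymax)
        out = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                part = self.parts.get((cx, cy))
                if part is not None:
                    out.extend(part.range_query(xmin, ymin, xmax, ymax))
        return out
```

Calling `LearnedGrid(points, cell_size=2.5).range_query(xmin, ymin, xmax, ymax)` returns the points inside the query rectangle. The scan after the lower-bound lookup also shows why, per finding (iv), the advantage of a faster search step shrinks as the number of scanned points per query grows.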
