The end-to-end lookup latency of a hierarchical index (such as a B-tree or a learned index) is determined by its structure: the number of layers, the kinds of branching functions used in each layer, the amount of data that must be fetched from each layer, and so on. Our primary observation is that by optimizing those structural parameters (or designs) specifically for a target system's I/O characteristics (e.g., latency, bandwidth), we can offer faster lookups than indexes that are not optimized in this way. Can we develop a systematic method for finding those optimal design parameters? Ideally, the method should be able to generate almost any existing index, or a novel combination of them, for the fastest possible lookup. In this work, we present a new data- and I/O-aware index builder (called AirIndex) that can find high-speed hierarchical index designs in a principled way. Specifically, AirIndex minimizes an objective function expressing the end-to-end latency in terms of various design choices (the number of layers, types of layers, and more) for given data and a given storage profile. It does so using a graph-based optimization method purpose-built to address the computational challenges arising from the inter-dependencies among index layers and from the exponentially many candidate parameters in a large search space. Our empirical studies confirm that AirIndex can find optimal index designs, build optimal indexes in times comparable to existing methods, and deliver up to 4.1x faster lookup than a lightweight B-tree library (LMDB), 3.3x--46.3x faster lookup than state-of-the-art learned indexes (RMI/CDFShop, PGM-Index, ALEX/APEX, PLEX), and 2.0x faster lookup than Data Calculator's suggestion across various datasets and storage settings.
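To make the idea of a storage-aware latency objective concrete, the following is a minimal sketch and not AirIndex's actual model or optimizer. It assumes a hypothetical affine storage profile (one read costs a fixed latency plus bytes divided by bandwidth), a uniform tree in which every layer uses the same fanout, and a brute-force search over candidate fanouts in place of the graph-based optimization described above. All names (`StorageProfile`, `lookup_latency`, `best_design`) and all numeric values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class StorageProfile:
    """Hypothetical affine I/O model: one read costs latency + bytes / bandwidth."""
    latency_s: float
    bandwidth_bytes_per_s: float

    def read_time(self, num_bytes: int) -> float:
        return self.latency_s + num_bytes / self.bandwidth_bytes_per_s


def lookup_latency(num_keys: int, fanout: int, entry_bytes: int,
                   profile: StorageProfile) -> Tuple[float, int]:
    """Return (latency_seconds, layers) for a uniform tree with the given fanout.

    Assumption: a lookup reads exactly one node of `fanout` entries per layer.
    """
    layers, capacity = 1, fanout
    while capacity < num_keys:  # smallest depth whose capacity covers all keys
        capacity *= fanout
        layers += 1
    return layers * profile.read_time(fanout * entry_bytes), layers


def best_design(num_keys: int, entry_bytes: int, profile: StorageProfile,
                fanouts: Iterable[int]) -> Tuple[float, int, int]:
    """Brute-force the candidate fanouts; return (latency_seconds, fanout, layers)."""
    best = None
    for fanout in fanouts:
        latency, layers = lookup_latency(num_keys, fanout, entry_bytes, profile)
        if best is None or latency < best[0]:
            best = (latency, fanout, layers)
    return best


# Illustrative (made-up) profiles: a low-latency local device and a high-latency remote store.
local = StorageProfile(latency_s=100e-6, bandwidth_bytes_per_s=2e9)
remote = StorageProfile(latency_s=10e-3, bandwidth_bytes_per_s=1e8)
candidates = [2 ** k for k in range(4, 17)]  # fanouts 16 .. 65536

for name, profile in [("local", local), ("remote", remote)]:
    latency, fanout, layers = best_design(10 ** 8, 16, profile, candidates)
    print(f"{name}: fanout={fanout}, layers={layers}, latency={latency * 1e3:.3f} ms")
```

Under these assumed numbers, the low-latency profile favors a deeper tree with smaller nodes, while the high-latency profile favors fewer layers with larger nodes. This illustrates why a single fixed design cannot be optimal across storage settings. A real search would also vary layer types and per-layer parameters, which is what makes exhaustive enumeration impractical.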