Recent research on learned indexes has recast an index as a model that maps each key to its storage location. A learned index approximates the cumulative distribution function (CDF) of the key set, and a single model may not be accurate enough to do so. A typical remedy is to arrange multiple models in a hierarchy, in which case query performance depends on two components: (i) the traversal time to find the correct model and (ii) the search time to find the key within the selected model. A drawback of this approach is that key space regions that are difficult to model may be placed at deeper levels of the hierarchy. To address this issue, we propose an alternative method that modifies the key space rather than the index structure or the models. The method makes the key set more learnable (i.e., it smooths the distribution) by inserting virtual points. We further develop an algorithm, CSV, that integrates virtual point insertion into existing learned indexes and reduces both their traversal time and their search time. We implement CSV on state-of-the-art learned indexes and evaluate them on real-world datasets. Extensive experiments show significant query performance improvements for keys in the deeper levels of the index structures at a low storage cost.
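To make the idea of smoothing the key distribution concrete, the following is a minimal sketch and not the CSV algorithm itself, whose details the abstract does not give. It assumes a single least-squares linear model over a sorted key array and a naive gap-filling heuristic. The function names (`fit_linear`, `max_error`, `add_virtual_points`) and the `step` parameter are hypothetical and chosen only for illustration.

```python
def fit_linear(keys):
    """Least-squares line mapping each sorted key to its array position."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_p - slope * mean_k


def max_error(layout, real_keys):
    """Largest |predicted - actual| position, measured on real keys only."""
    slope, intercept = fit_linear(layout)
    real = set(real_keys)
    return max(abs(slope * k + intercept - i)
               for i, k in enumerate(layout) if k in real)


def add_virtual_points(keys, step):
    """Assumed heuristic: fill every gap wider than `step` with virtual keys.

    `keys` must be sorted and non-empty. Each virtual key occupies a slot
    in the layout, which is the storage cost of the smoothing.
    """
    out = []
    for a, b in zip(keys, keys[1:]):
        out.append(a)
        v = a + step
        while v < b:
            out.append(v)
            v += step
    out.append(keys[-1])
    return out


# A dense run of keys followed by two sparse outliers: hard for one line.
real = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30]
smoothed = add_virtual_points(real, step=1)

before = max_error(real, real)      # model error on the original layout
after = max_error(smoothed, real)   # model error once gaps are filled
extra_slots = len(smoothed) - len(real)

print(f"max error before: {before:.2f}, after: {after:.2f}, "
      f"virtual points: {extra_slots}")
```

In this toy case the virtual points turn the key set into an evenly spaced sequence, so the single linear model predicts every real key's position without a follow-up search. The trade-off is the extra slots held by the virtual points, and a practical method would have to bound that cost, which this sketch does not attempt.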