We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution-dependent and distribution-independent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others they incorporate elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
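For reference, the two classical 1D baselines named above can be sketched as follows. This is a minimal illustration of binary search and interpolation search for 1D nearest neighbor over a sorted array, not the paper's learned model; function names and the final candidate-window check are our own choices.

```python
import bisect

def nn_binary(xs, q):
    """Nearest neighbor of q in sorted list xs via binary search: O(log n),
    independent of the data distribution."""
    i = bisect.bisect_left(xs, q)
    cands = [xs[j] for j in (i - 1, i) if 0 <= j < len(xs)]
    return min(cands, key=lambda x: abs(x - q))

def nn_interpolation(xs, q):
    """Nearest neighbor of q via interpolation search: probe where q would
    sit if xs were uniformly distributed; O(log log n) expected probes on
    (near-)uniform data, which is the distribution-dependent speedup."""
    lo, hi = 0, len(xs) - 1
    while lo < hi and xs[lo] != xs[hi]:
        # Linear interpolation guess of q's position, clamped to [lo, hi].
        pos = lo + int((q - xs[lo]) * (hi - lo) / (xs[hi] - xs[lo]))
        pos = max(lo, min(hi, pos))
        if xs[pos] < q:
            lo = pos + 1
        elif xs[pos] > q:
            hi = pos - 1
        else:
            return xs[pos]
    # The true insertion point lies in a small window around [lo, hi];
    # scan it and return the closest element.
    cands = xs[max(0, lo - 1): min(len(xs), hi + 2)]
    return min(cands, key=lambda x: abs(x - q))
```

Both routines return an exact nearest neighbor; they differ only in how many probes they spend locating it, which is the query-complexity distinction the framework trades off against space.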