DeepMapping: The Case for Learned Data Mapping for Compression and Efficient Query Processing

Storing tabular data in a way that balances storage and query efficiency is a long-standing research question in the database community. While the literature offers several lossless compression techniques, in this work we argue and show that a novel Deep Learned Data Mapping (or DeepMapping) abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide lower storage cost, lower latency, and a smaller run-time memory footprint, all at the same time. Our proposed DeepMapping abstraction transforms a data set into multiple key-value mappings and constructs a multi-tasking neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a light-weight auxiliary data structure capable of correcting errors. The auxiliary structure further enables DeepMapping to handle insertions, deletions, and updates efficiently, without having to re-train the mapping. Since the shape of the network has a significant impact on the overall size of the DeepMapping structure, we further propose a multi-task hybrid architecture search strategy to identify DeepMapping architectures that strike a desirable balance among memorization capacity, size, and efficiency. Extensive experiments with synthetic and benchmark datasets, including TPC-H and TPC-DS, demonstrate that the proposed DeepMapping approach can significantly reduce the latency of key-based queries, while simultaneously improving both offline and run-time storage requirements relative to several cutting-edge competitors.
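
To make the coupling between the learned model and the auxiliary structure concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a single value per key, uses a hand-written function (`toy_model`) as a stand-in for a trained neural network, and tracks key existence with a plain set. The names `HybridMapping`, `live_keys`, `corrections`, and `toy_model` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Optional, Set


class HybridMapping:
    """Key-value store that trusts a model and records only its mistakes."""

    def __init__(self, model: Callable[[int], str], data: Dict[int, str]) -> None:
        self.model = model                     # stand-in for the learned mapping
        self.live_keys: Set[int] = set()       # assumption: existence tracked in a set
        self.corrections: Dict[int, str] = {}  # auxiliary structure: keys the model gets wrong
        for key, value in data.items():
            self.put(key, value)

    def put(self, key: int, value: str) -> None:
        """Insert or update a key without retraining the model."""
        self.live_keys.add(key)
        if self.model(key) == value:
            # The model already reproduces this value, so no correction is stored.
            self.corrections.pop(key, None)
        else:
            self.corrections[key] = value

    def delete(self, key: int) -> None:
        """Remove a key; the model itself is left untouched."""
        self.live_keys.discard(key)
        self.corrections.pop(key, None)

    def get(self, key: int) -> Optional[str]:
        """Return the stored value, or None if the key does not exist."""
        if key not in self.live_keys:
            return None
        if key in self.corrections:
            return self.corrections[key]
        return self.model(key)


def toy_model(key: int) -> str:
    # Hypothetical "memorized" rule standing in for a trained network.
    return "even" if key % 2 == 0 else "odd"


table = {1: "odd", 2: "even", 3: "odd", 4: "special"}
mapping = HybridMapping(toy_model, table)
```

In this sketch only key 4 lands in `corrections`, because it is the only entry the stand-in model gets wrong. The storage cost of the auxiliary structure therefore grows with the number of memorization errors and modified entries, not with the size of the table, which is the intuition behind pairing a compact model with a small correction structure.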
