Using Set Shaping Theory to Trade RAM Accesses for CPU Computation

This paper studies Set Shaping Theory (SST) in a database-index setting under a revised interpretation: SST is not treated as a competing hashing method, but as a structural preprocessing layer that can be applied before an existing indexing algorithm. The experimental question is therefore whether a method improves when it is used with SST rather than without it. The study compares linear probing, double hashing, quadratic probing, and Robin Hood hashing against their corresponding SST-augmented variants for shaping orders K = 2, 4, 8. Beyond mean time, the benchmark reports mean successful probes, 95th and 99th percentile probes, collisions per stored record, and maximum cluster length. Experiments cover load factors from 0.75 to 0.95, database sizes from M = 5000 to M = 500000, query multipliers up to 200 lookups per stored record, and both uniform and hotspot query distributions. The results highlight two fundamental advantages. First, SST reduces the number of RAM accesses required during retrieval. By preventing clusters and long probe chains from forming at insertion time, the lookup phase requires fewer memory jumps, lower probe counts, and reduced tail latency. Second, the method introduces a new way of thinking about data storage: the data are not treated as fixed objects that must be placed passively into a table, but as reversible representations that can be structurally adapted before being written. A small metadata tag records which transformation was selected, allowing the original key to remain recoverable and the lookup process to remain deterministic. This article is connected to the Set Shaping Theory simulator project, available online at https://sst-simulator.github.io/Set-Shaping-Theory-Simulator/ where it is possible to simulate part of the results presented in the article.
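To make the idea of a reversible, tagged representation concrete, the following is a minimal sketch of one way a shaping layer could sit in front of linear probing. It is not the paper's implementation. The XOR masks, the multiplicative hash, the class name `ShapedTable`, and the choice to have lookup try each candidate transformation in turn are all assumptions made for illustration; the abstract does not specify the transformations or the lookup procedure. The sketch also assumes distinct non-negative integer keys and no deletions.

```python
from typing import List, Optional, Tuple

# Hypothetical reversible transformations: XOR with a fixed mask.
# Tag 0 is the identity, so k = 1 reduces to plain linear probing.
MASKS = [0x00000000, 0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35]


def shape(key: int, tag: int) -> int:
    """Apply the transformation identified by `tag` (assumed: XOR mask)."""
    return key ^ MASKS[tag]


def unshape(shaped: int, tag: int) -> int:
    """Invert `shape`; XOR with the same mask recovers the original key."""
    return shaped ^ MASKS[tag]


def home_slot(value: int, size: int) -> int:
    """Illustrative multiplicative hash mapping a value to a table slot."""
    return ((value * 2654435761) & 0xFFFFFFFF) % size


class ShapedTable:
    """Linear probing with an assumed shaping step of order k at insertion."""

    def __init__(self, size: int, k: int = 4) -> None:
        assert 1 <= k <= len(MASKS)
        self.size = size
        self.k = k
        # Each occupied slot stores (shaped_key, tag).
        self.slots: List[Optional[Tuple[int, int]]] = [None] * size

    def _first_free(self, start: int) -> Tuple[int, int]:
        """Return (slot, probes) for the first empty slot at or after start."""
        for probes in range(1, self.size + 1):
            slot = (start + probes - 1) % self.size
            if self.slots[slot] is None:
                return slot, probes
        raise RuntimeError("table is full")

    def insert(self, key: int) -> int:
        """Store key under the candidate with the shortest probe chain."""
        best = None
        for tag in range(self.k):
            shaped = shape(key, tag)
            slot, probes = self._first_free(home_slot(shaped, self.size))
            if best is None or probes < best[0]:
                best = (probes, slot, shaped, tag)
        _, slot, shaped, tag = best
        self.slots[slot] = (shaped, tag)
        return tag  # the metadata tag that was recorded

    def lookup(self, key: int) -> Optional[int]:
        """Try each candidate in a fixed order; return the recovered key."""
        for tag in range(self.k):
            shaped = shape(key, tag)
            start = home_slot(shaped, self.size)
            for step in range(self.size):
                entry = self.slots[(start + step) % self.size]
                if entry is None:
                    break  # chain ended, key was not stored under this tag
                if entry == (shaped, tag):
                    return unshape(*entry)
        return None
```

In this sketch, insertion spends extra CPU work evaluating up to k candidate placements so that the stored chains stay short, and the stored tag lets `unshape` recover the original key. A lookup that tries candidates in a fixed order is deterministic, but whether it matches the retrieval procedure benchmarked in the paper is an open assumption.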
