Succinct tries are powerful string dictionaries because they combine a low memory footprint with fast query performance. However, existing succinct trie implementations face two key challenges related to spatial locality: 1) they incur unnecessary cache misses during queries, especially during trie navigation operations, and 2) they waste significant space when the data contains many unary paths. We propose C^2, a set of two techniques: C_1 introduces a more cache-friendly layout for the bitvectors underlying succinct tries, and C_2 compresses redundant unary paths. We thoroughly redesign three state-of-the-art succinct tries (FST, CoCo-trie, and Marisa), producing C^2-FST, C^2-CoCo, and C^2-Marisa. Experiments on six diverse datasets show that the C_1 optimization improves query performance by 1.58x, 1.12x, and 1.42x over the original FST, CoCo-trie, and Marisa, respectively. Furthermore, the C_2 optimization reduces the memory footprint by 1.3x on average. The succinct tries optimized with both C_1 and C_2 achieve better space-time tradeoffs than their original versions and other state-of-the-art succinct tries, while using significantly less space than non-succinct tries such as ART and C-ART.
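To make the notion of a unary path concrete, the following minimal sketch builds a plain pointer-based trie and merges every chain of single-child nodes into one edge. It is only an illustration under stated assumptions: it uses nested dictionaries rather than a succinct bitvector encoding, the helper names (`build_trie`, `collapse_unary`, `contains`) and the `$` terminator are hypothetical, and it does not reproduce how C_2 encodes compressed paths.

```python
def build_trie(keys):
    """Build a plain trie as nested dicts; '$' marks the end of a key.

    Assumption: keys never contain the '$' character themselves.
    """
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node["$"] = {}  # terminator leaf
    return root


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.values())


def collapse_unary(node):
    """Return a copy where each chain of single-child nodes is one edge.

    Sibling labels still differ in their first character, so lookups
    remain unambiguous after merging.
    """
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = collapse_unary(child)
    return out


def contains(node, key):
    """Membership test that works on both the plain and collapsed trie."""
    rest = key + "$"
    while rest:
        for label, child in node.items():
            if rest.startswith(label):
                rest, node = rest[len(label):], child
                break
        else:
            return False  # no outgoing edge matches the remaining suffix
    return True


keys = ["cache", "cached", "caching", "trie"]
plain = build_trie(keys)
collapsed = collapse_unary(plain)
print(count_nodes(plain), count_nodes(collapsed))  # 18 7
```

On this toy key set, most nodes sit on unary paths, so merging them removes 11 of 18 nodes while membership queries return the same answers. A succinct trie would need its own encoding for the merged labels, which this sketch does not attempt to model.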