We present a new variable-length, computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcesibility), that supports fast direct access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation that can be geared towards either practical efficiency or compression ratio, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall, it uses $N+\mathcal{O}\big(n \left(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1}\right)\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme qualifies as a \emph{computation-friendly compression} scheme, as it has several features that make it very effective in text processing tasks. For the string matching problem, which we take as a case study, we show experimentally that the new scheme enables searches that are up to 29 times faster than standard string-matching techniques on plain texts.
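To give a feel for the two Fibonacci expressions in the bounds above, the following minimal sketch evaluates them exactly for concrete values of $\sigma$ and $\lambda$. It is illustrative only and is not part of the SFDC implementation. The indexing convention $F_1 = F_2 = 1$, the function names, and the requirement $\lambda \le \sigma + 3$ are assumptions made here, and the sketch computes only the expressions inside the $\mathcal{O}(\cdot)$ terms, not the hidden constants.

```python
from fractions import Fraction


def fib(j: int) -> int:
    """Return F_j, assuming the convention F_0 = 0, F_1 = F_2 = 1."""
    if j < 0:
        raise ValueError("index must be non-negative")
    a, b = 0, 1  # F_0, F_1
    for _ in range(j):
        a, b = b, a + b
    return a


def access_overhead_term(sigma: int, lam: int) -> Fraction:
    """(F_{sigma - lam + 3} - 3) / F_{sigma + 1}: the expression inside the
    expected access-time overhead (hypothetical helper name)."""
    if lam > sigma + 3:
        raise ValueError("assumed here: lam <= sigma + 3")
    return Fraction(fib(sigma - lam + 3) - 3, fib(sigma + 1))


def space_overhead_term(sigma: int, lam: int) -> Fraction:
    """lam - (F_{sigma + 3} - 3) / F_{sigma + 1}: the per-character expression
    inside the space overhead (hypothetical helper name)."""
    return lam - Fraction(fib(sigma + 3) - 3, fib(sigma + 1))


if __name__ == "__main__":
    sigma, lam = 8, 3
    print(access_overhead_term(sigma, lam))  # 9/17
    print(space_overhead_term(sigma, lam))   # 8/17
```

Under this convention, the ratio $F_{\sigma+3}/F_{\sigma+1}$ approaches $\varphi^2 \approx 2.618$ as $\sigma$ grows, so for a fixed $\lambda$ the per-character space term stays bounded by a constant, which is consistent with the stated $N + \mathcal{O}(n)$ total.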