Entropy coding is essential to data compression and to image and video coding. The Range variant of Asymmetric Numeral Systems (rANS) is a modern entropy coder that offers superior speed and compression rate. Because rANS is not designed for parallel execution, the conventional approach to parallel rANS partitions the input symbol sequence and encodes each partition with an independent codec, so every additional partition adds overhead. This approach is found in state-of-the-art implementations such as DietGPU. It is unsuitable for content-delivery applications: if the decoder cannot decode all the partitions in parallel, the parallelism is wasted, yet all the overhead is still transferred. To solve this, we propose Recoil, a parallel rANS decoding approach with decoder-adaptive scalability. We observe that a single rANS-encoded bitstream can be decoded from an arbitrary position if the intermediate states are known. After renormalization, these states also have a smaller upper bound and can therefore be stored efficiently. We then split the encoded bitstream using a heuristic that distributes the workload evenly, and store the intermediate states and corresponding symbol indices as metadata. Splits can later be combined simply by eliminating the extra metadata entries. The main contribution of Recoil is to reduce unnecessary data transfer by adaptively scaling the parallelism overhead to match the decoder's capability. Experiments show that Recoil's decoding throughput is comparable to that of the conventional approach, scales massively on CPUs and GPUs, and greatly outperforms various other ANS-based codecs.
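The observation that one rANS bitstream can be entered at an intermediate position can be illustrated with a minimal sketch. This is not the Recoil implementation: it is a toy single-stream, byte-wise rANS coder with an assumed static three-symbol model, and every name in it (`encode`, `decode`, `snapshots`, the constants) is hypothetical. The encoder records the state and the number of bytes written after each symbol; a decoder given one such record can recover everything before that point without touching the rest of the stream.

```python
# Toy byte-wise rANS; model, constants, and function names are illustrative assumptions.
SCALE_BITS = 4                    # frequencies sum to 1 << SCALE_BITS
M = 1 << SCALE_BITS
L = 1 << 16                       # lower bound of the normalized state interval
FREQ = {"a": 8, "b": 5, "c": 3}   # assumed static model
CUM = {"a": 0, "b": 8, "c": 13}   # cumulative frequencies
SLOT_TO_SYM = [s for s in "abc" for _ in range(FREQ[s])]


def encode(symbols):
    """Encode in forward order; return (data, final_state, snapshots).

    snapshots[i] is (state, bytes_written) right after symbol i was encoded.
    """
    x, out, snapshots = L, bytearray(), []
    for s in symbols:
        f, c = FREQ[s], CUM[s]
        x_max = ((L >> SCALE_BITS) << 8) * f
        while x >= x_max:         # renormalize: spill low bytes to the stream
            out.append(x & 0xFF)
            x >>= 8
        x = ((x // f) << SCALE_BITS) + (x % f) + c
        snapshots.append((x, len(out)))
    return bytes(out), x, snapshots


def decode(data, x, pos, count):
    """Decode `count` symbols, walking backwards from byte offset `pos` with state `x`."""
    result = []
    for _ in range(count):
        slot = x & (M - 1)
        s = SLOT_TO_SYM[slot]
        x = FREQ[s] * (x >> SCALE_BITS) + slot - CUM[s]
        while x < L:              # renormalize: pull bytes back in reverse order
            pos -= 1
            x = (x << 8) | data[pos]
        result.append(s)
    result.reverse()              # symbols come out last-to-first
    return result


message = list("abacabbcaabacbaa" * 8)
data, final_state, snapshots = encode(message)
n = len(message)
split = n // 2                                  # arbitrary split point for the demo
mid_state, mid_pos = snapshots[split - 1]       # metadata for symbols 0..split-1

# The two calls share one bitstream but do not depend on each other.
first_half = decode(data, mid_state, mid_pos, split)
second_half = decode(data, final_state, len(data), n - split)
assert first_half + second_half == message
```

In this sketch, dropping the `snapshots` entry for a split point merges its two segments back into one sequential decode, which mirrors the idea of combining splits by eliminating metadata entries. The workload-balancing heuristic and the compact storage of renormalized states described above are not modeled here.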