Entropy coding is essential to data compression and to image and video coding. The Range variant of Asymmetric Numeral Systems (rANS) is a modern entropy coder, featuring superior speed and compression rate. As rANS is not designed for parallel execution, the conventional approach to parallel rANS partitions the input symbol sequence and encodes each partition with an independent codec; every additional partition adds overhead. This approach is found in state-of-the-art implementations such as DietGPU. It is unsuitable for content-delivery applications: if the decoder cannot decode all the partitions in parallel, the parallelism is wasted, yet all of the overhead is still transferred. To solve this, we propose Recoil, a parallel rANS decoding approach with decoder-adaptive scalability. We discover that a single rANS-encoded bitstream can be decoded from an arbitrary position if the intermediate states are known. After renormalization, these states also have a smaller upper bound, so they can be stored efficiently. We then split the encoded bitstream using a heuristic that distributes the workload evenly, and store the intermediate states and corresponding symbol indices as metadata. Splits can later be combined simply by eliminating the extra metadata entries. The main contribution of Recoil is reducing unnecessary data transfer by adaptively scaling the parallelism overhead to match the decoder's capability. The experiments show that Recoil's decoding throughput is comparable to that of the conventional approach, scaling massively on CPUs and GPUs and greatly outperforming various other ANS-based codecs.
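The central observation, that one rANS bitstream can be entered at an intermediate point once the coder state there is known, can be illustrated with a minimal Python sketch. This is not Recoil's implementation: it assumes a single-state, byte-renormalized rANS coder with a toy four-symbol model, it places splits at a fixed symbol interval instead of using the workload heuristic, and it stores each full intermediate state rather than the compact bounded form mentioned above. The names `encode`, `decode_segment`, `decode_all`, `FREQ`, `CUM`, and `split_every` are hypothetical.

```python
import bisect

SCALE_BITS = 12                 # model frequencies sum to M = 2**12
M = 1 << SCALE_BITS
RANS_L = 1 << 23                # lower bound of the normalized state range

FREQ = [2048, 1024, 512, 512]   # toy model (assumption), sums to M
CUM = [0, 2048, 3072, 3584]     # exclusive prefix sums of FREQ


def encode(symbols, split_every):
    """Encode front to back; return (bytes, final_state, checkpoints)."""
    x = RANS_L
    out = bytearray()
    checkpoints = []            # (symbols_encoded, state, bytes_written)
    for i, s in enumerate(symbols):
        f = FREQ[s]
        x_max = ((RANS_L >> SCALE_BITS) << 8) * f
        while x >= x_max:       # renormalize: shift low bytes out
            out.append(x & 0xFF)
            x >>= 8
        x = ((x // f) << SCALE_BITS) + (x % f) + CUM[s]
        # Record an entry point (fixed interval here, not the paper's heuristic).
        if (i + 1) % split_every == 0 and i + 1 < len(symbols):
            checkpoints.append((i + 1, x, len(out)))
    return bytes(out), x, checkpoints


def decode_segment(data, state, pos, hi, lo):
    """Decode symbols hi-1 down to lo, reading `data` backwards from `pos`."""
    x = state
    result = [0] * (hi - lo)
    for i in range(hi - 1, lo - 1, -1):
        slot = x & (M - 1)
        s = bisect.bisect_right(CUM, slot) - 1   # symbol owning this slot
        result[i - lo] = s
        x = FREQ[s] * (x >> SCALE_BITS) + slot - CUM[s]
        while x < RANS_L and pos > 0:            # renormalize: pull bytes in
            pos -= 1
            x = (x << 8) | data[pos]
    return result


def decode_all(data, final_state, checkpoints, n):
    """Decode every segment from its own entry point.

    The segments do not depend on each other, so each loop iteration
    could run on a separate thread; it is serial here for clarity.
    """
    entries = list(checkpoints) + [(n, final_state, len(data))]
    symbols, lo = [], 0
    for hi, state, pos in entries:
        symbols += decode_segment(data, state, pos, hi, lo)
        lo = hi
    return symbols
```

In this sketch, passing a subset of the checkpoints (for example `checkpoints[::2]`) to `decode_all` still reproduces the input, with fewer and longer segments. That mirrors the idea of combining splits by dropping metadata entries so that the transferred overhead matches the parallelism the decoder can actually use.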