TensorCodec: Compact Lossy Compression of Tensors without Strong Data Assumptions

Many real-world datasets are represented as tensors, i.e., multi-dimensional arrays of numerical values. Storing them without compression often requires substantial space, which grows exponentially with the order. While many tensor compression algorithms are available, most of them rely on strong assumptions about the data's order, sparsity, rank, and smoothness. In this work, we propose TENSORCODEC, a lossy compression algorithm for general tensors that do not necessarily adhere to strong input data assumptions. TENSORCODEC incorporates three key ideas. The first is Neural Tensor-Train Decomposition (NTTD), where we integrate a recurrent neural network into Tensor-Train Decomposition to enhance its expressive power and alleviate the limitations imposed by the low-rank assumption. The second is to fold the input tensor into a higher-order tensor to reduce the space required by NTTD. The third is to reorder the mode indices of the input tensor to reveal patterns that NTTD can exploit for improved approximation. Our analysis and experiments on 8 real-world datasets demonstrate that TENSORCODEC is (a) Concise: it gives up to 7.38x more compact compression than the best competitor with similar reconstruction error; (b) Accurate: given the same budget for compressed size, it yields up to 3.33x more accurate reconstruction than the best competitor; and (c) Scalable: its empirical compression time is linear in the number of tensor entries, and it reconstructs each entry in logarithmic time. Our code and datasets are available at https://github.com/kbrother/TensorCodec.
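To make the folding and Tensor-Train ideas concrete, the following is a minimal sketch in plain Python. It is not the authors' NTTD: the cores here are fixed, hand-written matrices rather than outputs of a recurrent neural network, and the folding shown is a simple mixed-radix split of a single mode index, which is only one possible folding scheme. The names fold_index, matmul, and tt_entry are hypothetical helpers introduced for illustration.

```python
def fold_index(i, factors):
    """Split a mode index i in range(prod(factors)) into mixed-radix digits.

    Assumption: one simple way to fold a mode of size prod(factors)
    into len(factors) smaller modes; the paper's scheme may differ.
    """
    digits = []
    for f in reversed(factors):
        digits.append(i % f)
        i //= f
    return digits[::-1]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


def tt_entry(cores, index):
    """Reconstruct one entry from Tensor-Train cores.

    cores[k][i] is the matrix selected by index value i in mode k.
    The first core's matrices have one row and the last core's have
    one column, so the chained product is a 1 x 1 matrix.
    """
    result = cores[0][index[0]]
    for core, i in zip(cores[1:], index[1:]):
        result = matmul(result, core[i])
    return result[0][0]


# Illustrative rank-2 cores whose product equals i + j + k
# for a 2 x 2 x 2 tensor (hand-built, not learned).
cores = [
    [[[1, i]] for i in range(2)],            # 1 x 2 matrices
    [[[1, j], [0, 1]] for j in range(2)],    # 2 x 2 matrices
    [[[k], [1]] for k in range(2)],          # 2 x 1 matrices
]

# Fold position 5 of a length-8 vector into a 2 x 2 x 2 index,
# then reconstruct that single entry from the cores.
folded = fold_index(5, [2, 2, 2])
value = tt_entry(cores, folded)
```

In this sketch a length-8 vector is treated as a 2 x 2 x 2 tensor, and a single entry is recovered with one small matrix product per folded mode, without materializing the full tensor. In NTTD as described above, the per-index matrices would instead be generated by a recurrent neural network.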
