Many real-world datasets are represented as tensors, i.e., multi-dimensional arrays of numerical values. Storing them without compression often requires substantial space, which grows exponentially with the order of the tensor. While many tensor compression algorithms are available, many of them rely on strong assumptions about the data's order, sparsity, rank, and smoothness. In this work, we propose TENSORCODEC, a lossy compression algorithm for general tensors that do not necessarily adhere to such assumptions. TENSORCODEC incorporates three key ideas. The first is Neural Tensor-Train Decomposition (NTTD), in which we integrate a recurrent neural network into Tensor-Train Decomposition to enhance its expressive power and alleviate the limitations imposed by the low-rank assumption. The second is to fold the input tensor into a higher-order tensor to reduce the space required by NTTD. The third is to reorder the mode indices of the input tensor to reveal patterns that NTTD can exploit for improved approximation. Our analysis and experiments on 8 real-world datasets demonstrate that TENSORCODEC is (a) Concise: it gives up to 7.38x more compact compression than the best competitor with similar reconstruction error, (b) Accurate: given the same budget for compressed size, it yields up to 3.33x more accurate reconstruction than the best competitor, and (c) Scalable: its empirical compression time is linear in the number of tensor entries, and it reconstructs each entry in logarithmic time. Our code and datasets are available at https://github.com/kbrother/TensorCodec.
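To make the folding and per-entry reconstruction ideas concrete, the following is a minimal sketch in Python with NumPy. It is not the TENSORCODEC implementation: it uses a plain Tensor-Train with fixed random cores rather than NTTD, where a recurrent network would produce the cores, and it folds a single length-16 mode instead of a full multi-mode tensor. The helper names `fold_index` and `tt_entry`, the rank, and the folded shape are all illustrative assumptions.

```python
import numpy as np


def fold_index(i, bases):
    """Split one mode index into mixed-radix digits, most significant first.

    Each digit indexes one mode of the folded, higher-order tensor.
    """
    digits = []
    for b in reversed(bases):
        digits.append(i % b)
        i //= b
    return digits[::-1]


def tt_entry(cores, index):
    """Reconstruct a single entry as a product of Tensor-Train core slices.

    cores[k] has shape (r_k, n_k, r_{k+1}) with r_0 = r_last = 1.
    The cost is one small vector-matrix product per folded mode.
    """
    vec = np.ones((1,))
    for core, j in zip(cores, index):
        vec = vec @ core[:, j, :]
    return vec.item()


rng = np.random.default_rng(0)
bases = [2, 2, 2, 2]        # assumed folding: length 16 -> shape (2, 2, 2, 2)
ranks = [1, 3, 3, 3, 1]     # assumed TT ranks, chosen only for illustration
cores = [
    rng.standard_normal((ranks[k], bases[k], ranks[k + 1]))
    for k in range(len(bases))
]

# Build the full folded tensor once, only to check the entry-wise routine.
full = cores[0]
for core in cores[1:]:
    full = np.tensordot(full, core, axes=([-1], [0]))
full = full.reshape(bases)

flat = full.reshape(-1)
recovered = [tt_entry(cores, fold_index(i, bases)) for i in range(flat.size)]
```

Because a mode of length N folds into roughly log N small modes, reconstructing one entry touches only that many core slices, which is the intuition behind the logarithmic per-entry reconstruction time stated in the abstract.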