Compressed Sparse Column (CSC) and Coordinate (COO) are popular compression formats for sparse matrices. However, both are general-purpose formats that exploit no property of the data other than sparsity; in particular, they cannot take advantage of data redundancy. Highly redundant sparse data is common in many machine learning applications, such as genomics, and is often too large for in-core computation when stored in conventional sparse formats. In this paper, we present two extensions to CSC: (1) Value-Compressed Sparse Column (VCSC) and (2) Index- and Value-Compressed Sparse Column (IVCSC). VCSC takes advantage of high redundancy within a column to compress data up to 3-fold over COO and 2.25-fold over CSC, without significant negative impact on performance. IVCSC extends VCSC by compressing the index arrays through delta encoding and byte-packing, achieving a 10-fold decrease in memory usage over COO and a 7.5-fold decrease over CSC. Our benchmarks on simulated and real data show that VCSC and IVCSC can be read in compressed form with little added computational cost. These two novel compression formats offer a broadly useful solution for encoding and reading redundant sparse data.
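To make the two ideas concrete, the following is a minimal Python sketch of value grouping within one column and of delta encoding with byte-packing of the row indices. It is an illustration under stated assumptions, not the paper's actual memory layout: the function names (`compress_column`, `delta_encode`, `byte_pack`, and their inverses) are hypothetical, the row indices of a column are assumed to arrive in ascending order, and a single fixed byte width per index run is an assumed simplification.

```python
from collections import defaultdict


def compress_column(rows, values):
    """Group one column's nonzeros by unique value (hypothetical VCSC-like layout).

    Returns (unique values, count per value, row indices grouped by value).
    Assumes `rows` is sorted ascending, so each group stays sorted too.
    """
    groups = defaultdict(list)
    for r, v in zip(rows, values):
        groups[v].append(r)
    uniq = sorted(groups)
    counts = [len(groups[v]) for v in uniq]
    indices = [r for v in uniq for r in groups[v]]
    return uniq, counts, indices


def delta_encode(sorted_indices):
    """Replace each index by its distance from the previous one (first from 0)."""
    out, prev = [], 0
    for i in sorted_indices:
        out.append(i - prev)
        prev = i
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def byte_pack(deltas):
    """Store every delta in the smallest fixed byte width that fits the largest one."""
    width = max(1, (max(deltas, default=0).bit_length() + 7) // 8)
    payload = b"".join(d.to_bytes(width, "little") for d in deltas)
    return width, payload


def byte_unpack(width, payload):
    """Invert byte_pack."""
    return [
        int.from_bytes(payload[i:i + width], "little")
        for i in range(0, len(payload), width)
    ]


# One column with five nonzeros but only two distinct values.
rows = [0, 3, 4, 9, 300]
values = [1, 2, 1, 1, 2]

uniq, counts, indices = compress_column(rows, values)

# Delta-encode and pack each value's run of row indices separately.
packed_runs, start = [], 0
for n in counts:
    run = indices[start:start + n]
    packed_runs.append(byte_pack(delta_encode(run)))
    start += n
```

In this sketch each distinct value is stored once together with a count, so a column with few distinct values needs far fewer value entries than plain CSC, and small gaps between consecutive row indices pack into one byte each instead of a full-width integer. Reading the column back only requires unpacking and a running sum per run, which is consistent with the claim that the formats can be traversed in compressed form at low cost.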