This thesis deals with the problem of communicating and storing non-sequential data. We investigate this problem through the lens of lossless source coding, also referred to as lossless compression, from both an algorithmic and an information-theoretic perspective. Lossless compression algorithms typically preserve the order in which data points are compressed. However, there are data types where order is not meaningful, such as collections of files, rows in a database, nodes in a graph, and, notably, datasets in machine learning applications. Compressing such data with traditional algorithms is possible if we pick an order for the elements and communicate the corresponding ordered sequence. However, unless the order information is removed during the encoding process, this procedure is sub-optimal: the order itself carries information, so more bits are used to represent the source than are necessary. In this work we give a formal definition of non-sequential objects as random sets of equivalent sequences, which we refer to as Combinatorial Random Variables (CRVs). The notion of equivalence, formalized as an equivalence relation, establishes the non-sequential data type represented by the CRV. The achievable rates of CRVs are fully characterized as a function of the equivalence relation and the data distribution. The optimal rates of CRVs are achieved within the family of Random Permutation Codes (RPCs) developed in later chapters. RPCs randomly select one of the many possible sequences that can represent the instance of the CRV. Specialized RPCs are given for multisets, graphs, and partitions/clusterings, providing new algorithms for the compression of databases, social networks, and web data in the JSON file format.
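To make the cost of order information concrete, the following minimal sketch counts the distinct orderings of a multiset and reports the base-2 logarithm of that count. Under the assumption that the elements are exchangeable (for example, drawn i.i.d.), every distinct ordering of a given multiset is equally likely, so this quantity is the number of bits spent on order when the multiset is communicated as an ordered sequence. The function name `order_bits` is hypothetical and is not taken from the thesis; the sketch illustrates the counting argument only and is not an implementation of RPCs.

```python
from collections import Counter
from math import factorial, log2


def order_bits(sequence):
    """Return log2 of the number of distinct orderings of the items in `sequence`.

    Assumption (not from the thesis): elements are exchangeable, so each
    distinct ordering of the underlying multiset is equally likely.
    """
    counts = Counter(sequence)
    # Multinomial coefficient: n! / (m_1! * m_2! * ... * m_k!),
    # where m_i is the multiplicity of the i-th distinct element.
    n_orderings = factorial(len(sequence))
    for multiplicity in counts.values():
        n_orderings //= factorial(multiplicity)  # exact at every step
    return log2(n_orderings)


if __name__ == "__main__":
    # Three distinct elements: 3! = 6 orderings.
    print(order_bits("abc"))
    # Repeated elements reduce the number of distinct orderings: 3!/2! = 3.
    print(order_bits("aab"))
    # A multiset with one distinct element has a single ordering.
    print(order_bits("aaaa"))
```

For a collection of n distinct elements this reduces to log2(n!), which grows on the order of n log2(n) bits, so the overhead of communicating an arbitrary order becomes substantial for large collections.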