Assessing whether two datasets are distributionally consistent has become a central theme in modern scientific analysis. Generative artificial intelligence is increasingly used to produce synthetic datasets whose fidelity must be rigorously validated against the original data on which the generative models are trained, a task made more challenging by the continued growth in data volume and problem dimensionality. In this work, we propose the use of arithmetic coding to provide a lossless and invertible compression of datasets under a physics-informed probabilistic representation. Datasets that share the same underlying physical correlations admit comparable optimal descriptions, while discrepancies in those correlations, arising from miscalibration, mismodeling, or bias, manifest as an irreducible excess in codelength. This excess codelength defines an operational fidelity metric, quantified directly in bits as the difference in achievable compression length relative to a physics-inspired reference distribution. We demonstrate that this metric is global, interpretable, additive across components, and asymptotically optimal in the Shannon sense. Moreover, we show that differences in codelength correspond to differences in expected negative log-likelihood evaluated under the same physics-informed reference model. As a byproduct, we also demonstrate that our compression approach achieves a higher compression ratio than traditional general-purpose algorithms such as gzip. Our results establish lossless, physics-aware compression based on arithmetic coding not as an end in itself, but as a measurement instrument for testing the fidelity between datasets.
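The correspondence between codelength and negative log-likelihood can be illustrated with a minimal sketch. It does not run an arithmetic coder; it computes the ideal Shannon codelength, the sum of -log2 q(x) over the samples, which an arithmetic coder approaches up to a small constant overhead. The three-symbol alphabet, the reference distribution `ref`, the two toy datasets, and the helper `codelength_bits` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def codelength_bits(samples, ref_probs):
    """Ideal code length in bits of `samples` under the reference model.

    Equals the total negative log2-likelihood of the samples under
    `ref_probs` (a hypothetical stand-in for a reference distribution).
    """
    return sum(-math.log2(ref_probs[s]) for s in samples)


# Hypothetical reference distribution over a toy alphabet.
ref = {"a": 0.5, "b": 0.25, "c": 0.25}

# Toy "original" dataset whose symbol frequencies match the reference.
original = ["a"] * 8 + ["b"] * 4 + ["c"] * 4

# Toy "synthetic" dataset whose frequencies deviate from the reference.
synthetic = ["a"] * 4 + ["b"] * 4 + ["c"] * 8

bits_original = codelength_bits(original, ref)
bits_synthetic = codelength_bits(synthetic, ref)

# Difference in codelength under the same reference model, in bits.
excess_total = bits_synthetic - bits_original
excess_per_sample = excess_total / len(synthetic)

print(f"original : {bits_original:.2f} bits")
print(f"synthetic: {bits_synthetic:.2f} bits")
print(f"excess   : {excess_total:.2f} bits ({excess_per_sample:.2f} bits/sample)")

# Additivity: the codelength of a concatenation is the sum of the parts.
assert math.isclose(
    codelength_bits(original + synthetic, ref),
    bits_original + bits_synthetic,
)
```

In this toy setting the two datasets have the same size, so the difference in total codelength is simply the sample count times the difference in average negative log2-likelihood under `ref`. The final assertion checks the additivity of the ideal codelength over independent parts.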