Assessing whether two datasets are distributionally consistent is central to modern scientific analysis, particularly as generative artificial intelligence produces synthetic data whose fidelity must be validated against real observations in increasingly high-dimensional settings. Existing approaches are typically relative: they determine whether one dataset is more consistent with a reference than another, but do not provide a physically grounded absolute standard for fidelity. We propose an information-theoretic approach in which lossless compression via arithmetic coding provides an operational measure of dataset fidelity under a physics-informed probabilistic representation. Datasets sharing the same underlying physical correlations admit comparable optimal descriptions, while discrepancies arising from miscalibration, mismodeling, or bias manifest as an irreducible excess in codelength relative to the Shannon-optimal limit defined by the physics itself. This excess codelength defines an absolute fidelity metric, quantified directly in bits. Unlike conventional measures, which lack an intrinsic scale, the excess codelength has a well-defined and physically meaningful zero point corresponding to consistency with the underlying distribution. We show that this metric is global, interpretable, additive across components, and asymptotically optimal, with differences in codelength corresponding to differences in expected negative log-likelihood under a common reference model. As a byproduct, our approach achieves improved compression relative to standard general-purpose algorithms such as gzip. These results establish arithmetic coding not merely as a compression tool, but as a measurement instrument for absolute, physics-grounded assessment of distributional fidelity.
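The following is a minimal illustrative sketch, not the paper's pipeline: for discrete data, an arithmetic coder driven by a reference model q assigns close to -log2 q(x) bits per symbol, so the excess over the Shannon limit H(p) of the true distribution p equals the cross-entropy minus the entropy, i.e. the KL divergence D(p || q) in bits. The distributions and dataset below are hypothetical stand-ins for a physics-informed model and real or synthetic observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "physics" distribution p over 4 symbols and a miscalibrated
# reference model q (assumed for illustration only).
p = np.array([0.40, 0.30, 0.20, 0.10])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Draw a synthetic dataset from p.
data = rng.choice(4, size=100_000, p=p)

# Average codelength per symbol that an (ideal) arithmetic coder would
# spend when driven by the model q.
codelength_q = -np.log2(q[data]).mean()

# Shannon-optimal limit defined by p itself.
entropy_p = -(p * np.log2(p)).sum()

# Excess codelength per symbol, approximately KL(p || q) in bits;
# it vanishes when the model matches the underlying distribution.
excess_bits = codelength_q - entropy_p
print(f"codelength under q : {codelength_q:.4f} bits/symbol")
print(f"Shannon limit H(p) : {entropy_p:.4f} bits/symbol")
print(f"excess codelength  : {excess_bits:.4f} bits/symbol")
```

In this toy setting the excess is strictly positive because q is miscalibrated relative to p; a model consistent with the underlying distribution would drive the excess toward zero, which is the absolute target the abstract refers to.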