From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

Can we learn more from data than existed in the generating process itself? Can new and useful information be constructed merely by applying deterministic transformations to existing data? Can the learnable content in data be evaluated without considering a downstream task? On these questions, Shannon information and Kolmogorov complexity come up nearly empty-handed, in part because they assume observers with unlimited computational capacity and do not target the useful information content. In this work, we identify and exemplify three seeming paradoxes in information theory: (1) information cannot be increased by deterministic transformations; (2) information is independent of the order of data; (3) likelihood modeling is merely distribution matching. To shed light on the tension between these results and modern practice, and to quantify the value of data, we introduce epiplexity, a formalization of information capturing what computationally bounded observers can learn from data. Epiplexity captures the structural content in data while excluding time-bounded entropy, the random unpredictable content exemplified by pseudorandom number generators and chaotic dynamical systems. With these concepts, we demonstrate how information can be created with computation, how it depends on the ordering of the data, and how likelihood modeling can produce more complex programs than are present in the data generating process itself. We also present practical procedures to estimate epiplexity, which we show capture differences across data sources, track with downstream performance, and highlight dataset interventions that improve out-of-distribution generalization. In contrast to principles of model selection, epiplexity provides a theoretical foundation for data selection, guiding how to select, generate, or transform data for learning systems.
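The gap between an unbounded observer and a bounded one can be illustrated with a small experiment. The sketch below is not the estimation procedure of this work; it assumes, purely for illustration, that a general-purpose compressor (zlib) can stand in for a computationally bounded observer, and the names `compressed_size`, `pseudorandom`, and `structured` are hypothetical. A pseudorandom byte stream is produced by a short deterministic program from a tiny seed, so its shortest description is small, yet the bounded compressor finds almost nothing to exploit in it. A simple repeating pattern, by contrast, is structure the same compressor picks up readily.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib encoding at the highest compression level.

    Assumption: used here as a crude proxy for the code length achievable
    by a computationally bounded observer.
    """
    return len(zlib.compress(data, 9))


n = 100_000

# Output of a deterministic generator with a fixed seed: a short program
# reproduces it exactly, but it looks random to a bounded observer.
rng = random.Random(0)
pseudorandom = bytes(rng.getrandbits(8) for _ in range(n))

# A repeating 16-byte pattern: structure that a bounded observer can learn.
structured = bytes(i % 16 for i in range(n))

pr_size = compressed_size(pseudorandom)
st_size = compressed_size(structured)

print(f"pseudorandom: {pr_size} bytes after compression (input {n})")
print(f"structured:   {st_size} bytes after compression (input {n})")
```

Under this stand-in, the pseudorandom stream stays close to its original size while the structured stream shrinks to a small fraction of it, which mirrors the distinction drawn above between random unpredictable content and structural content for a bounded observer.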
