We extend classical rate-distortion theory to a discrete classification setting with three resources: tag rate $L$ (bits of storage per entity), identification cost $W$ (queries to determine class membership), and distortion $D$ (misidentification probability). We prove an information barrier: when distinct classes share identical attribute profiles (i.e., the attribute-profile map $\pi$ is not injective on classes), zero-error identification from attribute queries alone is impossible. We characterize the unique Pareto-optimal zero-error point in the $(L,W,D)$ tradeoff space: a nominal tag of length $L=\lceil\log_2 k\rceil$ bits for $k$ classes yields $W=O(1)$ and $D=0$. Without tags ($L=0$), zero-error identification requires $W=\Omega(d)$ attribute queries, where $d$ is the distinguishing dimension; in the worst case $d=n$ (the ambient attribute count), giving $W=\Omega(n)$. In the presence of attribute collisions, any tag-free scheme incurs $D>0$. Conversely, in any information-barrier domain, any scheme achieving $D=0$ requires $L\ge \log_2 k$ bits; this bound is tight. We show that minimal sufficient query sets form the bases of a matroid, so the distinguishing dimension is well-defined, connecting to zero-error source coding via graph entropy. We instantiate the theory for type systems, databases, and biological taxonomy. All results are machine-checked in Lean 4 (6000+ lines, 0 sorry).
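The two central quantities of the abstract, the distinguishing dimension $d$ and the tag-length bound $\lceil\log_2 k\rceil$, can be illustrated on a toy instance. The sketch below is not from the paper's Lean development; the three-class example with a deliberate attribute collision is a hypothetical construction, and `distinguishing_dimension` is an assumed brute-force helper that searches for the smallest query set separating all classes.

```python
from itertools import combinations
from math import ceil, log2

# Hypothetical toy instance: k = 3 classes over n = 4 binary attributes.
# Classes "A" and "B" share the same attribute profile, so the
# attribute-profile map is not injective on classes (an information barrier).
profiles = {
    "A": (1, 0, 1, 0),
    "B": (1, 0, 1, 0),  # collides with "A"
    "C": (0, 1, 1, 0),
}

def distinguishing_dimension(profiles):
    """Smallest number of attribute queries whose answers separate every
    pair of classes, or None when some pair collides on all attributes
    (i.e., no tag-free scheme achieves D = 0)."""
    n = len(next(iter(profiles.values())))
    classes = list(profiles)
    for d in range(n + 1):
        for subset in combinations(range(n), d):
            proj = {c: tuple(profiles[c][i] for i in subset) for c in classes}
            if len(set(proj.values())) == len(classes):
                return d
    return None  # attribute collision: zero-error identification impossible

print(distinguishing_dimension(profiles))  # None: barrier domain, D > 0 without tags
print(ceil(log2(len(profiles))))           # 2: tag bits sufficient for D = 0, W = O(1)
```

In the barrier case the search exhausts all $2^n$ query subsets and fails, matching the abstract's claim that any tag-free scheme incurs $D>0$; a $\lceil\log_2 3\rceil = 2$-bit nominal tag instead resolves the class in $O(1)$.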