Identification capacity and rate-query tradeoffs in classification systems

From arXiv: 15 pages, 1 table. A Lean 4 formalization (6,100+ lines, 0 sorry) is included in the source and archived at https://doi.org/10.5281/zenodo.18261188

We study a one-shot identification analogue of rate-distortion for discrete classification under three resources: tag rate L (bits of side information stored per entity), identification cost W (attribute-membership queries per identification, excluding global preprocessing and amortized caching), and distortion D (misclassification probability). The question is to characterize achievable triples (L, W, D) when a decoder must recover an entity's class from limited observations.

Zero-error barrier. If two distinct classes induce the same attribute profile, then the observation pi(V) is identical for both, and no decoder can identify the class from attribute queries alone. Thus, if the profile map pi is not injective on classes, zero-error identification without tags is impossible (a zero-error feasibility threshold).

Achievability and converse at D=0. With k classes, nominal tags of L = ceil(log2 k) bits enable O(1) identification cost with D=0. Conversely, any scheme with D=0 must satisfy L >= log2 k bits (tight). Without tags (L=0), identification requires Omega(n) queries in the worst case and may incur D>0.

Combinatorial structure. Minimal sufficient query families form the bases of a matroid; the induced distinguishing dimension is well-defined and links to zero-error source coding via graph entropy.

We illustrate implications for type systems, databases, and biological taxonomy. All results are mechanized in Lean 4 (6000+ lines, 0 sorry).
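The zero-error barrier and the nominal-tag bound can be illustrated on a toy instance. The sketch below is not taken from the paper or its Lean formalization; the class names, attribute names, and helper functions (`colliding_classes`, `tag_bits`, `assign_tags`) are hypothetical and chosen only for illustration. It groups classes by attribute profile to detect a non-injective profile map, and then assigns each class a distinct tag of ceil(log2 k) bits so that identification is a single table lookup.

```python
# Hypothetical toy instance: each class maps to the set of attributes it has.
# "Meters" and "Seconds" deliberately share a profile, so the profile map
# is not injective on classes.
PROFILES = {
    "Meters": frozenset({"add", "scale"}),
    "Seconds": frozenset({"add", "scale"}),
    "Label": frozenset({"concat"}),
}


def colliding_classes(profiles):
    """Return the groups of classes that share an attribute profile.

    A non-empty result means attribute queries alone cannot separate
    the classes inside each group, whatever decoder is used.
    """
    groups = {}
    for cls, profile in profiles.items():
        groups.setdefault(profile, []).append(cls)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def tag_bits(k):
    """Bits needed for a nominal tag naming one of k classes: ceil(log2 k).

    Computed with integer arithmetic to avoid floating-point rounding.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1).bit_length()


def assign_tags(profiles):
    """Give every class a distinct integer tag and build the lookup decoder."""
    encode = {cls: tag for tag, cls in enumerate(sorted(profiles))}
    decode = {tag: cls for cls, tag in encode.items()}
    return encode, decode


collisions = colliding_classes(PROFILES)
encode, decode = assign_tags(PROFILES)
bits = tag_bits(len(PROFILES))
```

On this instance `collisions` contains the single group `["Meters", "Seconds"]`, so no tag-free scheme reaches D=0, while `bits` is 2 and `decode[encode[c]]` recovers every class `c` with one lookup and no attribute queries.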
