Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. Exact identity recovery requires additional information precisely when representation fibers have size greater than one. The residual cost is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $π$. Let $A_π=\max_u |π^{-1}(u)|$ be the largest collision fiber. The finite laws include a tight fixed-length converse $L \ge \log_2 A_π$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)|\rceil$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_π$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.
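As a rough illustration of the finite laws quoted above, the following Python sketch computes the collision fibers of a toy representation map, the fixed-length budget implied by $L \ge \log_2 A_π$, the pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)|\rceil$, and the uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$. The toy map `TOY_PI` and all function names are hypothetical and chosen only for illustration; they are not taken from the paper or its Lean 4 development.

```python
from collections import defaultdict
from fractions import Fraction


def fibers(pi):
    """Group entities by representation value: u -> pi^{-1}(u).

    `pi` is a dict from entity to representation value (illustrative only).
    """
    out = defaultdict(list)
    for entity, u in pi.items():
        out[u].append(entity)
    return dict(out)


def ceil_log2(n):
    """Exact ceil(log2(n)) for integers n >= 1, avoiding float rounding."""
    return (n - 1).bit_length()


def fixed_length_bits(fib):
    """Smallest integer L satisfying L >= log2(A_pi), A_pi the largest fiber."""
    a_max = max(len(block) for block in fib.values())
    return ceil_log2(a_max)


def adaptive_bits(fib, u):
    """Pointwise budget ceil(log2 |pi^{-1}(u)|) for a given representation u."""
    return ceil_log2(len(fib[u]))


def uniform_block_distortion(L, a):
    """Evaluate D*(L) = max(0, 1 - 2^L / a) exactly, as stated in the abstract."""
    return max(Fraction(0), 1 - Fraction(2 ** L, a))


# Hypothetical toy map: eight entities collapse onto three representation values,
# giving fibers of size 1, 2, and 5.
TOY_PI = {"a": 0, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2, "g": 2, "h": 2}

fib = fibers(TOY_PI)
print({u: len(block) for u, block in fib.items()})   # fiber sizes per value
print(fixed_length_bits(fib))                         # worst-case budget
print([adaptive_bits(fib, u) for u in sorted(fib)])   # per-fiber budgets
print([uniform_block_distortion(L, 5) for L in range(4)])
```

In this toy case the singleton fiber needs no extra bits, while the five-element fiber sets the worst-case fixed-length budget at three bits; the distortion formula reaches zero once $2^L \ge a$.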