Foundation models for biology and physics are optimized for predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause as the Geometric Alignment Tax: an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems show that replacing the cross-entropy objective with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind in which finer quantization worsens geometry even as it improves reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and mutual information neural estimation (MINE), we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model simultaneously achieves low distortion, high mutual information, and global coherence.
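As a toy illustration of the kind of cost the abstract describes, the following sketch compares how well a continuous encoding and a categorical (one-hot) encoding preserve distances on a circle. Everything here is an assumption made for illustration: the circle as the underlying manifold, the bin count `K`, and the distortion score (one minus the Pearson correlation between true and represented pairwise distances) are not taken from the paper, whose actual metrics and experimental setup are not specified in the abstract.

```python
import numpy as np

def pairwise_dist(x):
    """Euclidean distance between every pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def distortion(d_true, d_rep):
    """Hypothetical distortion score: 1 - Pearson correlation between
    true and represented pairwise distances (upper triangle only)."""
    iu = np.triu_indices_from(d_true, k=1)
    r = np.corrcoef(d_true[iu], d_rep[iu])[0, 1]
    return 1.0 - r

rng = np.random.default_rng(0)

# Points on a circle: a simple continuous 1-D manifold (assumed toy system).
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)

# Ground-truth geometry: geodesic (arc-length) distance on the unit circle.
delta = np.abs(theta[:, None] - theta[None, :])
d_true = np.minimum(delta, 2.0 * np.pi - delta)

# Continuous representation: embed each angle as (cos, sin).
cont = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Categorical representation: bin the angle into K classes, then one-hot.
# All distinct classes are equidistant, so metric structure is discarded.
K = 16  # assumed bin count
idx = np.floor(theta / (2.0 * np.pi) * K).astype(int) % K
onehot = np.eye(K)[idx]

dist_cont = distortion(d_true, pairwise_dist(cont))
dist_disc = distortion(d_true, pairwise_dist(onehot))

print(f"continuous distortion:  {dist_cont:.3f}")
print(f"categorical distortion: {dist_disc:.3f}")
```

In this toy setting the one-hot encoding can only say whether two points share a bin, so its pairwise distances track the circle's geometry far less closely than the continuous embedding does. This is only a caricature of a categorical bottleneck and says nothing about the magnitudes reported in the abstract.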