The Information-Theoretic Benefit of Shared Representations under Orthogonality Constraints

Modern deep learning architectures are increasingly multi-task and multi-modal, combining a pretrained foundation model with task-specific fine-tuned models. Empirically, exploiting similarity across different problems, instead of solving them individually, can significantly improve overall performance. While the generalization and sample-complexity properties of multitask learning have been widely studied, the parametric complexity of joint approximation compared with separate approximation remains less well understood. The question is particularly relevant in modern deep learning, where models are increasingly required to satisfy structural constraints such as equivariance, conservation laws, or orthogonality. We prove lower and upper bounds on the description length for separate and joint approximation classes, respectively, in the uniform norm. We build a class of orthogonal functions by composing a shared hard feature, realized by a Rademacher-Haar wavelet series, with Sawtooth-Walsh readouts that enforce orthogonality of the output coordinates. The dyadic tree structure of the Rademacher-Haar wavelet concentrates the approximation hardness in the common feature component, while the readouts act as task-specific heads. Using an information-theoretic framework, we obtain a sharp gap between the optimal approximation rates achievable by joint and separate coding. Finally, we realize this separation in a neural network model with Heaviside activations via a reduction to triangle-wave approximation. Our results show that, even under an orthogonality constraint, joint approximation requires strictly fewer bits in compositional architectures, provided the tasks share a latent hard feature. This provides theoretical insight into the description-length efficiency of compositional multi-output architectures and clarifies how neural networks can retain expressivity under geometric constraints.
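
The orthogonality of Walsh-type readouts mentioned above can be checked numerically in a small sketch. The code below is an illustration only: it uses the classical Rademacher and Walsh functions on [0, 1) in Paley ordering, which is an assumption, since the abstract does not specify the exact Sawtooth-Walsh readouts or their normalization. The function names `rademacher`, `walsh`, and `inner_product` are hypothetical and introduced here for the example.

```python
def rademacher(k, x):
    """k-th Rademacher function on [0, 1): +1 or -1 depending on the k-th binary digit of x."""
    return 1 - 2 * (int(x * 2**k) % 2)


def walsh(n, x):
    """Walsh function w_n (Paley ordering, assumed here): product of the
    Rademacher functions r_{j+1} over the set bits j of n."""
    value = 1
    k = 1
    while n:
        if n & 1:
            value *= rademacher(k, x)
        n >>= 1
        k += 1
    return value


def inner_product(m, n, levels=4):
    """L^2[0, 1) inner product of w_m and w_n for m, n < 2**levels.

    Both functions are constant on dyadic cells of length 2**-levels,
    so averaging over cell midpoints gives the integral exactly."""
    cells = 2**levels
    total = sum(
        walsh(m, (i + 0.5) / cells) * walsh(n, (i + 0.5) / cells)
        for i in range(cells)
    )
    return total / cells


# Gram matrix of the first eight Walsh functions: orthonormality means identity.
gram = [[inner_product(m, n) for n in range(8)] for m in range(8)]
for row in gram:
    print(row)
```

The identity Gram matrix follows from the fact that the product of two Walsh functions is again a Walsh function (indexed by the bitwise XOR of the two indices), and every Walsh function other than the constant one integrates to zero. How the paper's readouts combine this structure with a sawtooth map is not stated in the abstract and is not modeled here.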
