Bidirectional language models capture richer contextual information and consistently outperform unidirectional models on natural language understanding tasks, yet the theoretical reasons behind this advantage remain unclear. In this work, we investigate this disparity through the lens of the Information Bottleneck (IB) principle, which formalizes a trade-off between compressing input information and preserving task-relevant content. We propose FlowNIB, a dynamic and scalable method for estimating mutual information during training that addresses key limitations of classical IB approaches, including computational intractability and fixed trade-off schedules. Theoretically, we show that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To support this, we present a generalized framework for measuring representational complexity and prove that bidirectional representations are strictly more informative under mild conditions. We further validate our findings through extensive experiments across multiple models and tasks using FlowNIB, revealing how information is encoded and compressed throughout training. Together, our work provides a principled explanation for the effectiveness of bidirectional architectures and introduces a practical tool for analyzing information flow in deep language models.
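For reference, the trade-off invoked here is the one formalized by the classical IB objective; the expression below is standard background rather than a statement of the FlowNIB estimator itself, with $X$ denoting the input, $Y$ the target, $Z$ the learned representation, and $\beta$ the trade-off coefficient.

\[
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
\]

Larger $\beta$ emphasizes preserving task-relevant information $I(Z;Y)$, while smaller $\beta$ emphasizes compressing the input via $I(X;Z)$; fixing $\beta$ in advance is the rigid trade-off schedule, and estimating these mutual information terms is the computational bottleneck, that the abstract identifies as limitations FlowNIB is designed to address.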