A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models that explain the training data generalize best: a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best promote simplicity indirectly through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning: an emergent ability of certain sequence models, such as Transformers, to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
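The claimed equivalence can be made concrete with a minimal sketch: in prequential coding, each observation is encoded under a model fit only on the preceding prefix, so the total codelength in bits equals the cumulative next-token log loss. The estimator below is an illustrative stand-in (a Laplace-smoothed Bernoulli predictor, not the paper's Transformer learner):

```python
import math

def prequential_codelength(sequence):
    """Prequential coding: encode each binary symbol with a model fit
    only on the preceding prefix. The total codelength (in bits) is the
    cumulative next-token log loss -- the quantity an in-context learner
    is trained to minimize. The predictor here is a Laplace-smoothed
    Bernoulli estimator, used purely for illustration."""
    ones = 0
    bits = 0.0
    for t, x in enumerate(sequence):
        p_one = (ones + 1) / (t + 2)   # add-one estimate from the prefix
        p = p_one if x == 1 else 1.0 - p_one
        bits += -math.log2(p)          # ideal codelength of x under the prefix model
        ones += x
    return bits
```

A sequence that a simple model explains well (e.g. all ones) compresses to far fewer bits than an incompressible one, reflecting the Occam's-razor trade-off between fit and model complexity that the abstract describes.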