Prior-data fitted networks (PFNs) were recently proposed as a new paradigm for machine learning. Instead of fitting a network to an observed training set, a fixed model is pre-trained offline on small, simulated training sets from a variety of tasks. The pre-trained model is then used to infer class probabilities in-context on fresh training sets of arbitrary size and distribution. Empirically, PFNs achieve state-of-the-art performance on tasks similar in size to those used in pre-training. Surprisingly, their accuracy improves further when they are passed larger data sets during inference. This article establishes a theoretical foundation for PFNs and illuminates the statistical mechanisms governing their behavior. While PFNs are motivated by Bayesian ideas, a purely frequentist interpretation of PFNs as pre-tuned but untrained predictors explains their behavior. A predictor's variance vanishes if its sensitivity to individual training samples does, and its bias vanishes only if it is appropriately localized around the test feature. The transformer architecture used in current PFN implementations ensures only the former. These findings should prove useful for designing architectures with favorable empirical behavior.
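The distinction between vanishing variance and vanishing bias can be illustrated with a toy simulation. The sketch below is not a PFN and is not taken from the article: the class-probability function `true_prob`, the two predictors, and the window width `n ** (-1/3)` are all assumptions chosen for illustration. Both predictors become insensitive to any single training sample as `n` grows, but only the second is localized around the test feature `x0`.

```python
import random
import statistics

def true_prob(x):
    """Hypothetical class-1 probability P(Y=1 | X=x), chosen for illustration."""
    return x

def sample_data(n, rng):
    """Draw n features uniformly on [0, 1] and binary labels from true_prob."""
    xs = [rng.random() for _ in range(n)]
    ys = [1 if rng.random() < true_prob(x) else 0 for x in xs]
    return xs, ys

def global_average(xs, ys, x0):
    """Weights all samples equally: low sensitivity, but ignores x0 entirely."""
    return sum(ys) / len(ys)

def local_average(xs, ys, x0):
    """Averages labels in a window around x0 whose width shrinks with n."""
    h = len(xs) ** (-1 / 3)  # assumed bandwidth schedule
    near = [y for x, y in zip(xs, ys) if abs(x - x0) <= h]
    return sum(near) / len(near) if near else 0.5  # fallback for an empty window

def bias_and_variance(predictor, n, x0, reps=200, seed=0):
    """Monte Carlo estimate of bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    preds = [predictor(*sample_data(n, rng), x0) for _ in range(reps)]
    bias = statistics.fmean(preds) - true_prob(x0)
    return bias, statistics.pvariance(preds)

if __name__ == "__main__":
    for n in (50, 2000):
        for predictor in (global_average, local_average):
            bias, var = bias_and_variance(predictor, n, x0=0.9)
            print(f"n={n:5d} {predictor.__name__:15s} bias={bias:+.3f} var={var:.5f}")
```

In this setup the variance of both predictors shrinks as `n` increases, while the bias of `global_average` stays close to -0.4 (its expectation is 0.5 regardless of `x0`, against a true probability of 0.9). Only the localized predictor drives both terms toward zero.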