Mixture models arise in many regression problems, but most methods have seen limited adoption, partly because the algorithms are highly tailored and model-specific. Transformers, on the other hand, are flexible neural sequence models that present the intriguing possibility of providing general-purpose prediction methods, even in this mixture setting. In this work, we investigate the hypothesis that transformers can learn an optimal predictor for mixtures of regressions. We construct a generative process for a mixture of linear regressions for which the decision-theoretic optimal procedure is given by data-driven exponential weights on a finite set of parameters. We observe that transformers achieve low mean-squared error on data generated via this process. By probing the transformer's output at inference time, we also show that transformers typically make predictions that are close to those of the optimal predictor. Our experiments further demonstrate that transformers can learn mixtures of regressions in a sample-efficient fashion and are somewhat robust to distribution shifts. We complement these experimental observations with a constructive proof that the decision-theoretic optimal procedure is indeed implementable by a transformer.
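To make the phrase "data-driven exponential weights on a finite set of parameters" concrete, the following is a minimal sketch of one such predictor, not the construction used in this work. It assumes Gaussian noise with known standard deviation `sigma` and a uniform prior over the candidate parameter vectors; the names `exponential_weights_predict`, `components`, and `sigma` are hypothetical and chosen only for illustration. Under these assumptions the weights are the posterior probabilities of each candidate, and the weighted prediction is the posterior mean.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def exponential_weights_predict(xs, ys, x_new, components, sigma=1.0):
    """Predict y at x_new by exponentially weighting candidate parameters.

    Assumptions (not taken from the source): y = <w, x> + Gaussian noise
    with standard deviation sigma, and a uniform prior over `components`.
    """
    log_weights = []
    for w in components:
        # Sum of squared residuals of candidate w on the observed context.
        sq_err = sum((y - dot(w, x)) ** 2 for x, y in zip(xs, ys))
        log_weights.append(-sq_err / (2.0 * sigma ** 2))

    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_weights)
    unnormalized = [math.exp(lw - shift) for lw in log_weights]
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]

    # Weighted average of each candidate's prediction at x_new.
    prediction = sum(p * dot(w, x_new) for p, w in zip(weights, components))
    return prediction, weights
```

With an empty context the weights stay uniform and the prediction is the plain average over candidates; as more pairs `(x, y)` are observed, the weight concentrates on the candidates that explain the data best.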