Distribution shift is a common situation in machine learning, arising when the data used to train a model differ from the data the model encounters in the real world. The issue appears across many technical settings: standard prediction tasks, time-series forecasting, and more recent applications of large language models (LLMs). The mismatch can degrade performance and can stem from several factors: sampling issues and non-representative data, changes in the environment or in policies, or the emergence of previously unseen scenarios. This brief addresses the definition and detection of distribution shifts in educational settings, focusing on standard prediction problems, where the task is to learn a model that takes a set of inputs (predictors) $X=(x_1,x_2,\dots,x_m)$ and produces an output $Y=f(X)$.
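To make the detection task concrete, one simple approach is to compare the marginal distribution of each predictor between the training data and the deployment data, for example with a two-sample Kolmogorov-Smirnov test. The sketch below is illustrative only: the function name, the significance level, and the simulated data are assumptions, not part of the brief.

```python
# Minimal sketch: per-feature covariate-shift detection via the
# two-sample Kolmogorov-Smirnov test (illustrative, not the brief's method).
import numpy as np
from scipy.stats import ks_2samp

def detect_shift(X_train, X_deploy, alpha=0.01):
    """Return indices of features whose marginal distribution differs
    between training and deployment data (KS test at level alpha)."""
    shifted = []
    for j in range(X_train.shape[1]):
        _stat, p_value = ks_2samp(X_train[:, j], X_deploy[:, j])
        if p_value < alpha:
            shifted.append(j)
    return shifted

# Simulated example: inject a mean shift into feature 1 only.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 3))
X_deploy = rng.normal(0.0, 1.0, size=(1000, 3))
X_deploy[:, 1] += 1.5

print(detect_shift(X_train, X_deploy))
```

Testing each feature separately only detects shifts in marginal distributions; shifts in the joint distribution (e.g., changed correlations between predictors) require multivariate methods such as classifier-based two-sample tests.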