Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting: making changes to the data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can change the magnitude of prediction error by as much as 82% in certain settings. Such prediction sensitivity to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, and for both closed-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values that happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed for prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs' impressive predictive capabilities, they currently lack even the basic level of robustness needed to serve as a principled data-fitting tool.
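As a minimal sketch of the kind of task-irrelevant perturbation described above (not the paper's actual experimental harness), the following Python snippet builds two in-context learning prompts that encode the same training examples and the same test point, differing only in variable names, and compares the numeric predictions an LLM returns. The helper query_llm is a hypothetical placeholder for whatever model endpoint is used.

```python
# Sketch: probe an LLM's sensitivity to task-irrelevant variable renaming.
# Assumption: `query_llm` is a hypothetical helper (not from the paper) that
# sends a prompt to some LLM endpoint and returns its text completion.

def query_llm(prompt: str) -> str:
    """Placeholder: swap in a real completion call for your chosen provider."""
    raise NotImplementedError("connect an LLM endpoint here")

def build_prompt(rows, test_row, feature_names, target_name):
    """Serialize tabular training examples into an in-context learning prompt."""
    lines = ["Predict the target value from the features.", ""]
    for features, target in rows:
        pairs = ", ".join(f"{n} = {v}" for n, v in zip(feature_names, features))
        lines.append(f"{pairs} -> {target_name} = {target}")
    pairs = ", ".join(f"{n} = {v}" for n, v in zip(feature_names, test_row))
    lines.append(f"{pairs} -> {target_name} =")
    return "\n".join(lines)

# Identical data under two namings: only the labels differ, not the learning task.
train = [((1.0, 2.0), 5.1), ((2.0, 0.5), 4.0), ((3.0, 1.5), 7.2)]
test_x = (2.5, 1.0)

prompt_a = build_prompt(train, test_x, ("x1", "x2"), "y")
prompt_b = build_prompt(train, test_x, ("temperature", "pressure"), "yield")

pred_a = float(query_llm(prompt_a))  # hypothetical call
pred_b = float(query_llm(prompt_b))  # hypothetical call
print(f"prediction gap under renaming: {abs(pred_a - pred_b):.3f}")
```

A robust data-fitting procedure would return essentially the same prediction for both prompts; the gap printed here is one simple way to quantify the sensitivity the abstract reports.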