Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task whose data is more similar to a model's pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token-, and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but on other benchmarks we find, surprisingly, that similarity metrics are correlated neither with accuracy nor with each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
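To make the kind of test described above concrete, the following is a minimal sketch, not the method used in this work. It assumes a token-based distributional measure (Jensen-Shannon divergence between whitespace-token unigram distributions, chosen here only for illustration) and a Spearman rank correlation between per-task similarity and per-task accuracy. The function names `unigram_dist`, `js_divergence`, and `spearman` are hypothetical helpers introduced for this sketch.

```python
import math
from collections import Counter


def unigram_dist(texts):
    """Unigram distribution over lowercased whitespace tokens (an assumed tokenizer)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # KL(a || m); m[t] > 0 whenever a[t] > 0, so the log is defined.
        return sum(v * math.log2(v / m[t]) for t, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def _ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sx * sy)
```

Under these assumptions, one would compute `1 - js_divergence(pretrain_dist, task_dist)` for each downstream task against a sample of the pretraining corpus, then call `spearman(similarities, accuracies)` across tasks. A coefficient near zero would correspond to the "not correlated" outcome reported above for the non-multilingual benchmarks.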