Foundation models are deep learning models pre-trained on large amounts of data, capable of generalizing to multiple datasets and downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be used to pre-train foundation models for HEP. Specifically, we introduce the AspenOpenJets dataset, consisting of approximately 180M high-$p_T$ jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-$\alpha$ foundation model on AspenOpenJets improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton-proton collision data, we provide the ML-ready derived AspenOpenJets dataset for further public use.