We propose masked particle modeling (MPM), a self-supervised method for learning generic, transferable, and reusable representations of unordered sets of inputs, for use on high-energy physics (HEP) scientific data. This work provides a novel scheme for masked-modeling-based pre-training that learns permutation-invariant functions on sets. More generally, it is a step towards building large foundation models for HEP that can be generically pre-trained with self-supervised learning and later fine-tuned for a variety of downstream tasks. In MPM, particles in a set are masked and the training objective is to recover their identity, as defined by a discretized token representation from a pre-trained vector-quantized variational autoencoder. We study the efficacy of the method on samples of high-energy jets from collider physics experiments, including the impact of discretization, permutation invariance, and ordering. We also study the fine-tuning capability of the model, showing that it can be adapted to tasks such as supervised and weakly supervised jet classification, and that it transfers efficiently to new classes and new data domains with small fine-tuning data sets.
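The masked-token objective described in the abstract can be illustrated with a minimal sketch. It assumes each particle in a jet has already been mapped to an integer codebook index by a pre-trained tokenizer (the vector-quantized autoencoder itself is not implemented here), and that some backbone model produces one row of logits per particle. The function names `mask_particles` and `masked_token_loss`, the masking fraction, and the codebook size are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def mask_particles(n_particles, mask_fraction, rng):
    """Choose a random subset of particle positions to mask.

    Returns a set of masked indices. At least one particle is masked.
    """
    n_masked = max(1, round(mask_fraction * n_particles))
    return set(rng.sample(range(n_particles), n_masked))


def masked_token_loss(logits, tokens, masked):
    """Mean cross-entropy between predicted and true tokens, masked positions only.

    logits: list of per-particle score lists, one score per codebook entry
            (assumed to come from a backbone model, not implemented here).
    tokens: list of true codebook indices, one per particle.
    masked: set of particle indices that were hidden from the model.
    """
    total = 0.0
    for i in sorted(masked):
        row = logits[i]
        row_max = max(row)
        # Numerically stable log-sum-exp for the softmax normalizer.
        log_norm = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_norm - row[tokens[i]]
    return total / len(masked)


# Toy jet: 10 particles, hypothetical codebook of 16 tokens.
rng = random.Random(0)
codebook_size = 16
tokens = [rng.randrange(codebook_size) for _ in range(10)]
masked = mask_particles(len(tokens), mask_fraction=0.3, rng=rng)

# An uninformed model (all logits equal) pays log(codebook_size) per masked particle.
uniform_logits = [[0.0] * codebook_size for _ in tokens]
loss = masked_token_loss(uniform_logits, tokens, masked)
```

Because the loss is averaged over a set of masked indices rather than a sequence, it does not depend on the order in which the particles are listed, which matches the set-based framing of the abstract.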