Self-supervision has been widely explored as a means of compensating for the lack of inductive biases in vision transformer architectures, which limits generalisation when networks are trained on small datasets. This is crucial in cortical imaging, where phenotypes are complex and heterogeneous but the available datasets are limited in size. This paper builds on recent advances in translating vision transformers to surface meshes and investigates the potential of Masked AutoEncoder (MAE) self-supervision for cortical surface learning. By reconstructing surface data from a masked version of the input, the proposed method models cortical structure and learns strong representations that translate to improved performance in downstream tasks. We evaluate our approach on cortical phenotype regression using the developing Human Connectome Project (dHCP) and demonstrate that pre-training leads to a 26\% improvement in performance and 80\% faster convergence compared to models trained from scratch. Furthermore, we establish that pre-training vision transformer models on large datasets, such as the UK Biobank (UKB), enables the acquisition of robust representations for finetuning in low-data scenarios. Our code and pre-trained models are publicly available at \url{https://github.com/metrics-lab/surface-vision-transformers}.