Genomic selection (GS) is a critical crop breeding strategy that plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently rely on statistical methods for prediction. However, statistical methods often suffer from two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers via deep learning. However, since crop datasets are typically long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of the attention mechanism for the task of interest, we propose a simple yet effective Transformer-based framework that enables end-to-end training on whole sequences. Through experiments on the rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, Transformers can achieve overall superior performance against seminal methods on GS tasks of interest.
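The two "simple tricks" named above can be illustrated concretely. The abstract does not specify the exact tokenization scheme or masking rate used in the framework, so the sketch below is an assumption: non-overlapping k-mers over a genotype string, with each token independently replaced by a mask symbol at a hypothetical rate `p` (the `[MASK]` token name and `p=0.15` default are illustrative, borrowed from common masked-language-model practice, not taken from the paper).

```python
import random


def kmer_tokenize(seq, k=3, stride=3):
    """Split a marker/nucleotide sequence into k-mer tokens.

    With stride == k the k-mers are non-overlapping, which shortens
    the sequence by a factor of k before it reaches the Transformer.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def random_mask(tokens, mask_token="[MASK]", p=0.15, rng=None):
    """Independently replace each token with a mask symbol with probability p.

    A hypothetical stand-in for the random masking trick; the actual
    masking rate and token vocabulary in the paper may differ.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    return [mask_token if rng.random() < p else t for t in tokens]


tokens = kmer_tokenize("ACGTACGTACGT", k=3)
# -> ['ACG', 'TAC', 'GTA', 'CGT']
masked = random_mask(tokens)
```

Non-overlapping k-mers trade single-marker resolution for a k-fold shorter input, which directly eases the quadratic cost of attention on long crop sequences; random masking acts as a regularizer against overfitting on small sample sizes.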