In this paper, we present a transformer-based method to spatio-temporally associate apple fruitlets in stereo images collected on different days and from different camera poses. State-of-the-art association methods in agriculture are geared toward matching larger crops using either high-resolution point clouds or temporally stable features, both of which are difficult to obtain for smaller fruit in the field. To address these challenges, we propose a transformer-based architecture that encodes the shape and position of each fruitlet, then propagates and refines these features through a series of transformer encoder layers with alternating self- and cross-attention. We demonstrate that our method achieves an F1-score of 92.4% on data collected in a commercial apple orchard, outperforming all baselines and ablations.
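The alternating self- and cross-attention scheme mentioned above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual implementation: it uses single-head, unnormalized attention in plain numpy, and the feature dimension, layer count, and function names (`attention`, `encoder_pass`) are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, d):
    # Scaled dot-product attention: queries attend to keys_values.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def encoder_pass(feats_a, feats_b, n_layers=2):
    """Alternate self-attention (within one day's detections) and
    cross-attention (between the two days), with residual connections.
    A toy stand-in for the transformer encoder stack the abstract describes."""
    d = feats_a.shape[1]
    for _ in range(n_layers):
        # Self-attention: each fruitlet attends to others in the same image.
        feats_a = feats_a + attention(feats_a, feats_a, d)
        feats_b = feats_b + attention(feats_b, feats_b, d)
        # Cross-attention: each fruitlet attends to the other day's fruitlets.
        feats_a, feats_b = (feats_a + attention(feats_a, feats_b, d),
                            feats_b + attention(feats_b, feats_a, d))
    return feats_a, feats_b

# Toy per-fruitlet feature vectors (encoding shape and position) for two days.
rng = np.random.default_rng(0)
day1 = rng.normal(size=(5, 16))   # 5 fruitlets detected on day 1
day2 = rng.normal(size=(6, 16))   # 6 fruitlets detected on day 2
f1, f2 = encoder_pass(day1, day2)
similarity = f1 @ f2.T            # (5, 6) pairwise match scores
```

In a full pipeline, a similarity matrix like this would typically be turned into one-to-one fruitlet matches, e.g. via Hungarian assignment or a Sinkhorn-style optimal-transport layer.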