DBAT: Dynamic Backward Attention Transformer for Material Segmentation with Cross-Resolution Patches

Dense material segmentation aims to identify the material category of every pixel in an image. Recent studies extract material features from image patches. Although the resulting networks improve segmentation performance, these methods use a fixed patch resolution, which fails to account for the variation in the pixel area covered by each material. In this paper, we propose the Dynamic Backward Attention Transformer (DBAT) to aggregate cross-resolution features. Rather than fixing the patch resolution during training, the DBAT takes cropped image patches as input and gradually increases the patch resolution by merging adjacent patches at each transformer stage. We explicitly gather the intermediate features extracted from the cross-resolution patches and merge them dynamically with predicted attention masks. Experiments show that the DBAT achieves an accuracy of 86.85%, the best performance among state-of-the-art real-time models. Like other successful deep learning solutions with complex architectures, the DBAT suffers from a lack of interpretability. To address this problem, this paper examines the properties that the DBAT makes use of. By analysing the cross-resolution features and the attention weights, we interpret how the DBAT learns from image patches. We further perform network dissection, aligning features to semantic labels, and infer that the proposed model extracts material-related features better than other methods. We also show that the DBAT is more robust to network initialisation and yields less variable predictions than other models. The project code is available at https://github.com/heng-yuwen/Dynamic-Backward-Attention-Transformer.
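
The idea of merging cross-resolution features with predicted attention masks can be illustrated with a small sketch. This is not the authors' implementation (see the project repository for that); it is a simplified, single-channel example in pure Python. The function names `upsample_nearest`, `softmax`, and `merge_cross_resolution` are hypothetical, and so are the assumptions that each later stage halves the feature-map size and that the per-stage scores are supplied directly, where a real model would predict them with a learned layer.

```python
import math


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D grid (list of rows) by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge_cross_resolution(stage_feats, stage_logits):
    """Fuse per-stage feature maps using per-pixel attention weights.

    stage_feats:  list of 2-D grids; stage 0 is the finest (H x W) and each
                  later stage is assumed to be coarser by an integer factor.
    stage_logits: list of H x W grids, one per stage. These are hypothetical
                  scores standing in for the predicted attention masks.
    """
    height, width = len(stage_feats[0]), len(stage_feats[0][0])
    # Bring every stage to the finest resolution so pixels line up.
    aligned = [upsample_nearest(f, height // len(f)) for f in stage_feats]
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Normalise the scores across stages so the weights sum to 1 per pixel.
            weights = softmax([logit[i][j] for logit in stage_logits])
            fused[i][j] = sum(w * a[i][j] for w, a in zip(weights, aligned))
    return fused


# Toy input: a 4x4 fine-stage map and a 2x2 coarse-stage map.
feats = [
    [[1.0] * 4 for _ in range(4)],
    [[3.0] * 2 for _ in range(2)],
]
# Equal scores give every stage the same weight at every pixel.
logits = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
fused = merge_cross_resolution(feats, logits)
```

Because the weights are computed per pixel, regions covered by a large material area can lean on coarse-stage features while small regions lean on fine-stage ones, which is the motivation the abstract gives for not fixing the patch resolution.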
