DExTeR: Weakly Semi-Supervised Object Detection with Class and Instance Experts for Medical Imaging

Detecting anatomical landmarks in medical imaging is essential for diagnosis and intervention guidance. However, object detection models rely on costly bounding box annotations, which limits scalability. Weakly Semi-Supervised Object Detection (WSSOD) with point annotations instead annotates each instance with a single point, minimizing annotation time while preserving localization signals. A Point-to-Box teacher model, trained on a small box-labeled subset, converts these point annotations into pseudo-box labels that are then used to train a student detector. Yet medical imagery presents unique challenges, including overlapping anatomy, variable object sizes, and elusive structures, all of which hinder accurate bounding box inference. To overcome these challenges, we introduce DExTeR (DETR with Experts), a transformer-based Point-to-Box regressor tailored for medical imaging. Built upon Point-DETR, DExTeR encodes single-point annotations as object queries and refines feature extraction with the proposed class-guided deformable attention, which guides attention sampling using point coordinates and class labels to capture class-specific characteristics. To improve discrimination in complex structures, it introduces CLICK-MoE (CLass, Instance, and Common Knowledge Mixture of Experts), which decouples class and instance representations to reduce confusion among adjacent or overlapping instances. Finally, we implement a multi-point training strategy that promotes prediction consistency across different point placements, improving robustness to annotation variability. DExTeR achieves state-of-the-art performance across three datasets spanning different medical domains (endoscopy, chest X-rays, and endoscopic ultrasound), highlighting its potential to reduce annotation costs while maintaining high detection accuracy.
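The point-to-pseudo-box step and the idea of consistency across point placements can be illustrated with a small sketch. This is not the DExTeR implementation: `toy_teacher` is a hypothetical stand-in that centers a fixed class-specific box prior on each point, uniform sampling inside the box is an assumed sampling scheme, and `consistency_spread` is an illustrative measure rather than the loss used in the paper.

```python
import random
from statistics import mean

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = tuple[float, float]


def sample_points_in_box(box: Box, k: int, rng: random.Random) -> list[Point]:
    """Draw k points uniformly inside a box (an assumed sampling scheme)."""
    x0, y0, x1, y1 = box
    return [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in range(k)]


def toy_teacher(point: Point, label: int,
                sizes: dict[int, tuple[float, float]]) -> Box:
    """Hypothetical stand-in for a Point-to-Box regressor.

    It centers a class-specific (width, height) prior on the point.
    A real teacher would regress the box from image features.
    """
    w, h = sizes[label]
    x, y = point
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def consistency_spread(boxes: list[Box]) -> float:
    """Mean absolute deviation of box coordinates from their per-coordinate mean.

    Zero means every point placement produced the same box.
    """
    centers = [mean(b[i] for b in boxes) for i in range(4)]
    return mean(abs(b[i] - centers[i]) for b in boxes for i in range(4))


rng = random.Random(0)
gt_box: Box = (10.0, 20.0, 50.0, 80.0)   # illustrative ground-truth box
sizes = {0: (40.0, 60.0)}                # illustrative size prior for class 0

# Several plausible annotator clicks for the same instance.
points = sample_points_in_box(gt_box, k=4, rng=rng)

# Each click is turned into a pseudo-box by the (toy) teacher.
pseudo_boxes = [toy_teacher(p, 0, sizes) for p in points]

# A teacher that is robust to point placement would drive this toward zero.
spread = consistency_spread(pseudo_boxes)
```

Because the toy teacher simply follows the click, its boxes shift with every point placement and the spread is nonzero. A teacher trained for multi-point consistency would aim to return nearly the same box regardless of where inside the instance the point falls.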
