Salient object detection is a canonically data-limited task: expensive pixel-precise annotations force separate models to be trained for related subtasks such as dichotomous image segmentation (DIS) and high-resolution salient object detection (HR-SOD). We present a method that dramatically improves generalization through large-scale synthetic data generation and an ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created with our multi-modal diffusion pipeline, which extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We further propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity of salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve a 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.
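The abstract does not specify how the multi-mask decoder is supervised, but a common way to train a head that predicts several valid interpretations is a winner-takes-all (minimum-over-candidates) objective: each of the K predicted masks is scored against the ground truth, and only the best-matching one receives the loss, letting the heads specialize on different plausible segmentations. The sketch below is an illustrative assumption, not the paper's actual method; the function names and the choice of binary cross-entropy are hypothetical.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Per-pixel binary cross-entropy, averaged over the mask.
    pred = np.clip(pred, eps, 1 - eps)
    return float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))

def multi_mask_loss(candidates, target):
    """Winner-takes-all loss over K candidate masks (illustrative sketch).

    candidates: (K, H, W) predicted foreground probabilities, one per interpretation
    target:     (H, W) binary ground-truth mask
    Only the best-matching candidate contributes to the loss, so each decoder
    head is free to commit to a different valid interpretation of saliency.
    """
    losses = [bce(c, target) for c in candidates]
    best = int(np.argmin(losses))
    return losses[best], best

# Example: two candidate masks for an all-foreground target; the
# confident candidate (0.9 everywhere) wins the assignment.
target = np.ones((4, 4))
candidates = np.stack([np.full((4, 4), 0.9), np.full((4, 4), 0.1)])
loss, best = multi_mask_loss(candidates, target)
```

At inference time such a decoder typically also predicts a per-mask confidence score so the most likely interpretation can be selected, as in SAM-style multi-mask heads.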