SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking

Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but also by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai derived from verified administrative boundaries; together they cover 1,869 $\text{km}^2$. To evaluate the global robustness of our framework, we extend our experiments to five additional established benchmarks, bringing the total to eight cities across three continents, and provide comprehensive data quality assessments of all datasets. We also propose a new semi-supervised segmentation framework designed to mitigate the class imbalance and feature degradation inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism, which dynamically adjusts confidence thresholds to prevent minority class suppression, and a Prototype Bank System, which enforces semantic consistency by anchoring predictions to historically learned high-fidelity feature representations. Extensive experiments across all eight cities demonstrate that our approach outperforms state-of-the-art semi-supervised baselines. Most notably, our method shows strong domain transfer: a model trained on only 10% of source labels reaches 0.461 mIoU on unseen geographies, outperforming the zero-shot generalization of fully supervised models.
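To make the idea of class-aware thresholding concrete, the following is a minimal, dependency-free sketch of one common way such a mechanism can be built. It is an illustration under stated assumptions, not the framework's actual implementation: the function names, the exponential-moving-average update, the `base_tau` and `momentum` values, and the `IGNORE_INDEX` convention are all hypothetical. Each class keeps a running estimate of how confident the model is when it predicts that class, and the pseudo-label threshold for a class is scaled down in proportion, so that a harder minority class (for example, informal settlements) is not filtered out by a single global cutoff.

```python
from typing import List, Sequence

IGNORE_INDEX = 255  # hypothetical label for pixels excluded from the unsupervised loss


def _argmax(p: Sequence[float]) -> int:
    """Index of the most probable class for one pixel."""
    return max(range(len(p)), key=p.__getitem__)


def update_class_confidence(
    ema_conf: Sequence[float],
    probs: Sequence[Sequence[float]],
    momentum: float = 0.9,
) -> List[float]:
    """Update a per-class running mean of the confidence of predicted pixels.

    Classes that receive no predictions in this batch keep their old estimate.
    """
    sums = [0.0] * len(ema_conf)
    counts = [0] * len(ema_conf)
    for p in probs:
        c = _argmax(p)
        sums[c] += p[c]
        counts[c] += 1
    return [
        momentum * e + (1.0 - momentum) * (s / n) if n else e
        for e, s, n in zip(ema_conf, sums, counts)
    ]


def class_aware_thresholds(
    ema_conf: Sequence[float], base_tau: float = 0.95
) -> List[float]:
    """Scale the base threshold by each class's confidence relative to the best class."""
    top = max(ema_conf)
    return [base_tau * e / top for e in ema_conf]


def select_pseudo_labels(
    probs: Sequence[Sequence[float]], thresholds: Sequence[float]
) -> List[int]:
    """Keep a pixel's predicted class only if it clears that class's own threshold."""
    labels = []
    for p in probs:
        c = _argmax(p)
        labels.append(c if p[c] >= thresholds[c] else IGNORE_INDEX)
    return labels


# Four pixels, two classes (0 = formal, 1 = informal); values are made up.
probs = [[0.97, 0.03], [0.96, 0.04], [0.20, 0.80], [0.45, 0.55]]
thresholds = class_aware_thresholds([0.96, 0.72])  # about [0.95, 0.7125]
labels = select_pseudo_labels(probs, thresholds)
print(labels)  # [0, 0, 1, 255]
```

With a single global threshold of 0.95, the third pixel (predicted informal with confidence 0.80) would be discarded; with the per-class threshold it is retained as a pseudo-label, while the genuinely uncertain fourth pixel is still ignored.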
