Recognizing pediatric wrist pathologies from radiographs is challenging because normal anatomy changes rapidly with development: evolving carpal ossification and open physes can resemble pathology, and maturation timing differs by sex. Image-only models trained on limited medical datasets therefore risk confusing normal developmental variation with true pathology. We address this by framing pediatric wrist diagnosis as a fine-grained visual recognition (FGVR) problem and proposing a demographic-aware hybrid convolution--transformer model that fuses X-rays with patient age and sex. To leverage demographic context while avoiding shortcut reliance, we introduce progressive metadata masking during training. We evaluate on a curated dataset that mirrors the typical constraints of real-world medical studies. The hybrid FGVR backbone outperforms traditional and modern CNNs, and demographic fusion yields additional gains. Finally, we show that initializing from a fine-grained pretraining source improves transfer relative to standard ImageNet initialization, suggesting that label granularity, even from non-medical data, can be a key driver of generalization for subtle radiographic findings.
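The progressive metadata masking described above can be illustrated with a minimal sketch. Assuming the mechanism concatenates image features with a demographic embedding and, during training, zeroes the embedding for a randomly chosen subset of samples with a probability that ramps up over epochs (the function names, the linear schedule, and the dimensions below are illustrative, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_with_masking(img_feat, meta_emb, mask_prob, training=True):
    """Concatenate image features with a demographic (age/sex) embedding.

    During training, zero the embedding for a random subset of samples so
    the classifier cannot rely on metadata as a shortcut (assumed mechanism).
    """
    if training and mask_prob > 0:
        # keep[i] == 0 drops the demographic signal for sample i
        keep = (rng.random((meta_emb.shape[0], 1)) >= mask_prob)
        meta_emb = meta_emb * keep.astype(meta_emb.dtype)
    return np.concatenate([img_feat, meta_emb], axis=1)

def mask_schedule(epoch, total_epochs, max_prob=0.5):
    """Illustrative linear ramp of the masking probability over training."""
    return max_prob * min(1.0, epoch / max(1, total_epochs - 1))
```

Masking whole samples (rather than individual metadata fields) forces the network to retain a purely image-based pathway, so demographic context can refine but never replace the radiographic evidence.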