Wrist pathology recognition from radiographs is challenging because normal appearance varies markedly with age and sex. In pediatric imaging, evolving carpal ossification and open growth plates can resemble fractures, while sex-dependent timing of physeal closure shifts key visual cues. Image-only models therefore risk mistaking developmental anatomy for pathology, especially on small medical datasets. We address this by framing wrist diagnosis as a fine-grained visual recognition (FGVR) task and introducing a multimodal transformer that fuses X-rays with demographic metadata (age and sex). To our knowledge, this is the first study to integrate demographic metadata for wrist pathology recognition, and we further show that fine-grained pretraining transfers better than coarse ImageNet initialization. Our approach improves accuracy by 2% on a small curated dataset and by over 10% on a larger fracture dataset.
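As a minimal sketch of the kind of image–metadata fusion described above (the exact architecture is not specified here, so all dimensions, weights, and the additive-fusion choice are illustrative assumptions, with NumPy standing in for a deep-learning framework): a small MLP projects age and sex into the image feature space, the result is added to pooled backbone features, and a linear head produces class logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, projecting metadata into the image feature space."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Illustrative dimensions (not from the paper)
img_dim, meta_dim, hidden, num_classes, batch = 768, 2, 64, 4, 8

# Random weights stand in for trained parameters
w1 = rng.standard_normal((meta_dim, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, img_dim)) * 0.1
b2 = np.zeros(img_dim)
wc = rng.standard_normal((img_dim, num_classes)) * 0.1
bc = np.zeros(num_classes)

# Pooled image features from a (hypothetical) transformer backbone
img_feat = rng.standard_normal((batch, img_dim))

# Demographic metadata: age normalized to [0, 1], sex encoded as 0/1
meta = np.stack(
    [rng.uniform(0.0, 1.0, batch),
     rng.integers(0, 2, batch).astype(float)],
    axis=-1,
)

# Additive fusion of metadata embedding with image features, then classification
fused = img_feat + mlp(meta, w1, b1, w2, b2)
logits = fused @ wc + bc
print(logits.shape)  # (8, 4): one logit vector per image over 4 classes
```

Other fusion strategies (concatenation before the head, or metadata as an extra transformer token) are equally plausible readings of "fuses X-rays with demographic metadata"; additive fusion is shown only because it keeps the sketch short.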