Chest X-rays (CXRs) are a medical imaging modality used to infer a large number of abnormalities. While it is hard to define an exhaustive list of these abnormalities, which may co-occur on a single chest X-ray, a few of them are commonly observed and are abundantly represented in the CXR datasets used to train deep learning models for automated inference. However, it remains challenging for current models to learn independent discriminatory features for labels that are rare but may be of high significance. Prior works address the combination of the multi-label and long-tail problems by introducing novel loss functions or some mechanism for re-sampling or re-weighting the data. Instead, we propose that significant performance gains can be achieved merely by choosing a model initialization that is closer to the domain of the target dataset. This method can complement the techniques proposed in the existing literature and can easily be scaled to new labels. Finally, we also examine the veracity of synthetically generated data used to augment the tail labels and analyse its contribution to improving model performance.