The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of examples of a particular concept or meaning. In medicine, a diverse set of training data on a particular disease can yield a model that accurately predicts that disease. Despite this potential, image-based diagnosis has not advanced significantly, owing to a lack of high-quality annotated data. This article highlights the importance of a data-centric approach to improving the quality of data representations, particularly when the available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data: data augmentation, transfer learning, federated learning, and generative adversarial networks (GANs). We also propose knowledge-guided GANs, which incorporate domain knowledge into the training data generation process. Given recent progress in large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can improve the effectiveness of knowledge-guided generative methods.