The data needed to train machine learning (ML) models can reside at separate sites, often termed data silos. For data-intensive ML applications, data silos pose a major challenge: the integration and transformation of data demand substantial manual work and computational resources. Under data privacy and security constraints, data often cannot leave its local site, so a model has to be trained in a decentralized manner. In this work, we present a vision of how to bridge traditional data integration (DI) techniques and the requirements of modern machine learning. We explore how metadata obtained from data integration processes can be used to improve the effectiveness and efficiency of ML models. We analyze two common use cases over data silos: feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities in systems, intermediate representations, factorized learning, and federated learning.