Data Collaboration Analysis applied to Compound Datasets and the Introduction of Projection data to Non-IID settings

Given the time and expense associated with bringing a drug to market, numerous studies have used machine learning to predict the properties of compounds from their structure. Federated learning has been applied to compound datasets to increase prediction accuracy while safeguarding potentially proprietary information. However, federated learning suffers from low accuracy in non-independent and identically distributed (non-IID) settings, i.e., when the data partitions have a large label bias, and is therefore considered unsuitable for compound datasets, which tend to have a large label bias. To address this limitation, we applied an alternative distributed machine learning method, data collaboration analysis (DC), to chemical compound data from open sources. We also proposed data collaboration analysis using projection data (DCPd), an improved method that utilizes auxiliary PubChem data. The projection data improve the quality of the individual user-side data transformations used to create intermediate representations. We compared the classification performance of federated averaging (FedAvg), DC, and DCPd on five compound datasets, measured by the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC). In non-IID settings, performance ranked from highest to lowest in the order DCPd, DC, and FedAvg, whereas in independent and identically distributed (IID) settings the three methods performed almost identically. Moreover, compared with the other methods, DCPd showed only a negligible decline in classification accuracy across experiments with different degrees of label bias. Thus, DCPd can address low performance in non-IID settings, which is one of the challenges of federated learning.
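To make the notion of label bias concrete, the following is a minimal sketch of how a labeled dataset might be partitioned across users with a controllable degree of label skew. The abstract does not state how the non-IID partitions were generated, so the Dirichlet-based scheme here is an assumption borrowed from common federated learning simulations, and the names `dirichlet_label_split`, `label_counts`, and `alpha` are hypothetical.

```python
import numpy as np


def dirichlet_label_split(labels, n_clients, alpha, seed=0):
    """Partition sample indices across clients with label skew.

    Assumed scheme (not taken from the paper): for every class, the share
    of its samples given to each client is drawn from Dirichlet(alpha).
    Small alpha tends to give strong label bias (non-IID); large alpha
    tends to give near-uniform shares (close to IID).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut them by sampled shares.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        cut_points = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(part.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_indices]


def label_counts(labels, parts):
    """Return an (n_clients, 2) array of negative/positive counts (binary labels)."""
    labels = np.asarray(labels)
    return np.array([np.bincount(labels[p], minlength=2) for p in parts])


if __name__ == "__main__":
    # Toy binary labels standing in for an imbalanced compound dataset.
    toy_labels = np.array([0] * 80 + [1] * 20)
    for alpha in (0.1, 100.0):
        parts = dirichlet_label_split(toy_labels, n_clients=4, alpha=alpha)
        print(f"alpha={alpha}:")
        print(label_counts(toy_labels, parts))
```

Printing the per-user label counts for a small and a large `alpha` gives a quick way to inspect how skewed a simulated partition is before comparing methods on it.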
