In emerging applications such as wearable health monitors and traffic monitoring systems, Internet-of-Things (IoT) devices generate or collect large volumes of multi-label data, in which each instance is associated with a set of labels. Noisy, redundant, or irrelevant features in these datasets, together with the curse of dimensionality, pose challenges for multi-label classifiers. Feature selection (FS) is an effective strategy for improving classifier performance and addressing these challenges. However, no distributed multi-label FS method in the literature is suited to the distributed multi-label datasets found in IoT environments. This paper introduces FMLFS, the first federated multi-label feature selection method. Mutual information between features and labels serves as the relevance metric, while the correlation distance between features, derived from mutual information and joint entropy, serves as the redundancy measure. These metrics are aggregated on the edge server, the features are ranked using a Pareto-based bi-objective strategy with crowding distance, and the ranked features are sent back to the IoT devices. The proposed method is evaluated in two scenarios: 1) transmitting reduced-size datasets to the edge server for use by a centralized classifier, and 2) applying federated learning to the reduced-size datasets. Evaluation on three real-world datasets across three criteria (performance, time complexity, and communication cost) shows that FMLFS outperforms five comparable methods from the literature and provides a good trade-off among these criteria.
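The pipeline described above (local relevance and redundancy scores, aggregation on the edge server, Pareto ranking with crowding distance) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes discrete features and labels, assumes the correlation distance takes the form d(X, Y) = 1 - I(X; Y) / H(X, Y), and assumes the server aggregates by sample-weighted averaging. All function names (`local_scores`, `aggregate`, `rank_features`, and so on) are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy (bits) of a discrete sequence."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def correlation_distance(xs, ys):
    """Assumed form: 1 - I(X;Y) / H(X,Y); 0 means fully redundant."""
    h_joint = entropy(list(zip(xs, ys)))
    if h_joint == 0:  # both columns constant
        return 0.0
    return 1.0 - mutual_information(xs, ys) / h_joint

def local_scores(features, labels):
    """Per-device scores: (relevance, mean distance to other features).

    `features` and `labels` are lists of columns. Both objectives are
    maximised; a high mean distance means low redundancy.
    """
    scores = []
    for i, f in enumerate(features):
        relevance = sum(mutual_information(f, l) for l in labels)
        dists = [correlation_distance(f, g)
                 for j, g in enumerate(features) if j != i]
        diversity = sum(dists) / len(dists) if dists else 0.0
        scores.append((relevance, diversity))
    return scores

def aggregate(per_device_scores, sample_counts):
    """Assumed server step: sample-weighted average of device scores."""
    total = sum(sample_counts)
    n_features = len(per_device_scores[0])
    return [
        tuple(sum(w * s[i][k] for s, w in zip(per_device_scores, sample_counts)) / total
              for k in range(2))
        for i in range(n_features)
    ]

def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_fronts(points):
    """Split point indices into successive non-dominated fronts."""
    remaining, fronts = set(range(len(points))), []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(points[j], points[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(points, front):
    """Crowding distance within one front; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for k in range(len(points[0])):
        order = sorted(front, key=lambda i: points[i][k])
        lo, hi = points[order[0]][k], points[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (points[nxt][k] - points[prev][k]) / (hi - lo)
    return dist

def rank_features(points):
    """Order features by front, then by descending crowding distance."""
    ranking = []
    for front in pareto_fronts(points):
        dist = crowding_distance(points, front)
        ranking.extend(sorted(front, key=lambda i: -dist[i]))
    return ranking
```

In this sketch each device would call `local_scores` on its own data and send only the score tuples to the server, which calls `aggregate` and `rank_features` and returns the ranked feature indices. Raw samples never leave the device during feature selection.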