In emerging applications such as health-monitoring wearables and traffic-monitoring systems, Internet-of-Things (IoT) devices generate or collect large volumes of multi-label data, in which each instance is associated with a set of labels. Noisy, redundant, or irrelevant features in these datasets, together with the curse of dimensionality, pose challenges for multi-label classifiers. Feature selection (FS) is an effective strategy for addressing these challenges and improving classifier performance. Yet the literature currently offers no distributed multi-label FS method suited to distributed multi-label datasets in IoT environments. This paper introduces FMLFS, the first federated multi-label feature selection method. In FMLFS, the mutual information between features and labels serves as the relevancy metric, while the correlation distance between features, derived from mutual information and joint entropy, serves as the redundancy measure. These metrics are aggregated on the edge server, where the features are sorted using a Pareto-based bi-objective strategy combined with crowding distance; the sorted features are then sent back to the IoT devices. The proposed method is evaluated in two scenarios: 1) transmitting reduced-size datasets to the edge server for use by a centralized classifier, and 2) applying federated learning to the reduced-size datasets. Evaluation on three real-world datasets across three metrics (performance, time complexity, and communication cost) shows that FMLFS outperforms five comparable methods from the literature and provides a good trade-off among these metrics.
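The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the FMLFS implementation, and several details are assumptions made only for illustration: features and labels are taken to be discrete, the correlation distance is taken to be 1 - I(X;Y)/H(X,Y), each device's scores are aggregated by a sample-weighted average, and all function names (`local_scores`, `aggregate`, `rank_features`, and so on) are hypothetical.

```python
import numpy as np

def entropy(*columns):
    """Shannon entropy (bits) of the joint distribution of discrete columns."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def correlation_distance(x, y):
    """Assumed form: 1 - I(X;Y)/H(X,Y); 0 means identical, 1 means independent."""
    h_xy = entropy(x, y)
    return 1.0 - mutual_information(x, y) / h_xy if h_xy > 0 else 0.0

def local_scores(X, Y):
    """Per-device scores, shape (n_features, 2); higher is better in both columns.
    Column 0: summed MI with every label. Column 1: mean distance to other features.
    Requires at least two features."""
    n = X.shape[1]
    relevance = [sum(mutual_information(X[:, f], Y[:, l]) for l in range(Y.shape[1]))
                 for f in range(n)]
    distance = [np.mean([correlation_distance(X[:, f], X[:, g])
                         for g in range(n) if g != f]) for f in range(n)]
    return np.column_stack([relevance, distance])

def aggregate(score_list, n_samples):
    """Edge-server step (assumed): sample-weighted average of the device scores."""
    return np.average(np.stack(score_list), axis=0, weights=n_samples)

def pareto_fronts(scores):
    """Split feature indices into successive non-dominated fronts."""
    remaining, fronts = list(range(len(scores))), []
    while remaining:
        front = [i for i in remaining
                 if not any(np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
                            for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distance(scores, front):
    """Crowding distance of each member of one front; boundary points get infinity."""
    dist = np.zeros(len(front))
    for m in range(scores.shape[1]):
        vals = scores[front, m]
        order = np.argsort(vals)
        span = vals[order[-1]] - vals[order[0]]
        dist[order[0]] = dist[order[-1]] = np.inf
        if span > 0:
            for k in range(1, len(front) - 1):
                dist[order[k]] += (vals[order[k + 1]] - vals[order[k - 1]]) / span
    return dist

def rank_features(scores):
    """Sort features front by front, most isolated (largest crowding distance) first."""
    ranking = []
    for front in pareto_fronts(scores):
        dist = crowding_distance(scores, front)
        ranking.extend(front[k] for k in np.argsort(-dist, kind="stable"))
    return ranking
```

In this sketch each device would call `local_scores` on its own data and upload only the resulting score matrix; the edge server would then call `aggregate` followed by `rank_features` and return the ordering, so raw samples never leave the device. How FMLFS actually combines the per-device metrics is not specified in the abstract, so the weighted average here is a placeholder.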