Large datasets in machine learning often contain missing values, which must be imputed. In this work, we are motivated by network traffic classification, where traditional data imputation methods do not perform well. We observe that no existing method directly accounts for classification accuracy during data imputation. Therefore, we propose a joint data imputation and data classification method, termed generative adversarial classification network (GACN), whose architecture contains a generator network, a discriminator network, and a classification network, which are iteratively optimized toward the ultimate objective of classification accuracy. For the scenario where some data samples are unlabeled, we further propose an extension termed semi-supervised GACN (SS-GACN), which can use partially labeled data to improve classification accuracy. We conduct experiments with real-world network traffic data traces, which demonstrate that GACN and SS-GACN more accurately impute the data features that are more important for classification, and that they outperform existing methods in terms of classification accuracy.
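The interplay of the generator, discriminator, and classification networks can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the loss functions, so the mask-based combination step, the adversarial term, the weight `alpha`, and all function names below are assumptions borrowed from common GAN-based imputation practice. The network outputs are passed in as plain lists of probabilities rather than computed by real networks.

```python
import math

EPS = 1e-12  # guards against log(0)


def combine(x, mask, generated):
    """Keep observed entries (mask == 1); fill missing ones (mask == 0) from the generator."""
    return [xi if m == 1 else gi for xi, m, gi in zip(x, mask, generated)]


def adversarial_term(d_probs, mask):
    """Mean of -log D over imputed entries (assumed form).

    d_probs[i] is the discriminator's probability that entry i was observed.
    The term is small when imputed entries are mistaken for observed ones.
    """
    missing = [dp for dp, m in zip(d_probs, mask) if m == 0]
    if not missing:
        return 0.0
    return -sum(math.log(max(dp, EPS)) for dp in missing) / len(missing)


def classification_term(class_probs, label):
    """Cross-entropy of the classifier's output on the imputed sample."""
    return -math.log(max(class_probs[label], EPS))


def generator_loss(d_probs, mask, class_probs, label, alpha=1.0):
    """Hypothetical joint objective: adversarial term plus alpha times the classification term."""
    return adversarial_term(d_probs, mask) + alpha * classification_term(class_probs, label)


# Toy sample: the second feature is missing and gets the generator's value.
x = [1.0, float("nan"), 3.0]
mask = [1, 0, 1]
generated = [9.0, 2.0, 9.0]
x_hat = combine(x, mask, generated)
loss = generator_loss([0.9, 0.5, 0.9], mask, [0.25, 0.75], label=1)
```

In an actual training loop, these terms would be computed from the outputs of the three networks and minimized in alternation. How GACN schedules those updates is not stated in this abstract.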