Machine learning communities, such as those around computer vision and natural language processing, have developed numerous supporting tools and benchmark datasets that accelerate development. In contrast, the network traffic classification field lacks standard benchmark datasets for most tasks, and the available supporting software is limited in scope. This paper addresses that gap by introducing DataZoo, a toolset designed to streamline dataset management in network traffic classification and to reduce the room for mistakes in the evaluation setup. DataZoo provides a standardized API for accessing three extensive datasets: CESNET-QUIC22, CESNET-TLS22, and CESNET-TLS-Year22. It also includes methods for feature scaling and for realistic dataset partitioning that takes temporal and service-related factors into account. The DataZoo toolset simplifies the creation of realistic evaluation scenarios, making it easier to cross-compare classification methods and reproduce results.
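To make the idea of temporally aware partitioning and feature scaling concrete, the following is a minimal sketch in plain Python. It does not use or reproduce the DataZoo API; the toy flow records, the `temporal_split`, `fit_standard_scaler`, and `scale` helpers, and the cutoff date are all hypothetical and chosen only for illustration. The sketch splits flows by capture date so that the test set lies strictly after the training set, and it fits the scaler on the training portion only, which avoids leaking test-period statistics into training.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical toy flows: (capture date, bytes transferred, service label).
flows = [
    (date(2022, 1, 3), 1200.0, "video"),
    (date(2022, 1, 4), 300.0, "web"),
    (date(2022, 1, 5), 900.0, "video"),
    (date(2022, 1, 10), 500.0, "web"),
    (date(2022, 1, 11), 1500.0, "video"),
]


def temporal_split(rows, cutoff):
    """Put rows captured before `cutoff` in train, the rest in test."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test


def fit_standard_scaler(values):
    """Return (mean, std) of the given values; std falls back to 1.0 if zero."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0
    return mu, sigma


def scale(values, mu, sigma):
    """Standardize values using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


train, test = temporal_split(flows, cutoff=date(2022, 1, 8))

# Fit the scaler on training data only, then apply it to both partitions.
mu, sigma = fit_standard_scaler([r[1] for r in train])
train_x = scale([r[1] for r in train], mu, sigma)
test_x = scale([r[1] for r in test], mu, sigma)
```

A random split of the same records would mix later traffic into the training set, which tends to give optimistic results when traffic characteristics drift over time; a date-based cutoff is one simple way to avoid that.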