Pre-trained models operating directly on raw bytes have achieved promising performance in encrypted network traffic classification (NTC), but they often suffer from shortcut learning: reliance on spurious correlations that fail to generalize to real-world data. Existing solutions rely heavily on model-specific interpretation techniques, which adapt poorly across model architectures and deployment scenarios. In this paper, we propose BiasSeeker, the first semi-automated framework that is both model-agnostic and data-driven for detecting dataset-specific shortcut features in encrypted traffic. By performing statistical correlation analysis directly on raw binary traffic, BiasSeeker identifies spurious or environment-entangled features that may compromise generalization, independent of any classifier. To address the diverse nature of shortcut features, we introduce a systematic categorization and apply category-specific validation strategies that reduce bias while preserving meaningful information. We evaluate BiasSeeker on 19 public datasets across three NTC tasks. By emphasizing context-aware feature selection and dataset-specific diagnosis, BiasSeeker offers a novel perspective for understanding and addressing shortcut learning in encrypted NTC, and it raises awareness that feature selection should be an intentional, scenario-sensitive step prior to model training.
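As a rough illustration of what classifier-independent correlation analysis on raw bytes could look like, the sketch below scores each byte offset by its mutual information with the class label. It is not BiasSeeker's actual procedure, which the abstract does not detail; the function name `byte_position_mutual_information`, the `max_len` parameter, the padding convention, and the toy flows are all assumptions made for illustration.

```python
import math
from collections import Counter


def byte_position_mutual_information(flows, labels, max_len=64):
    """Mutual information (in bits) between the byte at each offset and the label.

    Hypothetical helper, not part of BiasSeeker. `flows` is a list of `bytes`
    objects and `labels` is a parallel list of class labels.
    """
    n = len(flows)
    label_counts = Counter(labels)
    scores = []
    for pos in range(max_len):
        # Assumption: flows shorter than pos + 1 bytes get the placeholder -1.
        values = [f[pos] if pos < len(f) else -1 for f in flows]
        value_counts = Counter(values)
        joint_counts = Counter(zip(values, labels))
        mi = 0.0
        for (v, y), c in joint_counts.items():
            p_xy = c / n
            # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
            ratio = c * n / (value_counts[v] * label_counts[y])
            mi += p_xy * math.log2(ratio)
        scores.append(mi)
    return scores


# Toy data: offset 0 mimics an environment-specific field (for example an
# address byte) that perfectly separates the classes; offsets 1 and 2 do not.
toy_flows = [bytes([10, 0, 1]), bytes([10, 0, 2]),
             bytes([20, 0, 1]), bytes([20, 0, 2])]
toy_labels = ["app_a", "app_a", "app_b", "app_b"]
toy_scores = byte_position_mutual_information(toy_flows, toy_labels, max_len=3)
```

In this toy case, offset 0 receives a score equal to the label entropy (1 bit), while the constant and label-independent offsets score 0. A high score only flags an offset as a candidate shortcut; whether it is spurious or genuinely informative still requires the kind of category-specific validation the abstract describes.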