Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification

Self-supervised masked modeling, which masks and reconstructs raw bytes, shows promise for encrypted traffic classification. Yet recent work reveals that these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen-encoder evaluation, accuracy drops from above 0.9 to below 0.47. We argue that the root cause is an inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability (random fields such as ip.id are unlearnable yet are treated as reconstruction targets); 2) embedding confusion (semantically distinct fields collapse into a single embedding space); and 3) metadata loss (capture-time metadata essential for temporal analysis is discarded). To address these issues, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses reconstruction on learnable FSUs, FSU-specific embeddings that preserve field boundaries, and dual-axis attention that captures both intra-packet and temporal patterns. FlowSem-MAE significantly outperforms state-of-the-art methods across datasets. With only half of the labeled data, it outperforms most existing methods trained on the full data.
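
To make the idea of predictability-guided filtering concrete, the following is a minimal sketch, not the FlowSem-MAE implementation. It assumes that a flow is already parsed into a table with one row per packet and one column per field, and it uses normalized empirical entropy as a stand-in predictability score; the scoring rule, the `threshold` value, the helper names, and the toy packet values are all illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Empirical Shannon entropy of `values`, divided by log(n) so it lies in [0, 1].

    A score near 1 means almost every observation is distinct (hard to predict);
    a score near 0 means the field is nearly constant.
    """
    n = len(values)
    if n <= 1:
        return 0.0
    counts = Counter(values)
    entropy = sum(-(c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


def select_learnable_fields(packets, threshold=0.9):
    """Keep fields whose normalized entropy is below `threshold`.

    `packets` is a list of dicts mapping field name -> value (one dict per packet).
    The entropy criterion and the 0.9 cutoff are assumptions made for this sketch.
    """
    fields = list(packets[0].keys())
    scores = {f: normalized_entropy([p[f] for p in packets]) for f in fields}
    keep = [f for f in fields if scores[f] < threshold]
    return keep, scores


# Toy flow: ip.id differs on every packet, the other fields are structured.
packets = [
    {"ip.id": 6699, "ip.ttl": 64, "ip.len": 60, "tcp.dstport": 443},
    {"ip.id": 41250, "ip.ttl": 64, "ip.len": 1500, "tcp.dstport": 443},
    {"ip.id": 1187, "ip.ttl": 64, "ip.len": 60, "tcp.dstport": 443},
    {"ip.id": 53012, "ip.ttl": 64, "ip.len": 1500, "tcp.dstport": 443},
    {"ip.id": 29874, "ip.ttl": 64, "ip.len": 60, "tcp.dstport": 443},
    {"ip.id": 902, "ip.ttl": 64, "ip.len": 1500, "tcp.dstport": 443},
    {"ip.id": 60311, "ip.ttl": 64, "ip.len": 60, "tcp.dstport": 443},
    {"ip.id": 17455, "ip.ttl": 64, "ip.len": 1500, "tcp.dstport": 443},
]

keep, scores = select_learnable_fields(packets)
print(keep)  # fields retained as reconstruction targets
```

In this toy flow, `ip.id` takes a different value on every packet and is filtered out, while the constant or low-cardinality fields are retained as reconstruction targets. A real pipeline would likely estimate predictability over many flows and could use a conditional measure (for example, how well a field is predicted from its context) instead of marginal entropy.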
