Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

Network traffic, as a key media format, is crucial for ensuring security and communications in modern internet infrastructure. While existing methods offer excellent performance, they face two key bottlenecks: (1) they fail to capture multidimensional semantics beyond unimodal sequence patterns; (2) they are black boxes that output only category labels, without an auditable reasoning process. We identify a key underlying factor: existing network traffic datasets are primarily designed for classification and inherently lack rich semantic annotations, so they cannot support the generation of human-readable evidence reports. To address this data scarcity, this paper proposes, for the first time, a Byte-Grounded Traffic Description (BGTD) benchmark that combines raw bytes with structured expert annotations. BGTD provides the behavioral features and verifiable chains of evidence needed for multimodal reasoning towards explainable encrypted traffic interpretation. Built upon BGTD, this paper proposes mmTraffic, an end-to-end traffic-language representation framework: a multimodal reasoning architecture that bridges physical traffic encoding and semantic interpretation. To alleviate modality interference and generative hallucinations, mmTraffic adopts a jointly optimized perception-cognition architecture. By incorporating a perception-centered traffic encoder and a cognition-centered LLM generator, mmTraffic achieves refined traffic interpretation with guaranteed category prediction. Extensive experiments demonstrate that mmTraffic autonomously generates high-fidelity, human-readable, and evidence-grounded traffic interpretation reports while maintaining highly competitive classification accuracy compared with specialized unimodal models (e.g., NetMamba). The source code is available at https://github.com/lgzhangzlg/Multimodal-Reasoning-with-LLM-for-Encrypted-Traffic-Interpretation-A-Benchmark
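To make the idea of byte-grounded, verifiable evidence concrete, the following is a minimal sketch of what one annotated sample and its check could look like. It is not the BGTD schema or the authors' code: the names `Evidence`, `TrafficRecord`, `verify`, the field layout, and the label `example-app` are all hypothetical, chosen only to illustrate that each claim in a description can point at a byte span and be checked against the raw bytes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """A claim tied to the byte span [start, end) of the raw bytes (hypothetical schema)."""
    start: int
    end: int
    expected_hex: str  # hex string the annotation says appears at that span
    claim: str         # human-readable statement the span is meant to support


@dataclass
class TrafficRecord:
    """One annotated sample: raw bytes, a class label, evidence items (hypothetical schema)."""
    raw: bytes
    label: str
    evidence: List[Evidence] = field(default_factory=list)


def verify(record: TrafficRecord) -> List[bool]:
    """Return, per evidence item, whether the cited span really holds the cited bytes."""
    results = []
    for ev in record.evidence:
        in_bounds = 0 <= ev.start < ev.end <= len(record.raw)
        matches = in_bounds and record.raw[ev.start:ev.end].hex() == ev.expected_hex.lower()
        results.append(matches)
    return results


# A TLS record header: content type 0x16 (handshake), version 0x0303, length 0x0005,
# followed by five placeholder payload bytes.
sample = TrafficRecord(
    raw=bytes.fromhex("1603030005") + b"\x00" * 5,
    label="example-app",  # hypothetical label
    evidence=[
        Evidence(0, 1, "16", "TLS content type is handshake"),
        Evidence(1, 3, "0301", "record version is 0x0301"),  # deliberately wrong claim
    ],
)
print(verify(sample))  # [True, False]
```

The second evidence item is intentionally inconsistent with the bytes, so the check flags it. A report whose claims all pass such a check is grounded in the bytes it cites; this says nothing about whether the claims are complete or the label is correct.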
