In this work, we introduce FOCA, a novel multimodal framework for malware classification that jointly leverages audio and visual modalities. Unlike conventional Euclidean-based fusion methods, FOCA is the first to exploit the intrinsic hierarchical relationships between audio and visual representations within hyperbolic space. To achieve this, raw binaries are transformed into both audio and visual representations, which are then processed through three key components: (i) a hyperbolic projection module that maps Euclidean embeddings into the Poincaré ball, (ii) a hyperbolic cross-attention mechanism that aligns multimodal dependencies under curvature-aware constraints, and (iii) a Möbius addition-based fusion layer that combines the aligned representations. Comprehensive experiments on two benchmark datasets, Mal-Net and CICMalDroid2020, show that FOCA consistently outperforms unimodal models, surpasses most Euclidean multimodal baselines, and achieves state-of-the-art performance relative to existing approaches.
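To make the geometric operations named above concrete, the sketch below illustrates two of them in plain PyTorch: projecting Euclidean embeddings onto the Poincaré ball via the exponential map at the origin, and fusing two modality embeddings with Möbius addition. The curvature value, dimensions, and function names are illustrative assumptions, not the paper's actual implementation.

import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Exponential map at the origin: maps a Euclidean (tangent) vector v
    into the Poincaré ball of curvature -c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Möbius addition x (+)_c y: the curvature-aware analogue of vector
    addition, used here to fuse audio and visual embeddings on the ball."""
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-5)

# Toy usage with hypothetical audio and visual embeddings of one batch of binaries.
audio_emb = 0.1 * torch.randn(8, 256)   # Euclidean audio features (batch of 8)
visual_emb = 0.1 * torch.randn(8, 256)  # Euclidean visual features
audio_h = expmap0(audio_emb, c=1.0)     # hyperbolic projection, component (i)
visual_h = expmap0(visual_emb, c=1.0)
fused = mobius_add(audio_h, visual_h, c=1.0)  # Möbius-addition fusion, component (iii)

The cross-attention component (ii) would additionally require hyperbolic distances or tangent-space attention scores between the two projected sequences; the sketch above only covers the projection and fusion steps.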