We present a federated, multi-modal phishing website detector that accepts URL, HTML, and image inputs without binding clients to a fixed modality at inference: any client can invoke any modality head, including heads trained on other clients. Methodologically, we propose role-aware bucket aggregation on top of FedProx, inspired by Mixture-of-Experts and FedMM. We replace learnable routing with hard gating (the image, HTML, or URL expert is selected by each sample's modality), so that modality-specific parameters are aggregated separately, which isolates cross-embedding conflicts and stabilizes convergence. On TR-OP, the Fusion head reaches Acc 97.5% with FPR 2.4% across two data types; on the image subset (ablation) it attains Acc 95.5% with FPR 5.9%. For text, we use GraphCodeBERT for URLs and an early three-way embedding for raw, noisy HTML. On WebPhish (HTML) we obtain Acc 96.5% / FPR 1.8%; on TR-OP (raw HTML) we obtain Acc 95.1% / FPR 4.6%. These results indicate that bucket aggregation with hard-gated experts enables stable federated training under strict privacy constraints while improving the usability and flexibility of multi-modal phishing detection.
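To make the aggregation idea concrete, the following is a minimal sketch of how role-aware bucket aggregation with hard gating might look, not the implementation evaluated here. It assumes that each parameter name carries its bucket as a prefix, that each client reports a set of roles (the modalities it trained), and that buckets are averaged with sample-count weights. The names `BUCKETS`, `bucket_of`, `aggregate_by_bucket`, `hard_gate`, and `fedprox_penalty` are hypothetical; only the proximal term follows the standard FedProx form.

```python
import numpy as np

# Hypothetical bucket names: one per modality expert plus a fusion bucket.
BUCKETS = ("url", "html", "image", "fusion")


def bucket_of(param_name):
    """Map a parameter name to its bucket by prefix, e.g. 'url.enc.w' -> 'url'."""
    prefix = param_name.split(".", 1)[0]
    return prefix if prefix in BUCKETS else "fusion"


def aggregate_by_bucket(global_params, client_updates):
    """Average each parameter only over clients whose roles include its bucket.

    client_updates is a list of (params, roles, num_samples) tuples, where
    params maps names to arrays and roles is a set of bucket names.
    Parameters whose bucket received no contribution keep their global value.
    """
    new_params = {}
    for name, global_value in global_params.items():
        bucket = bucket_of(name)
        contributions = [
            (params[name], n)
            for params, roles, n in client_updates
            if bucket in roles and name in params
        ]
        if not contributions:
            new_params[name] = global_value.copy()
            continue
        total = sum(n for _, n in contributions)
        new_params[name] = sum(value * (n / total) for value, n in contributions)
    return new_params


def hard_gate(experts, modality, x):
    """Hard gating: pick the expert by the sample's modality, no learned router."""
    return experts[modality](x)


def fedprox_penalty(local_params, global_params, mu):
    """Standard FedProx proximal term: (mu / 2) * ||w - w_global||^2."""
    squared = sum(
        float(np.sum((local_params[k] - global_params[k]) ** 2))
        for k in local_params
    )
    return 0.5 * mu * squared
```

In this sketch, a client that holds only URL data contributes nothing to the image bucket even if its update contains image parameters, which is one way to keep updates from different embeddings from being averaged together.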