The success of Large Language Models (LLMs) has spurred a parallel rise in Large Multimodal Models (LMMs), such as Gemini-pro, which have begun to transform a variety of applications. These models are designed to interpret and analyze complex data, integrating textual and visual information at a scale previously unattainable. This paper investigates the applicability and effectiveness of prompt-engineered Gemini-pro LMMs versus fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct tasks: a visually evident task of detecting simple triggers, such as small squares in images, indicative of potential backdoors, and a non-visually evident task of malware classification through visual representations. Our results reveal a significant performance divergence: Gemini-pro falls short in accuracy and reliability, whereas the fine-tuned ViT models achieve near-perfect performance on both tasks. This study showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications and underscores the efficacy of fine-tuned ViT models for tasks demanding precision and dependability.