This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware-software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, the methodology improves model performance through fine-tuning for domain-specific adaptation. It further applies hardware and software optimizations to the resulting MFMs. Specifically, it compresses MFMs using hierarchy-aware mixed-precision quantization and structural pruning of transformer blocks and MLP channels. It also optimizes operations through speculative decoding, through model cascading, which routes queries through a small-to-large cascade and uses lightweight self-tests to decide when to escalate to a larger model, and through co-optimization of sequence length, visual resolution and stride, and graph-level operator fusion. To execute the model efficiently, the processing dataflow is tailored to the underlying hardware architecture and paired with memory-efficient attention to meet on-chip bandwidth and latency budgets. These techniques are supported by a specialized hardware accelerator for transformer workloads, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking MFMs.
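As an illustration of the model cascading idea mentioned above, the following is a minimal sketch under stated assumptions, not the authors' implementation. The names `Stage`, `cascade`, and `threshold`, the toy generators, and the word-count self-test are all hypothetical stand-ins; a real system would call actual small and large MFMs and use a learned or heuristic confidence check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One model in the cascade (hypothetical structure for illustration)."""
    name: str
    generate: Callable[[str], str]           # produces an answer for a query
    self_test: Callable[[str, str], float]   # confidence in [0, 1] for (query, answer)


def cascade(query: str, stages: List[Stage], threshold: float = 0.8) -> Tuple[str, str]:
    """Try stages from smallest to largest; escalate when the self-test fails.

    Returns (stage_name, answer). The last stage is always accepted, since
    there is no larger model left to escalate to.
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    for stage in stages[:-1]:
        answer = stage.generate(query)
        if stage.self_test(query, answer) >= threshold:
            return stage.name, answer
    last = stages[-1]
    return last.name, last.generate(query)


# Toy stand-ins (assumptions): the "models" just tag the query, and the
# self-test treats short queries as easy enough for the small model.
def small_generate(query: str) -> str:
    return f"small:{query}"


def large_generate(query: str) -> str:
    return f"large:{query}"


def length_self_test(query: str, answer: str) -> float:
    return 1.0 if len(query.split()) <= 4 else 0.2


STAGES = [
    Stage("small", small_generate, length_self_test),
    Stage("large", large_generate, lambda query, answer: 1.0),
]

if __name__ == "__main__":
    print(cascade("what is 2+2", STAGES))
    print(cascade("explain the findings in this chest x-ray report", STAGES))
```

In this sketch the cost saving comes from the early return: the large model runs only when the small model's self-test score falls below `threshold`, so the threshold trades answer quality against average compute.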