VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices

Zi-Wei Lin, Tian-Sheuan Chang

From arXiv; accepted at ISCAS 2026.

We present VitaLLM, a mixed-precision accelerator that enables ternary-weight large language models to run efficiently on edge devices. The design combines two compute cores: a multiplier-free TINT core for ternary-INT projections, and a BoothFlex core that reuses a radix-4 Booth datapath for both INT8$\times$INT8 attention and ternary-INT computation, sustaining utilization without duplicating arrays. A predictive sparse attention mechanism employs a leading-one (LO) surrogate with a comparison-free top-$K$ selector to prune key/value (KV) fetches by roughly $1-K/M$ for $M$ cached tokens, confining exact attention to $K$ candidates. System-level integration uses head-level pipelining and an absmax-based quantization barrier to standardize cross-core interfaces and overlap nonlinear reductions with linear tiles. A 16 nm silicon prototype at 1 GHz/0.8 V achieves 72.46 tokens/s in decode and 0.88 s prefill (64 tokens) within 0.214 mm$^2$ and 120 KB of on-chip memory, while reducing KV traffic and improving utilization in ablations. These results demonstrate practical BitNet b1.58 (3B) inference on edge-class platforms and provide a compact blueprint for future mixed-precision LLM accelerators.
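The abstract names several arithmetic ideas without showing them, so the following is a minimal software sketch of how they might fit together. It is not the authors' hardware design. All function names are hypothetical. Applying the leading-one approximation to both query and key is an assumption, and a plain sort stands in for the comparison-free top-$K$ selector, whose circuit the abstract does not describe.

```python
import math


def absmax_quantize(x, bits=8):
    """Map a float vector to signed integers, scaled by its absolute maximum."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    amax = max(abs(v) for v in x)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in x], scale


def ternary_dot(weights, acts):
    """Multiplier-free dot product: a weight in {-1, 0, +1} selects
    subtract, skip, or add for the matching integer activation."""
    acc = 0
    for w, a in zip(weights, acts):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc


def leading_one(v):
    """Keep only the leading one bit of |v| (a signed power of two)."""
    if v == 0:
        return 0
    mag = 1 << (abs(v).bit_length() - 1)
    return mag if v > 0 else -mag


def surrogate_score(q, k):
    """Cheap estimate of q.k. Both operands are powers of two, so each
    product would be a shift in hardware (assumed; written as * here)."""
    return sum(leading_one(a) * leading_one(b) for a, b in zip(q, k))


def sparse_attention(q, keys, values, k_top):
    """Rank all M keys with the surrogate, then run exact softmax
    attention on the K best candidates only."""
    scores = [surrogate_score(q, key) for key in keys]
    # Stand-in for the comparison-free top-K selector: an ordinary sort.
    cand = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k_top]
    # Only these K keys and values would need to be fetched at full precision.
    exact = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(len(q)) for i in cand]
    peak = max(exact)
    exps = [math.exp(s - peak) for s in exact]
    total = sum(exps)
    out = [0.0] * len(values[0])
    for e, i in zip(exps, cand):
        for d in range(len(out)):
            out[d] += (e / total) * values[i][d]
    return sorted(cand), out


def kv_fetch_reduction(k_top, m):
    """Fraction of exact KV fetches avoided: 1 - K/M."""
    return 1 - k_top / m
```

For example, with $K = 16$ candidates out of $M = 64$ cached tokens, `kv_fetch_reduction(16, 64)` evaluates to `0.75`, meaning three quarters of the exact KV fetches are skipped. The surrogate pass still has to look at every key in some reduced form, which is presumably why the abstract says "roughly" $1-K/M$.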
