Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines - 专知论文

会员服务 ·

0

TPU · Gemma · GPU · Google · 基准 ·

Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

翻译：暂无翻译

Jatin Kishnani,Mayank Goel,Amit Singh,Pulkit Agrawal,Sairanjan Mishra

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe, built on PyTorch, HuggingFace TRL, and FSDP, to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup necessary to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput profile. Compared with a 2xH100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. Inference throughput is within 3% across platforms, while TPU achieves 2x lower time-to-first-token (235 ms vs. 475 ms). Together, the TPU configuration is 1.82x cheaper for a representative train-plus-service workload. Our work removes a critical gap in the open tooling ecosystem and provides practitioners with a reproducible, production-ready recipe for Gemma 4 deployment on TPU infrastructure.

翻译：暂无翻译

0

相关内容

TPU

TPAMI 2024 | 计算机视觉中基于图神经网络和图Transformers的方法和最新进展

TPAMI 2024 | 计算机视觉中基于图神经网络和图Transformers的方法和最新进展

专知会员服务

22+阅读 · 2024年9月11日

大模型安全性，Google DeepMind Nicholas Carlini，附191页slides与视频

大模型安全性，Google DeepMind Nicholas Carlini，附191页slides与视频

专知会员服务

31+阅读 · 2024年7月15日

MiniGPT-4：使用先进的大型语言模型提升 AI 视觉语言理解能力

MiniGPT-4：使用先进的大型语言模型提升 AI 视觉语言理解能力

专知会员服务

42+阅读 · 2023年10月1日

《专用数据处理器（DPU）性能基准评测方法与实现》技术白皮书发布，104页pdf

《专用数据处理器（DPU）性能基准评测方法与实现》技术白皮书发布，104页pdf

专知会员服务

61+阅读 · 2022年8月1日

【AAAI2022】可解释性ViT登场，谷歌AI提出层次嵌套Transformer模型

【AAAI2022】可解释性ViT登场，谷歌AI提出层次嵌套Transformer模型

专知会员服务

29+阅读 · 2022年1月28日

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

专知会员服务

91+阅读 · 2021年10月24日

Google-EfficientNet v2来了！更快，更小，更强！

Google-EfficientNet v2来了！更快，更小，更强！

专知会员服务

19+阅读 · 2021年4月4日

【阿里巴巴达摩院】TResNet: 高性能的GPU专用架构，GPU-Dedicated Architecture

【阿里巴巴达摩院】TResNet: 高性能的GPU专用架构，GPU-Dedicated Architecture

专知会员服务

33+阅读 · 2020年4月1日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

这个项目火了！各种深度学习架构，模型和技巧的集合

这个项目火了！各种深度学习架构，模型和技巧的集合

大数据技术

14+阅读 · 2019年6月13日

谷歌EfficientNet缩放模型，PyTorch实现登热榜

谷歌EfficientNet缩放模型，PyTorch实现登热榜

机器学习算法与Python学习

11+阅读 · 2019年6月4日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

超越标准 GNN ！DeepMind、谷歌提出图匹配网络| ICML最新论文

超越标准 GNN ！DeepMind、谷歌提出图匹配网络| ICML最新论文

新智元

20+阅读 · 2019年5月6日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

AINLP

12+阅读 · 2018年11月12日

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

专知

27+阅读 · 2017年12月17日

【下载】最新TensorFlow专业深度学习实战书籍和代码《Pro Deep Learning with TensorFlow》

【下载】最新TensorFlow专业深度学习实战书籍和代码《Pro Deep Learning with TensorFlow》

专知

37+阅读 · 2017年12月16日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

面向云计算的同态密码关键技术研究

国家自然科学基金

0+阅读 · 2017年12月31日

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于深度学习的海量截获卫星数据分析技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

噪声不确定下基于计算智能的多跳认知无线电网络协作频谱感知优化

国家自然科学基金

0+阅读 · 2015年12月31日

复杂非完整多自主体网络协同算法设计与性能极限分析

国家自然科学基金

1+阅读 · 2015年12月31日

移动云计算复杂网络环境下任务粒度的应用划分和调度方法

国家自然科学基金

0+阅读 · 2015年12月31日

超高速CMOS数模转换器关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

5G极化码译码算法理论与实现关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

三维片上网络芯片关键设计技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

40纳米工艺MOSFET器件毫米波建模和低功耗电路设计

国家自然科学基金

0+阅读 · 2014年12月31日

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Arxiv

0+阅读 · 6月11日

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Arxiv

0+阅读 · 6月3日

RAFI -- A Ray/Work Forwarding Infrastructure for Data Parallel Multi-Node/Multi-GPU Computing

Arxiv

0+阅读 · 5月28日

ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs

Arxiv

0+阅读 · 5月15日

Network-Adaptive Cloud Processing for Visual Neuroprostheses

Arxiv

0+阅读 · 5月10日

Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks

Arxiv

10+阅读 · 2022年2月10日

Fine-tune BERT for Extractive Summarization

Arxiv

21+阅读 · 2019年3月25日

Neural Approaches to Conversational AI

Arxiv

26+阅读 · 2018年9月21日

Generative Adversarial Autoencoder Networks

Arxiv

11+阅读 · 2018年3月23日

A Deep Reinforcement Learning Chatbot (Short Version)

Arxiv

13+阅读 · 2018年1月20日

VIP会员

文章信息

相关主题

最新内容

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

0+阅读 · 32分钟前

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

1+阅读 · 49分钟前

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

1+阅读 · 52分钟前

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

1+阅读 · 54分钟前

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

1+阅读 · 今天13:13

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

专知会员服务

0+阅读 · 今天13:10

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

专知会员服务

5+阅读 · 6月16日

多模态代码智能综述：从视觉输入到可执行代码系统

多模态代码智能综述：从视觉输入到可执行代码系统

专知会员服务

7+阅读 · 6月16日

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

专知会员服务

5+阅读 · 6月16日

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

专知会员服务

5+阅读 · 6月16日

《通用大语言模型：无人机指挥与控制接口》最新40页

《通用大语言模型：无人机指挥与控制接口》最新40页

专知会员服务

15+阅读 · 6月16日

《通过小型无人机系统将情报能力“作战化”》

《通过小型无人机系统将情报能力“作战化”》

专知会员服务

6+阅读 · 6月16日

《神经安全型有人–无人协同：面向认知自适应作战能力的参考架构》

《神经安全型有人–无人协同：面向认知自适应作战能力的参考架构》

专知会员服务

10+阅读 · 6月16日

《在指挥链中通过多准则决策分析传达指挥官意图：空战实验》

《在指挥链中通过多准则决策分析传达指挥官意图：空战实验》

专知会员服务

21+阅读 · 6月15日

消耗优势：美军的“精确规模化”概念

消耗优势：美军的“精确规模化”概念

专知会员服务

8+阅读 · 6月15日

相关VIP内容

TPAMI 2024 | 计算机视觉中基于图神经网络和图Transformers的方法和最新进展

TPAMI 2024 | 计算机视觉中基于图神经网络和图Transformers的方法和最新进展

专知会员服务

22+阅读 · 2024年9月11日

大模型安全性，Google DeepMind Nicholas Carlini，附191页slides与视频

大模型安全性，Google DeepMind Nicholas Carlini，附191页slides与视频

专知会员服务

31+阅读 · 2024年7月15日

MiniGPT-4：使用先进的大型语言模型提升 AI 视觉语言理解能力

MiniGPT-4：使用先进的大型语言模型提升 AI 视觉语言理解能力

专知会员服务

42+阅读 · 2023年10月1日

《专用数据处理器（DPU）性能基准评测方法与实现》技术白皮书发布，104页pdf

《专用数据处理器（DPU）性能基准评测方法与实现》技术白皮书发布，104页pdf

专知会员服务

61+阅读 · 2022年8月1日

【AAAI2022】可解释性ViT登场，谷歌AI提出层次嵌套Transformer模型

【AAAI2022】可解释性ViT登场，谷歌AI提出层次嵌套Transformer模型

专知会员服务

29+阅读 · 2022年1月28日

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

专知会员服务

91+阅读 · 2021年10月24日

Google-EfficientNet v2来了！更快，更小，更强！

Google-EfficientNet v2来了！更快，更小，更强！

专知会员服务

19+阅读 · 2021年4月4日

【阿里巴巴达摩院】TResNet: 高性能的GPU专用架构，GPU-Dedicated Architecture

【阿里巴巴达摩院】TResNet: 高性能的GPU专用架构，GPU-Dedicated Architecture

专知会员服务

33+阅读 · 2020年4月1日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

从燃煤战舰到算法战争：水面指挥的永恒要求

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

定向能反无人机系统最新发展动态

《短程弹道再入飞行器拦截时间中的一项异常现象》

相关资讯

这个项目火了！各种深度学习架构，模型和技巧的集合

这个项目火了！各种深度学习架构，模型和技巧的集合

大数据技术

14+阅读 · 2019年6月13日

谷歌EfficientNet缩放模型，PyTorch实现登热榜

谷歌EfficientNet缩放模型，PyTorch实现登热榜

机器学习算法与Python学习

11+阅读 · 2019年6月4日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

超越标准 GNN ！DeepMind、谷歌提出图匹配网络| ICML最新论文

超越标准 GNN ！DeepMind、谷歌提出图匹配网络| ICML最新论文

新智元

20+阅读 · 2019年5月6日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

AINLP

12+阅读 · 2018年11月12日

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

专知

27+阅读 · 2017年12月17日

【下载】最新TensorFlow专业深度学习实战书籍和代码《Pro Deep Learning with TensorFlow》

【下载】最新TensorFlow专业深度学习实战书籍和代码《Pro Deep Learning with TensorFlow》

专知

37+阅读 · 2017年12月16日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

相关论文

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Arxiv

0+阅读 · 6月11日

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Arxiv

0+阅读 · 6月3日

RAFI -- A Ray/Work Forwarding Infrastructure for Data Parallel Multi-Node/Multi-GPU Computing

Arxiv

0+阅读 · 5月28日

ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs

Arxiv

0+阅读 · 5月15日

Network-Adaptive Cloud Processing for Visual Neuroprostheses

Arxiv

0+阅读 · 5月10日

Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks

Arxiv

10+阅读 · 2022年2月10日

Fine-tune BERT for Extractive Summarization

Arxiv

21+阅读 · 2019年3月25日

Neural Approaches to Conversational AI

Arxiv

26+阅读 · 2018年9月21日

Generative Adversarial Autoencoder Networks

Arxiv

11+阅读 · 2018年3月23日

A Deep Reinforcement Learning Chatbot (Short Version)

Arxiv

13+阅读 · 2018年1月20日

相关基金

面向云计算的同态密码关键技术研究

国家自然科学基金

0+阅读 · 2017年12月31日

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于深度学习的海量截获卫星数据分析技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

噪声不确定下基于计算智能的多跳认知无线电网络协作频谱感知优化

国家自然科学基金

0+阅读 · 2015年12月31日

复杂非完整多自主体网络协同算法设计与性能极限分析

国家自然科学基金

1+阅读 · 2015年12月31日

移动云计算复杂网络环境下任务粒度的应用划分和调度方法

国家自然科学基金

0+阅读 · 2015年12月31日

超高速CMOS数模转换器关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

5G极化码译码算法理论与实现关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

三维片上网络芯片关键设计技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

40纳米工艺MOSFET器件毫米波建模和低功耗电路设计

国家自然科学基金

0+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员