The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs - 专知论文

会员服务 ·

0

词元分析器 · 代价 · MoDELS · 均值 · GPT-5 ·

The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

翻译：暂无翻译

Olaoye Anthony Somide

from arxiv, 40 pages, 5 figures, 25 tables

Commercial large language models bill, scale latency, and budget context per token. Yet tokenizers assign more subword tokens to the same meaning in some languages than in others, so speakers of languages with high token-fertility pay a structural penalty before a model is ever invoked. This penalty is documented for multilingual settings in general, but it has not been measured systematically for African languages at the level of enterprise deployment economics and cognitive context capacity. We measure it across 20 African languages spanning five language families and three scripts (Latin, Ge'ez/Ethiopic, N'Ko; 19 appear in the primary FLORES-200+ corpus, with Nigerian Pidgin measured via MAFAND-MT only), using parallel corpora so that the language effect is isolated from content. Across 11 frontier and open tokenizers on FLORES-200+, every African language carries a tokenization premium above English (median 1.88x on GPT-5 / o200k_base, up to 8.92x for N'Ko); the penalty is largest for Ethiopic and N'Ko scripts (reaching 7-9x) and is near-invariant across corpora (FLORES vs SIB-200 Pearson r = 0.9998). Translated into deployment terms, this results in up to 8.9x inference cost and an equivalent generation-latency multiplier (N'Ko vs English on GPT-5; 7.4x for Amharic), and as little as 11% of English's effective context window. The best currently available tokenizer for African languages, Gemma 4, reduces the mean premium from 3.31x (cl100k_base) to 2.38x, but no tokenizer eliminates the penalty. We release an open measurement tool (afri-fertility), a public leaderboard, a results dataset, and mitigation guidance for African builders. The penalty falls hardest on the languages whose speakers can least afford it, a digital divide encoded directly into the subword vocabulary.

翻译：暂无翻译

0

相关内容

词元分析器

词元分析器

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

专知会员服务

7+阅读 · 5月12日

EMNLP 2024 | 大语言模型的概念知识编辑

EMNLP 2024 | 大语言模型的概念知识编辑

专知会员服务

21+阅读 · 2024年12月13日

大型语言模型（LLMs），附Slides与视频

大型语言模型（LLMs），附Slides与视频

专知会员服务

71+阅读 · 2024年6月30日

ICLR2024 | 语言模型知识编辑的鲁棒性研究

ICLR2024 | 语言模型知识编辑的鲁棒性研究

专知会员服务

18+阅读 · 2024年3月15日

RAG+LLM=？同济大学等最新《大型语言模型的检索增强生成》综述

RAG+LLM=？同济大学等最新《大型语言模型的检索增强生成》综述

专知会员服务

111+阅读 · 2023年12月19日

字节跳动李航提出AMBERT！超越BERT！多粒度token预训练语言模型

字节跳动李航提出AMBERT！超越BERT！多粒度token预训练语言模型

专知会员服务

41+阅读 · 2020年8月31日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【论文推荐】将机器语言模型扩展到人类级别的语言理解，Extending Machine Language Models toward Human-Level Language Understanding

【论文推荐】将机器语言模型扩展到人类级别的语言理解，Extending Machine Language Models toward Human-Level Language Understanding

专知会员服务

18+阅读 · 2019年12月14日

【NLP| 推荐文章】语言语音处理（Speech and Language Processing(3rd ed.draft)）

专知会员服务

15+阅读 · 2019年11月24日

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

专知会员服务

16+阅读 · 2019年10月25日

白话attention综述（上）

白话attention综述（上）

AINLP

12+阅读 · 2019年12月14日

从One-hot, Word embedding到Transformer，一步步教你理解Bert

从One-hot, Word embedding到Transformer，一步步教你理解Bert

AI100

15+阅读 · 2019年6月25日

NLP中的词向量对比：word2vec/glove/fastText/elmo/GPT/bert

NLP中的词向量对比：word2vec/glove/fastText/elmo/GPT/bert

AINLP

31+阅读 · 2019年6月1日

中文对比英文自然语言处理NLP的区别综述

中文对比英文自然语言处理NLP的区别综述

AINLP

18+阅读 · 2019年3月20日

近期语音类前沿论文

近期语音类前沿论文

深度学习每日摘要

14+阅读 · 2019年3月17日

上百种预训练中文词向量：Chinese-Word-Vectors

上百种预训练中文词向量：Chinese-Word-Vectors

AINLP

23+阅读 · 2019年2月26日

NLP Chinese Corpus：大规模中文自然语言处理语料

NLP Chinese Corpus：大规模中文自然语言处理语料

PaperWeekly

14+阅读 · 2019年2月18日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

自然语言处理 | 使用Spacy 进行自然语言处理（二）

自然语言处理 | 使用Spacy 进行自然语言处理（二）

机器学习和数学

10+阅读 · 2018年8月27日

100+中文词向量，总有一款适合你

100+中文词向量，总有一款适合你

专知

12+阅读 · 2018年5月13日

基于犹豫模糊语言信息的定性决策理论与方法

国家自然科学基金

2+阅读 · 2015年12月31日

强调与对比影响语篇理解的认知过程及其神经机制

国家自然科学基金

4+阅读 · 2015年12月31日

维吾尔语韵律结构的分析与预测模型的研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向汉语-泰语跨语言新闻事件检索方法研究

国家自然科学基金

2+阅读 · 2014年12月31日

笔迹图像中关键词语过滤技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

中国进口增长及其对国内经济发展促进作用研究

国家自然科学基金

0+阅读 · 2014年12月31日

柬埔寨语命名实体识别及汉柬双语可比语料库构建方法研究

国家自然科学基金

0+阅读 · 2014年12月31日

维吾尔文印刷文档图像中不良信息过滤关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向连续语音的哈萨克语关键词识别技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

维吾尔语单元集优化关键技术研究及其在语音识别中的应用

国家自然科学基金

0+阅读 · 2014年12月31日

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Arxiv

0+阅读 · 6月23日

Curiosity as Linguistic Intervention: Using LLM Tutoring Dialogues to Influence Exploratory Learning Behavior

Arxiv

0+阅读 · 6月21日

Fine-Tuning Large Language Models for Quantum Reasoning

Arxiv

0+阅读 · 6月20日

Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts

Arxiv

0+阅读 · 6月20日

The Language-Energy Divide: Measuring Energy Costs of Multilingual LLM Inference

Arxiv

0+阅读 · 6月20日

Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

Arxiv

0+阅读 · 6月19日

Generalization of Fine-Tuned Uncertainty Communication and Metacognition in Large Language Models

Arxiv

0+阅读 · 6月18日

A Survey of On-Policy Distillation for Large Language Models

Arxiv

0+阅读 · 6月18日

Enumeration in the lattice of $q$-decreasing words

Arxiv

0+阅读 · 6月18日

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Arxiv

0+阅读 · 6月17日

VIP会员

文章信息

相关主题

词元分析器

最新内容

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

专知会员服务

3+阅读 · 今天6:30

网状网络及其在军事领域的运用

网状网络及其在军事领域的运用

专知会员服务

4+阅读 · 今天6:18

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

专知会员服务

4+阅读 · 今天6:08

无美国参与的欧洲战争方式（万字长文）

无美国参与的欧洲战争方式（万字长文）

专知会员服务

4+阅读 · 今天5:54

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

专知会员服务

4+阅读 · 今天5:22

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

专知会员服务

5+阅读 · 今天5:15

《国防领域敏感性分析白皮书》

《国防领域敏感性分析白皮书》

专知会员服务

5+阅读 · 今天3:42

综述 | 从问答到任务完成：Agent系统与Harness设计

综述 | 从问答到任务完成：Agent系统与Harness设计

专知会员服务

4+阅读 · 6月24日

Agentic RL：框架、实践与长程智能体训练

Agentic RL：框架、实践与长程智能体训练

专知会员服务

3+阅读 · 6月24日

反无人机拦截器训练与运用课程：对美国陆军部队发展的启示

反无人机拦截器训练与运用课程：对美国陆军部队发展的启示

专知会员服务

9+阅读 · 6月24日

重新思考无人机时代的生存能力

重新思考无人机时代的生存能力

专知会员服务

8+阅读 · 6月24日

装甲突击旅：现代战争思考、战斗与组织

装甲突击旅：现代战争思考、战斗与组织

专知会员服务

6+阅读 · 6月24日

在人工智能加速决策环境中拓展OODA循环

在人工智能加速决策环境中拓展OODA循环

专知会员服务

8+阅读 · 6月24日

《廉价自杀式无人机战争的军事战略影响：乌克兰与伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰与伊朗案例研究》

专知会员服务

7+阅读 · 6月24日

军事欺骗：供作战战术指挥官使用的工具

军事欺骗：供作战战术指挥官使用的工具

专知会员服务

6+阅读 · 6月24日

相关VIP内容

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

专知会员服务

7+阅读 · 5月12日

EMNLP 2024 | 大语言模型的概念知识编辑

EMNLP 2024 | 大语言模型的概念知识编辑

专知会员服务

21+阅读 · 2024年12月13日

大型语言模型（LLMs），附Slides与视频

大型语言模型（LLMs），附Slides与视频

专知会员服务

71+阅读 · 2024年6月30日

ICLR2024 | 语言模型知识编辑的鲁棒性研究

ICLR2024 | 语言模型知识编辑的鲁棒性研究

专知会员服务

18+阅读 · 2024年3月15日

RAG+LLM=？同济大学等最新《大型语言模型的检索增强生成》综述

RAG+LLM=？同济大学等最新《大型语言模型的检索增强生成》综述

专知会员服务

111+阅读 · 2023年12月19日

字节跳动李航提出AMBERT！超越BERT！多粒度token预训练语言模型

字节跳动李航提出AMBERT！超越BERT！多粒度token预训练语言模型

专知会员服务

41+阅读 · 2020年8月31日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【论文推荐】将机器语言模型扩展到人类级别的语言理解，Extending Machine Language Models toward Human-Level Language Understanding

【论文推荐】将机器语言模型扩展到人类级别的语言理解，Extending Machine Language Models toward Human-Level Language Understanding

专知会员服务

18+阅读 · 2019年12月14日

【NLP| 推荐文章】语言语音处理（Speech and Language Processing(3rd ed.draft)）

专知会员服务

15+阅读 · 2019年11月24日

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

Natural Language Interface to Knowledge Graph (our experience) ，加州大学圣塔芭芭拉分校严锡峰副教授，CIPS ATT 16（2019）

专知会员服务

16+阅读 · 2019年10月25日

热门VIP内容

开通专知VIP会员享更多权益服务

网状网络及其在军事领域的运用

无美国参与的欧洲战争方式（万字长文）

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

相关资讯

白话attention综述（上）

白话attention综述（上）

AINLP

12+阅读 · 2019年12月14日

从One-hot, Word embedding到Transformer，一步步教你理解Bert

从One-hot, Word embedding到Transformer，一步步教你理解Bert

AI100

15+阅读 · 2019年6月25日

NLP中的词向量对比：word2vec/glove/fastText/elmo/GPT/bert

NLP中的词向量对比：word2vec/glove/fastText/elmo/GPT/bert

AINLP

31+阅读 · 2019年6月1日

中文对比英文自然语言处理NLP的区别综述

中文对比英文自然语言处理NLP的区别综述

AINLP

18+阅读 · 2019年3月20日

近期语音类前沿论文

近期语音类前沿论文

深度学习每日摘要

14+阅读 · 2019年3月17日

上百种预训练中文词向量：Chinese-Word-Vectors

上百种预训练中文词向量：Chinese-Word-Vectors

AINLP

23+阅读 · 2019年2月26日

NLP Chinese Corpus：大规模中文自然语言处理语料

NLP Chinese Corpus：大规模中文自然语言处理语料

PaperWeekly

14+阅读 · 2019年2月18日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

自然语言处理 | 使用Spacy 进行自然语言处理（二）

自然语言处理 | 使用Spacy 进行自然语言处理（二）

机器学习和数学

10+阅读 · 2018年8月27日

100+中文词向量，总有一款适合你

100+中文词向量，总有一款适合你

专知

12+阅读 · 2018年5月13日

相关论文

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Arxiv

0+阅读 · 6月23日

Curiosity as Linguistic Intervention: Using LLM Tutoring Dialogues to Influence Exploratory Learning Behavior

Arxiv

0+阅读 · 6月21日

Fine-Tuning Large Language Models for Quantum Reasoning

Arxiv

0+阅读 · 6月20日

Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts

Arxiv

0+阅读 · 6月20日

The Language-Energy Divide: Measuring Energy Costs of Multilingual LLM Inference

Arxiv

0+阅读 · 6月20日

Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

Arxiv

0+阅读 · 6月19日

Generalization of Fine-Tuned Uncertainty Communication and Metacognition in Large Language Models

Arxiv

0+阅读 · 6月18日

A Survey of On-Policy Distillation for Large Language Models

Arxiv

0+阅读 · 6月18日

Enumeration in the lattice of $q$-decreasing words

Arxiv

0+阅读 · 6月18日

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Arxiv

0+阅读 · 6月17日

相关基金

基于犹豫模糊语言信息的定性决策理论与方法

国家自然科学基金

2+阅读 · 2015年12月31日

强调与对比影响语篇理解的认知过程及其神经机制

国家自然科学基金

4+阅读 · 2015年12月31日

维吾尔语韵律结构的分析与预测模型的研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向汉语-泰语跨语言新闻事件检索方法研究

国家自然科学基金

2+阅读 · 2014年12月31日

笔迹图像中关键词语过滤技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

中国进口增长及其对国内经济发展促进作用研究

国家自然科学基金

0+阅读 · 2014年12月31日

柬埔寨语命名实体识别及汉柬双语可比语料库构建方法研究

国家自然科学基金

0+阅读 · 2014年12月31日

维吾尔文印刷文档图像中不良信息过滤关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向连续语音的哈萨克语关键词识别技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

维吾尔语单元集优化关键技术研究及其在语音识别中的应用

国家自然科学基金

0+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员