Tokenization with Split Trees - 专知论文

会员服务 ·

0

词元分析器 · 推断 · 词表 · 可约的 · MoDELS ·

Tokenization with Split Trees

翻译：暂无翻译

Craig W. Schmidt,Michael Krumdick,Adam Wiemerslage,Seth Ebner,Varshini Reddy,Yuval Pinter,Chris Tanner

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%--7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.

翻译：暂无翻译

0

相关内容

词元分析器

词元分析器

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

专知会员服务

7+阅读 · 5月12日

《概率结果下全局最优决策的高效树生成方法》最新30页报告

《概率结果下全局最优决策的高效树生成方法》最新30页报告

专知会员服务

17+阅读 · 2025年5月6日

AAAI 2025 | 基于模态分词的细粒度实体表示学习框架

AAAI 2025 | 基于模态分词的细粒度实体表示学习框架

专知会员服务

27+阅读 · 2024年12月26日

【KDD2023】面向高效 Transformer 推断的约束感知与排序蒸馏Token剪枝

【KDD2023】面向高效 Transformer 推断的约束感知与排序蒸馏Token剪枝

专知会员服务

21+阅读 · 2023年6月28日

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

专知会员服务

15+阅读 · 2020年3月7日

【AAAI2020论文-NUS】用于联合实体和关系提取的编译码结构的有效建模

【AAAI2020论文-NUS】用于联合实体和关系提取的编译码结构的有效建模

专知会员服务

22+阅读 · 2019年11月22日

【AAAI2020论文】隐私保留GBDT（Privacy-Preserving Gradient Boosting Decision Trees）

【AAAI2020论文】隐私保留GBDT（Privacy-Preserving Gradient Boosting Decision Trees）

专知会员服务

36+阅读 · 2019年11月15日

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

10+阅读 · 2019年10月24日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

一文读懂Attention机制

一文读懂Attention机制

机器学习与推荐算法

63+阅读 · 2020年6月9日

Github项目推荐 | Sentence Classification - 神经网络句子分类(陈述/疑问/感叹/祈使)

Github项目推荐 | Sentence Classification - 神经网络句子分类(陈述/疑问/感叹/祈使)

AI研习社

14+阅读 · 2019年1月16日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

手把手教 | 深度学习库PyTorch（附代码）

手把手教 | 深度学习库PyTorch（附代码）

数据派THU

27+阅读 · 2018年3月15日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

论文浅尝 | Question Answering over Freebase

论文浅尝 | Question Answering over Freebase

开放知识图谱

19+阅读 · 2018年1月9日

【关关的刷题日记53】 Leetcode 100. Same Tree

【关关的刷题日记53】 Leetcode 100. Same Tree

专知

10+阅读 · 2017年12月1日

TensorFlow seq2seq中的Attention机制（续）

TensorFlow seq2seq中的Attention机制（续）

深度学习每日摘要

15+阅读 · 2017年11月16日

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

KingsGarden

13+阅读 · 2017年7月16日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

基于参数和结构优化的置信规则库推理方法研究

国家自然科学基金

5+阅读 · 2015年12月31日

树上生灭过程收敛速度及p-Laplacian特征值估计

国家自然科学基金

0+阅读 · 2015年12月31日

关联规则集上的知识发现

国家自然科学基金

9+阅读 · 2015年12月31日

基于地面三维激光扫描技术的树木三维建模与参数提取研究

国家自然科学基金

0+阅读 · 2015年12月31日

双群体涌现的智能虚拟根系建模与仿真研究

国家自然科学基金

0+阅读 · 2015年12月31日

约束最小生成树及其在容迟容断网络中的应用

国家自然科学基金

0+阅读 · 2014年12月31日

实验模式动物树鼩MyD88信号通路的初步研究

国家自然科学基金

0+阅读 · 2014年12月31日

抗干扰的农作物种植模式自动提取方法

国家自然科学基金

0+阅读 · 2014年12月31日

随机排队网络的强逼近及其相关渐近分析

国家自然科学基金

0+阅读 · 2014年12月31日

基于贝叶斯推理的模糊逻辑强化学习模型研究

国家自然科学基金

18+阅读 · 2012年12月31日

Simultaneous Latent Budget Trees for Stratified Classification

Arxiv

0+阅读 · 6月11日

Enumerating tuples of spanning trees

Arxiv

0+阅读 · 6月9日

Tree Containment Parameterized by Scanwidth

Arxiv

0+阅读 · 6月5日

On the Hardness of Optimal Motion on Trees

Arxiv

0+阅读 · 6月4日

Bonsai: Compiling Queries to Pruned Tree Traversals

Arxiv

0+阅读 · 6月3日

Forecasting with Hyper-Trees

Arxiv

0+阅读 · 5月29日

HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Arxiv

0+阅读 · 5月28日

The Rate-Immediacy Barrier in Explicit Tree Code Constructions

Arxiv

0+阅读 · 5月22日

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Arxiv

0+阅读 · 5月21日

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

Arxiv

0+阅读 · 5月13日

VIP会员

文章信息

相关主题

词元分析器

最新内容

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

0+阅读 · 6分钟前

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

2+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

2+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

2+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

5+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

6+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

3+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

4+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

4+阅读 · 6月17日

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

4+阅读 · 6月17日

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

专知会员服务

2+阅读 · 6月17日

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

专知会员服务

5+阅读 · 6月16日

多模态代码智能综述：从视觉输入到可执行代码系统

多模态代码智能综述：从视觉输入到可执行代码系统

专知会员服务

7+阅读 · 6月16日

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

专知会员服务

5+阅读 · 6月16日

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

《面向导弹有效发射时机的监督机器学习方法：基于超视距空战仿真》

专知会员服务

6+阅读 · 6月16日

相关VIP内容

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

[ICML 2026] SOL：让大模型把算力花在关键Token上：自优化语言模型

专知会员服务

7+阅读 · 5月12日

《概率结果下全局最优决策的高效树生成方法》最新30页报告

《概率结果下全局最优决策的高效树生成方法》最新30页报告

专知会员服务

17+阅读 · 2025年5月6日

AAAI 2025 | 基于模态分词的细粒度实体表示学习框架

AAAI 2025 | 基于模态分词的细粒度实体表示学习框架

专知会员服务

27+阅读 · 2024年12月26日

【KDD2023】面向高效 Transformer 推断的约束感知与排序蒸馏Token剪枝

【KDD2023】面向高效 Transformer 推断的约束感知与排序蒸馏Token剪枝

专知会员服务

21+阅读 · 2023年6月28日

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

【SIGMOD2020-CMU】在内存中搜索树的顺序保持键压缩，Order-Preserving Key Compression for In-Memory Search Trees

专知会员服务

15+阅读 · 2020年3月7日

【AAAI2020论文-NUS】用于联合实体和关系提取的编译码结构的有效建模

【AAAI2020论文-NUS】用于联合实体和关系提取的编译码结构的有效建模

专知会员服务

22+阅读 · 2019年11月22日

【AAAI2020论文】隐私保留GBDT（Privacy-Preserving Gradient Boosting Decision Trees）

【AAAI2020论文】隐私保留GBDT（Privacy-Preserving Gradient Boosting Decision Trees）

专知会员服务

36+阅读 · 2019年11月15日

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

10+阅读 · 2019年10月24日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

学习数据的几何：形状空间分析数学综述

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

相关资讯

一文读懂Attention机制

一文读懂Attention机制

机器学习与推荐算法

63+阅读 · 2020年6月9日

Github项目推荐 | Sentence Classification - 神经网络句子分类(陈述/疑问/感叹/祈使)

Github项目推荐 | Sentence Classification - 神经网络句子分类(陈述/疑问/感叹/祈使)

AI研习社

14+阅读 · 2019年1月16日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

手把手教 | 深度学习库PyTorch（附代码）

手把手教 | 深度学习库PyTorch（附代码）

数据派THU

27+阅读 · 2018年3月15日

Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

统计学习与视觉计算组

12+阅读 · 2018年3月15日

论文浅尝 | Question Answering over Freebase

论文浅尝 | Question Answering over Freebase

开放知识图谱

19+阅读 · 2018年1月9日

【关关的刷题日记53】 Leetcode 100. Same Tree

【关关的刷题日记53】 Leetcode 100. Same Tree

专知

10+阅读 · 2017年12月1日

TensorFlow seq2seq中的Attention机制（续）

TensorFlow seq2seq中的Attention机制（续）

深度学习每日摘要

15+阅读 · 2017年11月16日

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

IJCAI | Cascade Dynamics Modeling with Attention-based RNN

KingsGarden

13+阅读 · 2017年7月16日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

相关论文

Simultaneous Latent Budget Trees for Stratified Classification

Arxiv

0+阅读 · 6月11日

Enumerating tuples of spanning trees

Arxiv

0+阅读 · 6月9日

Tree Containment Parameterized by Scanwidth

Arxiv

0+阅读 · 6月5日

On the Hardness of Optimal Motion on Trees

Arxiv

0+阅读 · 6月4日

Bonsai: Compiling Queries to Pruned Tree Traversals

Arxiv

0+阅读 · 6月3日

Forecasting with Hyper-Trees

Arxiv

0+阅读 · 5月29日

HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Arxiv

0+阅读 · 5月28日

The Rate-Immediacy Barrier in Explicit Tree Code Constructions

Arxiv

0+阅读 · 5月22日

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Arxiv

0+阅读 · 5月21日

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

Arxiv

0+阅读 · 5月13日

相关基金

基于参数和结构优化的置信规则库推理方法研究

国家自然科学基金

5+阅读 · 2015年12月31日

树上生灭过程收敛速度及p-Laplacian特征值估计

国家自然科学基金

0+阅读 · 2015年12月31日

关联规则集上的知识发现

国家自然科学基金

9+阅读 · 2015年12月31日

基于地面三维激光扫描技术的树木三维建模与参数提取研究

国家自然科学基金

0+阅读 · 2015年12月31日

双群体涌现的智能虚拟根系建模与仿真研究

国家自然科学基金

0+阅读 · 2015年12月31日

约束最小生成树及其在容迟容断网络中的应用

国家自然科学基金

0+阅读 · 2014年12月31日

实验模式动物树鼩MyD88信号通路的初步研究

国家自然科学基金

0+阅读 · 2014年12月31日

抗干扰的农作物种植模式自动提取方法

国家自然科学基金

0+阅读 · 2014年12月31日

随机排队网络的强逼近及其相关渐近分析

国家自然科学基金

0+阅读 · 2014年12月31日

基于贝叶斯推理的模糊逻辑强化学习模型研究

国家自然科学基金

18+阅读 · 2012年12月31日

微信扫码咨询专知VIP会员