Hierarchical sequence models replace fixed tokenization with learned segmentations that compress long byte sequences for efficient autoregressive modeling. Although recent end-to-end methods can learn meaningful boundaries from the language-modeling objective alone, it remains difficult to quantitatively assess, and systematically steer, where compute is spent. We introduce a router-agnostic metric of boundary quality, boundary enrichment B, which measures how strongly chunk starts concentrate on positions with high next-byte surprisal. Guided by this metric, we propose Sombrero, which steers boundary placement toward predictive difficulty via a confidence-alignment boundary loss and stabilizes boundary learning by applying confidence-weighted smoothing at the input level rather than on realized chunks. At the 1B-parameter scale, across UTF-8 corpora covering English and German text as well as code and mathematical content, Sombrero improves the accuracy-efficiency trade-off and yields boundaries that more consistently align compute with hard-to-predict positions.
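The abstract does not give a formula for boundary enrichment B, but one plausible formalization of "how strongly chunk starts concentrate on high-surprisal positions" is the ratio of mean next-byte surprisal at boundary positions to the mean over all positions. The sketch below is an illustrative assumption, not the paper's definition; the function name and the ratio form are hypothetical.

```python
from statistics import fmean

def boundary_enrichment(surprisal, boundary_starts):
    """Hypothetical enrichment score: mean surprisal at chunk-start
    positions divided by mean surprisal over the whole sequence.
    B > 1 means boundaries concentrate on hard-to-predict bytes;
    B = 1 means boundaries are placed no better than uniformly at random.

    surprisal       -- per-position next-byte surprisal (e.g. -log p), in nats or bits
    boundary_starts -- indices where the router opens a new chunk
    """
    overall_mean = fmean(surprisal)
    boundary_mean = fmean(surprisal[i] for i in boundary_starts)
    return boundary_mean / overall_mean

# Toy example: boundaries at the two high-surprisal positions.
s = [1.0, 1.0, 4.0, 1.0, 4.0, 1.0]
B = boundary_enrichment(s, [2, 4])  # boundary mean 4.0, overall mean 2.0 -> B = 2.0
```

Because the score is a ratio of averages over the same surprisal values, it does not depend on how the router produces boundaries, which matches the "router-agnostic" framing in the abstract.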