Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance; (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards, with convergence guarantees; and (iii) an adaptive parameter-learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Experiments demonstrate consistent improvements across domains: in genomics, a 6.7-percentage-point F1 gain in variant calling over BPE; in finance, a 30% Sharpe-ratio improvement. At foundation scale, we tokenize a pretraining corpus of 1.7 trillion base pairs and achieve state-of-the-art pathogen detection (94.53% MCC) while reducing token count by 15%. QA-Token thus unlocks noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation-model training with zero inference overhead.
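To make the core idea of quality-aware vocabulary construction concrete, here is a minimal illustrative sketch (not the paper's actual algorithm): a BPE-style merge-scoring step in which each occurrence of an adjacent symbol pair is weighted by per-symbol quality scores (e.g., Phred-derived confidences in genomics), so low-confidence regions contribute less to the learned vocabulary. All names and the weighting scheme here are hypothetical.

```python
import collections

def quality_weighted_merge_scores(corpus, qualities):
    """Score adjacent symbol pairs by quality-weighted frequency.

    corpus:    list of symbol sequences (lists of strings)
    qualities: parallel list of per-symbol quality scores in [0, 1]

    Standard BPE scores a candidate merge by its raw pair count; this
    sketch instead weights each occurrence by the mean quality of the
    two symbols involved, down-weighting noisy regions of the data.
    """
    scores = collections.defaultdict(float)
    for seq, q in zip(corpus, qualities):
        for i in range(len(seq) - 1):
            pair = (seq[i], seq[i + 1])
            scores[pair] += 0.5 * (q[i] + q[i + 1])
    return dict(scores)

# Example: the low-quality "G" suppresses the ("C", "G") merge score.
scores = quality_weighted_merge_scores(
    corpus=[["A", "C", "G"]],
    qualities=[[1.0, 0.5, 0.0]],
)
# scores[("A", "C")] == 0.75, scores[("C", "G")] == 0.25
```

The highest-scoring pair would then be merged and the procedure repeated, exactly as in standard BPE but with reliability folded into the selection criterion.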
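The Gumbel-Softmax relaxation mentioned in contribution (iii) replaces a hard, non-differentiable choice among discrete merge candidates with a soft sample that admits gradients, enabling end-to-end optimization. The following is a generic, self-contained sketch of the relaxation itself; the paper's actual parameterization may differ.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """Draw one relaxed sample from a categorical distribution.

    logits: unnormalized log-probabilities over discrete candidates
    tau:    temperature; lower values push the sample toward one-hot

    Adds i.i.d. Gumbel(0, 1) noise to each logit and applies a
    temperature-scaled softmax, yielding a probability vector that is
    differentiable with respect to the logits.
    """
    # Gumbel(0, 1) noise via inverse-CDF; clamp to avoid log(0).
    g = [-math.log(-math.log(max(random.random(), 1e-12)))
         for _ in logits]
    z = [(l + gi) / tau for l, gi in zip(logits, g)]
    # Numerically stable softmax.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]
```

As tau approaches 0 the samples approach one-hot vectors (a hard discrete choice); larger tau gives smoother, lower-variance gradients at the cost of bias.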