Finite-context models (FCMs) are widely used for compressing symbolic sequences such as DNA, where predictive performance depends critically on the context length k and the smoothing parameter α. In practice, these hyperparameters are typically selected by exhaustive search, which is computationally expensive and scales poorly with model complexity. This paper proposes a statistically grounded two-step sequential approach for efficient hyperparameter selection in FCMs. The key idea is to decompose the joint optimization problem into two separate stages. First, the context length k is estimated using categorical serial dependence measures, including Cramér's ν, Cohen's κ, and partial mutual information (pami). Second, the smoothing parameter α is estimated by maximum likelihood, conditional on the selected k. Simulation experiments were conducted on synthetic symbolic sequences generated by FCMs across multiple (k, α) configurations, using a four-letter alphabet and several sample sizes. The results show that the dependence measures are substantially more sensitive to variations in k than to variations in α, which supports the sequential estimation strategy. As expected, the accuracy of the hyperparameter estimates improves with increasing sample size. Furthermore, the proposed method achieves compression performance comparable to exhaustive grid search in terms of average bitrate (bits per symbol), while substantially reducing computational cost. Overall, the results on simulated data indicate that the proposed sequential approach is a practical and computationally efficient alternative to exhaustive hyperparameter tuning in FCMs.
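The ingredients named in the abstract can be illustrated with a minimal Python sketch; it is not the paper's implementation. The abstract does not give the exact estimators, so several things here are assumptions: the additive-smoothing form of the FCM probability, (count + α) / (total + α·|A|); the relative-frequency estimate of Cramér's ν at a given lag; the reading of "maximum likelihood" as choosing α to minimise the adaptive code length for a fixed k; and the candidate values for α. The function names (`cramers_v`, `bitrate`, `select_alpha`, `simulate`) and the toy source in `simulate` are hypothetical. The rule that turns a dependence profile into an estimate of k is left out, because the abstract does not describe it.

```python
import math
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # four-letter alphabet, as in the simulations


def cramers_v(seq, lag, alphabet=ALPHABET):
    """Cramér's measure between X_t and X_{t-lag}, from relative frequencies."""
    n_pairs = len(seq) - lag
    marginal = Counter(seq)
    pairs = Counter(zip(seq[lag:], seq[:-lag]))
    acc = 0.0
    for a in alphabet:
        for b in alphabet:
            p_a = marginal[a] / len(seq)
            p_b = marginal[b] / len(seq)
            if p_a > 0 and p_b > 0:
                p_ab = pairs[(a, b)] / n_pairs
                acc += (p_ab - p_a * p_b) ** 2 / (p_a * p_b)
    return math.sqrt(acc / (len(alphabet) - 1))


def bitrate(seq, k, alpha, alphabet=ALPHABET):
    """Average bits per symbol of an adaptive order-k FCM (assumed smoothing form)."""
    counts = defaultdict(Counter)  # context -> counts of symbols seen so far
    totals = Counter()             # context -> number of symbols seen so far
    bits = 0.0
    for t in range(k, len(seq)):   # the first k symbols are not coded (assumption)
        ctx, sym = seq[t - k:t], seq[t]
        p = (counts[ctx][sym] + alpha) / (totals[ctx] + alpha * len(alphabet))
        bits -= math.log2(p)
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits / (len(seq) - k)


def select_alpha(seq, k, candidates=(1 / 16, 1 / 8, 1 / 4, 1 / 2, 1, 2)):
    """For fixed k, pick the candidate alpha with the shortest code length."""
    return min(candidates, key=lambda a: bitrate(seq, k, a))


def simulate(n, k, seed=0, alphabet=ALPHABET):
    """Hypothetical order-k source: each context favours one symbol (toy data)."""
    rng = random.Random(seed)
    favourite = {}
    seq = [rng.choice(alphabet) for _ in range(k)]
    for _ in range(n - k):
        ctx = "".join(seq[len(seq) - k:])
        fav = favourite.setdefault(ctx, rng.choice(alphabet))
        seq.append(fav if rng.random() < 0.7 else rng.choice(alphabet))
    return "".join(seq)


seq = simulate(5000, k=2)
profile = {lag: cramers_v(seq, lag) for lag in range(1, 6)}  # dependence per lag
alpha_hat = select_alpha(seq, k=2)                            # step 2, with k given
rate = bitrate(seq, 2, alpha_hat)
print(profile, alpha_hat, rate)
```

Minimising the adaptive code length over α is the same as maximising the product of the model's sequential predictive probabilities, which is why it is used here as a stand-in for the likelihood step. The search over α is one-dimensional, since k is already fixed by the first stage.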