We study the Collatz total stopping time $\tau(n)$, the number of steps the Collatz map takes to reach 1 from $n$, for $n\le 10^7$ from a probabilistic machine learning viewpoint. Empirically, $\tau(n)$ is a skewed, heavily overdispersed count with pronounced arithmetic heterogeneity. We develop two complementary models. First, a Bayesian hierarchical Negative Binomial regression (NB2-GLM) predicts $\tau(n)$ from simple covariates ($\log n$ and the residue class $n \bmod 8$) and quantifies uncertainty through the posterior and posterior predictive distributions. Second, we propose a mechanistic generative approximation based on the odd-block decomposition: for odd $m$, write $3m+1=2^{K(m)}m'$ with $m'$ odd and $K(m)=v_2(3m+1)\ge 1$; replacing these block lengths with random draws yields a stochastic approximation, which we calibrate via a Dirichlet-multinomial update. On held-out data, the NB2-GLM achieves substantially higher predictive likelihood than the odd-block generators. Conditioning the block-length distribution on $m\bmod 8$ markedly improves the generator's distributional fit, indicating that low-order modular structure is a key driver of heterogeneity in $\tau(n)$.
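To make the odd-block decomposition in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It computes $K(m)=v_2(3m+1)$ along a trajectory, checks that the blocks account for every step of the total stopping time, and applies a conjugate Dirichlet-multinomial update to the block-length counts grouped by $m \bmod 8$. The function names, the symmetric prior weight `alpha`, and the truncation `k_max` (all $K \ge$ `k_max` lumped into one category) are assumptions introduced for illustration only.

```python
from collections import Counter


def v2(x: int) -> int:
    """2-adic valuation of a positive integer x."""
    return (x & -x).bit_length() - 1


def total_stopping_time(n: int) -> int:
    """Number of Collatz steps (n -> n/2 or n -> 3n+1) needed to reach 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def odd_blocks(n: int):
    """Return (v2(n), [(m, K(m)), ...]) over the odd iterates m > 1 of n."""
    k0 = v2(n)          # leading halvings before the first odd iterate
    m = n >> k0
    blocks = []
    while m != 1:
        x = 3 * m + 1
        k = v2(x)       # block length K(m) = v2(3m + 1) >= 1
        blocks.append((m, k))
        m = x >> k      # next odd iterate m'
    return k0, blocks


def tau_from_blocks(n: int) -> int:
    """Each odd block costs one 3m+1 step plus K(m) halvings."""
    k0, blocks = odd_blocks(n)
    return k0 + sum(1 + k for _, k in blocks)


def fit_block_length_model(ns, k_max: int = 8, alpha: float = 1.0):
    """Posterior mean of P(K = k | m mod 8) under a symmetric Dirichlet prior.

    Assumption for this sketch: categories are k = 1..k_max, with every
    K >= k_max lumped into the last category.
    """
    counts = {r: Counter() for r in (1, 3, 5, 7)}
    for n in ns:
        for m, k in odd_blocks(n)[1]:
            counts[m % 8][min(k, k_max)] += 1
    posterior = {}
    for r, c in counts.items():
        total = sum(c.values()) + alpha * k_max
        posterior[r] = {k: (c[k] + alpha) / total for k in range(1, k_max + 1)}
    return posterior


if __name__ == "__main__":
    print(odd_blocks(7))
    print(tau_from_blocks(27), total_stopping_time(27))
    for residue, probs in fit_block_length_model(range(1, 10_000)).items():
        print(residue, {k: round(p, 3) for k, p in probs.items()})
```

One elementary reason the conditioning helps: if $m \equiv 3$ or $7 \pmod 8$ then $K(m)=1$, if $m \equiv 1 \pmod 8$ then $K(m)=2$, and only $m \equiv 5 \pmod 8$ leaves $K(m)\ge 3$ undetermined, so the residue class fixes the block length exactly in three of the four odd classes.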