Estimating the entropy rate of discrete time series is a challenging problem with important applications in numerous areas, including neuroscience, genomics, image processing, and natural language processing. A number of approaches have been developed for this task, typically based either on universal data compression algorithms or on statistical estimators of the underlying process distribution. In this work, we propose a fully Bayesian approach to entropy estimation. Building on the recently introduced Bayesian Context Trees (BCT) framework for modelling discrete time series as variable-memory Markov chains, we show that it is possible to sample directly from the induced posterior on the entropy rate. These samples can be used to estimate the entire posterior distribution, which provides much richer information than point estimates. We develop theoretical results for the posterior distribution of the entropy rate, including proofs of consistency and asymptotic normality. The practical utility of the method is illustrated on both simulated and real-world data, where it is found to outperform state-of-the-art alternatives.
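To make the idea of sampling from the induced posterior on the entropy rate concrete, the following is a minimal sketch in a deliberately simplified setting. It is not the BCT method: it assumes a fixed first-order Markov chain rather than a posterior over variable-memory context trees, and it assumes independent symmetric Dirichlet priors (parameter 1/2) on each row of the transition matrix. All function names are hypothetical. Each posterior draw of the transition matrix is mapped to its entropy rate, $H = -\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}$, where $\pi$ is the stationary distribution of $P$.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def stationary_distribution(P, iters=500):
    """Approximate the stationary distribution of P by power iteration."""
    m = len(P)
    pi = [1.0 / m] * m
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
    return pi


def entropy_rate(P):
    """Entropy rate (bits per symbol) of a first-order chain with matrix P."""
    pi = stationary_distribution(P)
    return -sum(
        pi[i] * p * math.log2(p)
        for i in range(len(P))
        for p in P[i]
        if p > 0
    )


def posterior_entropy_samples(x, m, n_samples=1000, prior=0.5, seed=0):
    """Sample the entropy-rate posterior for a sequence x over {0, ..., m-1}.

    Assumption (not from the paper): first-order chain, with an independent
    Dirichlet(prior, ..., prior) prior on every row of the transition matrix.
    """
    rng = random.Random(seed)
    # Transition counts: counts[a][b] = number of times b follows a.
    counts = [[0] * m for _ in range(m)]
    for a, b in zip(x, x[1:]):
        counts[a][b] += 1
    samples = []
    for _ in range(n_samples):
        # Conjugacy: each row's posterior is Dirichlet(counts + prior).
        P = [sample_dirichlet([c + prior for c in row], rng) for row in counts]
        samples.append(entropy_rate(P))
    return samples
```

A histogram of the returned samples approximates the posterior of the entropy rate under these simplifying assumptions, and their mean or quantiles give a point estimate or a credible interval.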