Approximate learning of parsimonious Bayesian context trees

Models for categorical sequences typically assume exchangeable or first-order dependent sequence elements. These are common assumptions, for example, in models of computer malware traces and protein sequences. Although such simplifying assumptions lead to computational tractability, these models fail to capture long-range, complex dependence structures that may be harnessed for greater predictive power. To this end, a Bayesian modelling framework is proposed to parsimoniously capture rich dependence structures in categorical sequences, with memory efficiency suitable for real-time processing of data streams. Parsimonious Bayesian context trees are introduced as a form of variable-order Markov model with conjugate prior distributions. The novel framework requires fewer parameters than fixed-order Markov models by dropping redundant dependencies and clustering sequential contexts. Approximate inference on the context tree structure is performed via a computationally efficient model-based agglomerative clustering procedure. The proposed framework is tested on synthetic and real-world data examples, and it outperforms existing sequence models when fitted to real protein sequences and honeypot computer terminal sessions.
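The clustering idea in the abstract can be illustrated with a small Python sketch. It is not the paper's algorithm: it assumes a fixed context order, a symmetric Dirichlet prior with concentration `alpha` on each cluster's next-symbol distribution, and a greedy rule that merges two groups of contexts whenever doing so raises the total log marginal likelihood. The function names (`log_marginal`, `context_counts`, `agglomerate`) are hypothetical, and no prior over the tree structure itself is included.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import lgamma


def log_marginal(counts, alphabet, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of next-symbol counts.

    Assumes a symmetric Dirichlet(alpha) prior over the alphabet
    (an illustrative choice, not taken from the paper).
    """
    k = len(alphabet)
    n = sum(counts.values())
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for symbol in alphabet:
        out += lgamma(alpha + counts.get(symbol, 0)) - lgamma(alpha)
    return out


def context_counts(seq, order=1):
    """Count which symbol follows each length-`order` context in `seq`."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[tuple(seq[i - order:i])][seq[i]] += 1
    return counts


def agglomerate(counts, alphabet, alpha=1.0):
    """Greedily merge contexts while a merge increases the marginal likelihood.

    Each cluster shares one next-symbol distribution, so merging contexts
    with similar behaviour reduces the number of parameters.
    """
    clusters = {frozenset([ctx]): Counter(c) for ctx, c in counts.items()}
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(clusters, 2):
            merged = clusters[a] + clusters[b]
            gain = (
                log_marginal(merged, alphabet, alpha)
                - log_marginal(clusters[a], alphabet, alpha)
                - log_marginal(clusters[b], alphabet, alpha)
            )
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge improves the score: stop
            break
        a, b = best_pair
        clusters[a | b] = clusters.pop(a) + clusters.pop(b)
    return clusters


# Toy counts: contexts "a" and "b" are always followed by "x", "c" by "y".
toy = {
    ("a",): Counter({"x": 50}),
    ("b",): Counter({"x": 50}),
    ("c",): Counter({"y": 50}),
}
result = agglomerate(toy, alphabet=["x", "y"])
```

In this toy input, contexts `a` and `b` predict the same next symbol and end up in one cluster, while `c` stays separate, so two next-symbol distributions are kept instead of three. A fuller treatment would also compare different context depths and place a prior on the tree, which this sketch omits.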
