A new perspective is introduced on the analysis of multiple sequence alignments (MSAs), that is, aligned data defined over a finite alphabet of symbols. The framework produces a block decomposition of an MSA in which each block consists of sequences exhibiting a certain site-coherence. Its key component is an information-theoretic potential defined on pairs of sites (links) within the MSA. This potential quantifies the expected drop in variation of information between the two constituent sites, where the expectation is taken over all possible sub-alignments obtained by removing a fixed, finite collection of rows. It is proved that the potential is zero for links whose two columns have symbols in bijective correspondence, and strictly positive otherwise. It is furthermore shown that the potential assumes its unique minimum for links at which each symbol pair appears with the same multiplicity. Finally, an application to anomaly detection is presented for an MSA composed of inverse fold solutions of a fixed tRNA secondary structure, where the anomalies are inverse fold solutions of a different RNA structure.
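To make the central quantity concrete, the following is a minimal sketch of an "expected drop in variation of information" for a single link. It is not taken from the paper: the function names, the reading of "removing a fixed collection of rows" as "removing every possible set of exactly `k` rows with equal weight", the use of base-2 logarithms, and the sign convention (full alignment minus sub-alignment) are all assumptions made for illustration. The potential defined in the paper may differ in normalization or sign.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in Counter(symbols).values())


def variation_of_information(col_x, col_y):
    """VI(X, Y) = H(X|Y) + H(Y|X) = 2 H(X, Y) - H(X) - H(Y)."""
    joint = list(zip(col_x, col_y))
    return 2 * entropy(joint) - entropy(col_x) - entropy(col_y)


def expected_vi_drop(col_x, col_y, k):
    """Average of VI(full) - VI(sub) over all ways of removing k rows.

    Assumption: every k-subset of rows is removed with equal probability.
    The enumeration is exhaustive, so this is only practical for small inputs.
    """
    n = len(col_x)
    if len(col_y) != n:
        raise ValueError("columns must have the same number of rows")
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of rows")
    full = variation_of_information(col_x, col_y)
    drops = []
    for removed in combinations(range(n), k):
        gone = set(removed)
        sub_x = [s for i, s in enumerate(col_x) if i not in gone]
        sub_y = [s for i, s in enumerate(col_y) if i not in gone]
        drops.append(full - variation_of_information(sub_x, sub_y))
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # Columns in bijective correspondence (A<->U, C<->G, G<->C).
    print(expected_vi_drop("AACG", "UUGC", k=1))
    # Columns whose symbols are not in bijective correspondence.
    print(expected_vi_drop("AABB", "ABAB", k=1))
```

Under these assumptions, two columns in bijective correspondence have variation of information zero in the full alignment and in every sub-alignment, so the sketch returns zero for them, while the second pair of columns yields a positive value. This is consistent with the zero and strict-positivity statements above, but it should not be read as a verification of the paper's results.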