Factorization Machine with Quadratic-Optimization Annealing for RNA Inverse Folding and Evaluation of Binary-Integer Encoding and Nucleotide Assignment

Shuta Kikuchi, Shu Tanaka

From arXiv, 17 pages, 10 figures

The RNA inverse folding problem aims to identify nucleotide sequences that preferentially adopt a given target secondary structure. While various heuristic and machine learning-based approaches have been proposed, many require a large number of sequence evaluations, which limits their applicability when experimental validation is costly. We propose a method to solve the problem using a factorization machine with quadratic-optimization annealing (FMQA). FMQA is a discrete black-box optimization method reported to obtain high-quality solutions with a limited number of evaluations. Applying FMQA to the problem requires converting nucleotides into binary variables. However, the influence of integer-to-nucleotide assignments and binary-integer encoding on the performance of FMQA has not been thoroughly investigated, even though such choices determine the structure of the surrogate model and the search landscape, and thus can directly affect solution quality. Therefore, this study aims both to establish a novel FMQA framework for RNA inverse folding and to analyze the effects of these assignments and encoding methods. We evaluated all 24 possible assignments of the four nucleotides to the ordered integers (0-3), in combination with four binary-integer encoding methods. Our results demonstrated that one-hot and domain-wall encodings outperform binary and unary encodings in terms of the normalized ensemble defect value. In domain-wall encoding, nucleotides assigned to the boundary integers (0 and 3) appeared with higher frequency. In the RNA inverse folding problem, assigning guanine and cytosine to these boundary integers promoted their enrichment in stem regions, which led to more thermodynamically stable secondary structures than those obtained with one-hot encoding.
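
To make the two design choices in the abstract concrete, the following is a minimal sketch of the four binary-integer encodings and the 24 integer-to-nucleotide assignments. The abstract does not define the encodings, so the definitions here are the ones commonly used for integer variables and are an assumption: binary uses 2 bits, unary and domain-wall use 3 bits, and one-hot uses 4 bits for the integers 0 to 3. All function and variable names (`encode`, `decode`, `encode_sequence`, `ASSIGNMENTS`) are hypothetical and are not taken from the paper.

```python
from itertools import permutations

NUCLEOTIDES = "ACGU"
METHODS = ("binary", "unary", "one-hot", "domain-wall")

# Every ordering of the four nucleotides; assignment[k] is the nucleotide
# mapped to integer k, so there are 4! = 24 assignments.
ASSIGNMENTS = list(permutations(NUCLEOTIDES))


def encode(k, method, levels=4):
    """Encode integer k in [0, levels) as a list of bits (assumed definitions)."""
    if not 0 <= k < levels:
        raise ValueError(f"k must be in [0, {levels}), got {k}")
    if method == "binary":
        width = (levels - 1).bit_length()
        return [(k >> i) & 1 for i in reversed(range(width))]
    if method in ("unary", "domain-wall"):
        # Canonical form: the first k bits are 1, the rest are 0.
        return [1 if i < k else 0 for i in range(levels - 1)]
    if method == "one-hot":
        return [1 if i == k else 0 for i in range(levels)]
    raise ValueError(f"unknown method: {method}")


def decode(bits, method):
    """Decode bits to an integer, or return None for an invalid bit pattern."""
    if method == "binary":
        return int("".join(str(b) for b in bits), 2)
    if method == "unary":
        # Any arrangement is valid; the value is the number of ones.
        return sum(bits)
    if method == "domain-wall":
        # Valid only if all ones precede all zeros (a single domain wall).
        if any(a < b for a, b in zip(bits, bits[1:])):
            return None
        return sum(bits)
    if method == "one-hot":
        return bits.index(1) if sum(bits) == 1 else None
    raise ValueError(f"unknown method: {method}")


def encode_sequence(seq, assignment, method):
    """Flatten an RNA sequence into one bit list under a given assignment."""
    to_int = {nt: k for k, nt in enumerate(assignment)}
    return [bit for nt in seq for bit in encode(to_int[nt], method)]


if __name__ == "__main__":
    print(len(ASSIGNMENTS))  # 24
    # G and C placed on the boundary integers 0 and 3.
    gc_boundary = ("G", "A", "U", "C")
    for method in METHODS:
        print(method, encode_sequence("GCAU", gc_boundary, method))
```

Under these assumed definitions, the boundary integers 0 and 3 correspond to the all-zeros and all-ones patterns in domain-wall encoding, which is where the choice of assignment enters. The sketch covers only the encoding step; the surrogate model, the annealing step, and the evaluation of the normalized ensemble defect are outside its scope.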
