Safe Paths and Sequences for Scalable ILPs in RNA Transcript Assembly Problems

A common step at the core of many RNA transcript assembly tools is to find a set of weighted paths that best explain the weights of a DAG. While such problems easily become NP-hard, scalable solvers exist only for a basic error-free version of this problem, namely minimally decomposing a network flow into weighted paths. The main result of this paper is that speedups of two orders of magnitude can also be achieved for path-finding problems in the realistic setting (i.e., where the weights do not induce a flow). We obtain these speedups by exploiting, inside Integer Linear Programming (ILP) solvers for these problems, the safety information encoded in the graph structure. First, we characterize the paths that appear in all path covers of the DAG, generalizing a graph reduction commonly used in the error-free setting (e.g., by Kloster et al. [ALENEX~2018]). Second, following the work of Ma, Zheng and Kingsford [RECOMB 2021], we characterize the \emph{sequences} of arcs that appear in all path covers of the DAG. We experiment with a path-finding ILP model based on least squares and with a more recent and more accurate one. We use a variety of datasets originally created by Shao and Kingsford [TCBB, 2017], as well as graphs built from sequencing reads by the state-of-the-art tool for long-read transcript discovery, IsoQuant [Prjibelski et al., Nat.~Biotechnology~2023]. The ILPs armed with safe paths or sequences exhibit significant speedups over the original ones. On graphs with a large width, average speedups are in the range $50$--$160\times$ for the more recent ILP model and in the range $100$--$1000\times$ for the least squares model. Our scaling techniques apply to any ILP whose solution paths are a path cover of the arcs of the DAG. As such, they can become a scalable building block of practical RNA transcript assembly tools, avoiding the heuristic trade-offs currently needed on complex graphs.
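To make the notion of a path that appears in all path covers concrete, the following Python sketch may help. It is not the characterization developed in the paper. It implements only a simple sufficient condition, under the assumption that every path in a cover runs from a source (no incoming arcs) to a sink (no outgoing arcs) of a DAG without parallel arcs. Since every arc must be covered, an arc can be extended forward while its last node has exactly one outgoing arc, and backward while its first node has exactly one incoming arc; the resulting path is then a subpath of some path in every path cover. The function name `forced_extension_paths` and the toy graph are hypothetical.

```python
from collections import defaultdict


def forced_extension_paths(arcs):
    """Return paths obtained by extending each arc while the next step is forced.

    Assumptions (not taken from the paper): `arcs` is a list of (tail, head)
    pairs of a DAG without parallel arcs, and every path of a path cover
    starts at a source and ends at a sink.
    """
    out_nbrs = defaultdict(list)
    in_nbrs = defaultdict(list)
    for tail, head in arcs:
        out_nbrs[tail].append(head)
        in_nbrs[head].append(tail)

    paths = set()
    for tail, head in arcs:
        path = [tail, head]
        # Forward: a node with exactly one outgoing arc forces the continuation.
        while len(out_nbrs[path[-1]]) == 1:
            path.append(out_nbrs[path[-1]][0])
        # Backward: a node with exactly one incoming arc forces the predecessor.
        while len(in_nbrs[path[0]]) == 1:
            path.insert(0, in_nbrs[path[0]][0])
        paths.add(tuple(path))
    return paths


# Toy DAG: s -> a, then a branches to b and c, which both reach t.
toy_arcs = [("s", "a"), ("a", "b"), ("b", "t"), ("a", "c"), ("c", "t")]
toy_paths = forced_extension_paths(toy_arcs)
```

On this toy graph, the arc `("a", "b")` extends to the full path `s, a, b, t`, because `b` has a single outgoing arc and `a` has a single incoming arc, whereas the arc `("s", "a")` cannot be extended past the branching node `a`.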
