Enhancing non-Perl bioinformatic applications with Perl: Building novel, component based applications using Object Orientation, PDL, Alien, FFI, Inline and OpenMP

Perl · Bioinformatics · FAST · CASE · Processing (programming language)

June 11, 2024


Christos Argyropoulos

From arXiv, 36 pages, 8 figures

Component-Based Software Engineering (CBSE) is a methodology that assembles pre-existing, re-usable software components into new applications, which is particularly relevant for fast-moving, data-intensive fields such as bioinformatics. While Perl was used extensively in this field until a decade ago, more recent applications opt for Bioconductor/R or Python. This trend represents a significant missed opportunity for the rapid generation of novel bioinformatic applications out of pre-existing components, since Perl offers a variety of abstractions that can facilitate composition. In this paper, we illustrate the utility of Perl for CBSE through a combination of Object Oriented frameworks, the Perl Data Language (PDL), and facilities for interfacing with non-Perl code through Foreign Function Interfaces and inlining of foreign source code. To do so, we enhance Polyester, an RNA sequencing simulator written in R, and edlib, a fast sequence similarity search library based on the edit distance. The first case study illustrates the near-effortless authoring of new, highly performant Perl modules for the simulation of random numbers using the GNU Scientific Library and PDL, and proposes Perl and Perl/C alternatives to the Python tool cutadapt, which is used to "trim" polyA tails from biological sequences. For the edlib case, we leverage the power of metaclass programming to endow edlib with coarse, process-based parallelism through the Many Core Engine (MCE) module, and fine-grained parallelism through OpenMP, a C/C++/Fortran Application Programming Interface for shared-memory multithreaded processing. These use cases provide proof-of-concept for the Bio::SeqAlignment framework, which can organize heterogeneous components in complex memory- and command-line-based workflows for the construction of novel bioinformatic tools to analyze data from long-read sequencing platforms, e.g. Nanopore.

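For readers unfamiliar with the two sequence operations named in the abstract, the following is a minimal, language-neutral sketch of what "edit distance" and polyA "trimming" mean. It is not the paper's Perl or Perl/C code, nor the implementation used by edlib or cutadapt. The function names `edit_distance` and `trim_polya` and the `min_len` threshold are hypothetical, chosen only for illustration. The trimming here matches an exact run of trailing A bases, whereas production tools typically tolerate sequencing errors.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the textbook two-row dynamic programme.

    Illustrative only: quadratic time, no claim about edlib's internals.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def trim_polya(seq: str, min_len: int = 10) -> str:
    """Strip a trailing run of at least min_len 'A' bases (exact match).

    min_len is an assumed default for this sketch, not a value from the paper.
    """
    return re.sub(r"A{%d,}$" % min_len, "", seq)


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(trim_polya("ACGT" + "A" * 12))
```

The two-row table keeps memory proportional to the shorter dimension, which is adequate for a demonstration; a similarity search over many long reads is exactly the workload where a tuned library plus process-level or thread-level parallelism, as described in the abstract, becomes worthwhile.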