Motif discovery is a core problem in computational biology, traditionally formulated as a likelihood optimization task that returns a single dominant motif from a DNA sequence dataset. However, regulatory sequence data admit multiple plausible motif explanations, reflecting underlying biological heterogeneity. In this work, we frame motif discovery as a quality-diversity problem and apply the MAP-Elites algorithm to evolve position weight matrix motifs under a likelihood-based fitness objective while explicitly preserving diversity across biologically meaningful dimensions. We evaluate MAP-Elites using three complementary behavioral characterizations that capture trade-offs between motif specificity, compositional structure, coverage, and robustness. Experiments on human CTCF liver ChIP-seq data aligned to the human reference genome compare MAP-Elites against a standard motif discovery tool, MEME, under matched evaluation criteria across stratified dataset subsets. Results show that MAP-Elites recovers multiple high-quality motif variants with fitness comparable to MEME's strongest solutions while revealing structured diversity obscured by single-solution approaches.
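To make the quality-diversity framing concrete, the following is a minimal sketch of a MAP-Elites loop over position weight matrices, not the implementation used in this work. Several choices are assumptions made purely for illustration: the fitness is the mean best log-odds score per sequence against a uniform background, the two behavioral descriptors (normalized mean information content and GC fraction of the matrix) are hypothetical stand-ins for the three characterizations mentioned above, and all function names, the grid resolution, and the mutation scheme are invented here.

```python
import math
import random

ALPHABET = "ACGT"


def normalize(col, floor=1e-3):
    """Clamp entries to a small positive floor, then rescale to sum to 1."""
    col = [max(x, floor) for x in col]
    total = sum(col)
    return [x / total for x in col]


def random_pwm(width, rng):
    """Random PWM: each column is a normalized draw of four exponentials."""
    return [normalize([rng.expovariate(1.0) for _ in ALPHABET]) for _ in range(width)]


def fitness(pwm, seqs):
    """Assumed likelihood-style score: mean over sequences of the best
    log2-odds window score against a uniform (0.25) background.
    Sequences must contain only A/C/G/T and be at least as long as the PWM."""
    width = len(pwm)
    total = 0.0
    for s in seqs:
        best = -math.inf
        for i in range(len(s) - width + 1):
            score = sum(
                math.log2(pwm[j][ALPHABET.index(s[i + j])] / 0.25)
                for j in range(width)
            )
            best = max(best, score)
        total += best
    return total / len(seqs)


def descriptor(pwm):
    """Hypothetical 2-D behavior descriptor, each coordinate in [0, 1]:
    (mean information content / 2 bits, GC fraction of the matrix)."""
    ic = sum(2.0 + sum(p * math.log2(p) for p in col) for col in pwm) / len(pwm)
    gc = sum(col[1] + col[2] for col in pwm) / len(pwm)
    return ic / 2.0, gc


def cell(desc, bins):
    """Map a descriptor in [0, 1]^2 to a grid cell index."""
    return tuple(min(max(int(d * bins), 0), bins - 1) for d in desc)


def mutate(pwm, rng, sigma=0.2):
    """Perturb one randomly chosen column with Gaussian noise and renormalize."""
    child = [col[:] for col in pwm]
    j = rng.randrange(len(child))
    child[j] = normalize([p + rng.gauss(0.0, sigma) for p in child[j]])
    return child


def map_elites(seqs, width=6, bins=5, n_init=50, n_iter=2000, seed=0):
    """Return an archive mapping grid cell -> (fitness, pwm) of the best
    candidate seen in that cell."""
    rng = random.Random(seed)
    archive = {}
    for it in range(n_init + n_iter):
        if it < n_init or not archive:
            cand = random_pwm(width, rng)  # random initialization phase
        else:
            _, parent = archive[rng.choice(sorted(archive))]  # uniform elite selection
            cand = mutate(parent, rng)
        f = fitness(cand, seqs)
        key = cell(descriptor(cand), bins)
        # Keep the candidate only if its cell is empty or it beats the current elite.
        if key not in archive or f > archive[key][0]:
            archive[key] = (f, cand)
    return archive


if __name__ == "__main__":
    # Toy sequences for illustration only; these are not real ChIP-seq data.
    toy = ["ACGTCCACTAGGTGGCA", "TTCCACTAGGTGGCGAA", "GGCCACCAGGGGGCGCT"]
    archive = map_elites(toy, n_iter=500)
    best_f, _ = max(archive.values(), key=lambda e: e[0])
    print(len(archive), "cells filled; best fitness", round(best_f, 3))
```

The point of the sketch is the archive: rather than returning one top-scoring matrix, the loop keeps the best candidate per descriptor cell, so motifs that differ in specificity or composition can coexist as long as each is the elite of its own cell.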