Motif discovery is a core problem in computational biology, traditionally formulated as a likelihood optimization task that returns a single dominant motif from a DNA sequence dataset. However, regulatory sequence data admit multiple plausible motif explanations, reflecting underlying biological heterogeneity. In this work, we frame motif discovery as a quality-diversity problem and apply the MAP-Elites algorithm to evolve position weight matrix motifs under a likelihood-based fitness objective while explicitly preserving diversity across biologically meaningful dimensions. We evaluate MAP-Elites using three complementary behavioral characterizations that capture trade-offs between motif specificity, compositional structure, coverage, and robustness. Experiments on human CTCF liver ChIP-seq data aligned to the human reference genome compare MAP-Elites against a standard motif discovery tool, MEME, under matched evaluation criteria across stratified dataset subsets. Results show that MAP-Elites recovers multiple high-quality motif variants with fitness comparable to MEME's strongest solutions while revealing structured diversity obscured by single-solution approaches.
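To make the quality-diversity framing concrete, the following is a minimal sketch of a MAP-Elites loop over position weight matrices. It is not the implementation evaluated in this work. The fitness (mean best-window log-likelihood ratio against a uniform background), the two descriptors (mean information content and GC fraction), the mutation operator, the grid resolution, and every function and parameter name are assumptions chosen for illustration. The toy sequences are synthetic and stand in for real ChIP-seq peaks.

```python
import math
import random

ALPHABET = "ACGT"  # sequences are assumed to contain only these four letters


def random_pwm(width, rng):
    """Draw a random PWM: one probability row over A, C, G, T per position."""
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.05 for _ in ALPHABET]
        total = sum(weights)
        pwm.append([w / total for w in weights])
    return pwm


def mutate(pwm, rng, scale=0.3):
    """Perturb one randomly chosen position with Gaussian noise and renormalize it."""
    child = [row[:] for row in pwm]
    pos = rng.randrange(len(child))
    noisy = [max(1e-3, p + rng.gauss(0.0, scale)) for p in child[pos]]
    total = sum(noisy)
    child[pos] = [p / total for p in noisy]
    return child


def fitness(pwm, sequences):
    """Assumed objective: mean over sequences of the best-window log2 likelihood
    ratio between the PWM and a uniform background."""
    width = len(pwm)
    total = 0.0
    for seq in sequences:
        best = -math.inf
        for start in range(len(seq) - width + 1):
            score = sum(
                math.log2(pwm[i][ALPHABET.index(seq[start + i])] / 0.25)
                for i in range(width)
            )
            best = max(best, score)
        total += best
    return total / len(sequences)


def descriptors(pwm):
    """Illustrative descriptors, both in [0, 1]: mean information content
    (as a fraction of the 2-bit maximum) and GC fraction of the PWM."""
    ic = sum(2.0 + sum(p * math.log2(p) for p in row if p > 0) for row in pwm)
    mean_ic = ic / (2.0 * len(pwm))
    gc = sum(row[1] + row[2] for row in pwm) / len(pwm)
    return mean_ic, gc


def cell_of(desc, bins):
    """Map a descriptor vector in [0, 1]^d to a grid cell index."""
    return tuple(min(bins - 1, int(d * bins)) for d in desc)


def map_elites(sequences, width=6, bins=5, n_init=50, n_iter=500, seed=0):
    """Return an archive mapping each occupied cell to its best (fitness, pwm)."""
    rng = random.Random(seed)
    archive = {}

    def try_insert(pwm):
        fit = fitness(pwm, sequences)
        cell = cell_of(descriptors(pwm), bins)
        # Keep the candidate only if its cell is empty or it beats the incumbent.
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, pwm)

    for _ in range(n_init):
        try_insert(random_pwm(width, rng))
    for _ in range(n_iter):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


# Toy usage on synthetic sequences with a planted "CCGCGA" site.
toy_rng = random.Random(42)
toy_sequences = []
for _ in range(8):
    flank = "".join(toy_rng.choice(ALPHABET) for _ in range(14))
    toy_sequences.append(flank[:7] + "CCGCGA" + flank[7:])

archive = map_elites(toy_sequences)
for cell, (fit, _) in sorted(archive.items()):
    print(cell, round(fit, 3))
```

Unlike a single-solution optimizer, the loop returns one elite per occupied descriptor cell, so motif variants that differ in specificity or composition are retained side by side rather than collapsed to the single highest-scoring matrix.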