There is a growing body of work seeking to extend the success of machine learning (ML) in domains like computer vision (CV) and natural language processing (NLP) to applications involving biophysical data. One of the key ingredients of prior successes in CV and NLP was the broad acceptance of difficult benchmarks that distilled key subproblems into approachable tasks that any junior researcher could investigate, but good benchmarks for biophysical domains are rare. This scarcity is partially due to a narrow focus on benchmarks that simulate biophysical data; we propose instead to carefully abstract biophysical problems into simpler ones with key geometric similarities. In particular, we propose a new class of closed-form test functions for biophysical sequence optimization, which we call Ehrlich functions. We provide empirical results demonstrating that these functions are interesting objects of study and can be non-trivial to solve with a standard genetic optimization baseline.