Experimental design techniques such as active search and Bayesian optimization are widely used in the natural sciences for data collection and discovery. However, existing techniques tend to favor exploitation over exploration of the search space, which causes them to get stuck in local optima. This ``collapse'' problem prevents experimental design algorithms from yielding diverse high-quality data. In this paper, we extend the Vendi scores -- a family of interpretable similarity-based diversity metrics -- to account for quality. We then leverage these quality-weighted Vendi scores to tackle experimental design problems across various applications, including drug discovery, materials discovery, and reinforcement learning. We found that quality-weighted Vendi scores allow us to construct policies for experimental design that flexibly balance quality and diversity, and ultimately assemble rich and diverse sets of high-performing data points. Our algorithms led to a 70%-170% increase in the number of effective discoveries compared to baselines.
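To make the central quantity concrete, the following is a minimal sketch, not the implementation used in this paper. It computes the Vendi score as the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n-by-n positive semidefinite similarity matrix with ones on the diagonal. The quality weighting shown here, the mean quality of the set multiplied by its Vendi score, is an assumed form chosen for illustration; the function names `vendi_score` and `quality_weighted_vendi_score` and the tolerance `1e-12` are likewise hypothetical.

```python
import numpy as np


def vendi_score(K):
    """Vendi score: exp of the Shannon entropy of the eigenvalues of K / n.

    K is assumed to be a symmetric positive semidefinite similarity matrix
    with K[i, i] == 1, so the eigenvalues of K / n sum to 1.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop (numerically) zero eigenvalues; 0 * log(0) is taken to be 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


def quality_weighted_vendi_score(K, quality):
    """Assumed form for illustration: mean quality times the Vendi score."""
    quality = np.asarray(quality, dtype=float)
    return float(quality.mean()) * vendi_score(K)


# Four mutually dissimilar points: the Vendi score is approximately 4.
K_diverse = np.eye(4)
# Four identical points: the Vendi score is approximately 1 (a "collapsed" set).
K_collapsed = np.ones((4, 4))

quality = np.array([0.5, 1.0, 0.5, 1.0])
print(vendi_score(K_diverse), vendi_score(K_collapsed))
print(quality_weighted_vendi_score(K_diverse, quality))
```

Under this assumed form, a set scores highly only when its points are both dissimilar from one another and of high average quality, which is the balance between quality and diversity that the abstract describes.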