We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, to limit overhead costs, can update its state (i.e., further train model weights) only once a large enough batch of examples has been selected. Our algorithm, IWeS, selects examples by importance sampling, where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS achieves significant performance improvements over other subset selection algorithms on seven publicly available datasets. Additionally, it is competitive in an active learning setting, where label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.
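To make the selection rule concrete, the following is a minimal sketch of entropy-based importance sampling in a batch setting. It is an illustration under stated assumptions, not the IWeS procedure itself: the abstract does not give the exact sampling probability, so the sketch assumes the probability is the predictive entropy normalized by its maximum and clipped below by a hypothetical `floor` parameter. The function names `entropy_sampling_probs` and `select_batch` are likewise illustrative.

```python
import numpy as np


def entropy_sampling_probs(probs, floor=1e-3):
    """Map class probabilities of shape (n, k) to sampling probabilities.

    Assumption: sampling probability = entropy / log(k), clipped to
    [floor, 1]. The abstract only says the probability is entropy-based.
    """
    probs = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    entropy = -np.sum(probs * np.log(probs), axis=1)
    max_entropy = np.log(probs.shape[1])
    return np.clip(entropy / max_entropy, floor, 1.0)


def select_batch(probs, batch_size, rng):
    """Visit examples one at a time in random order and keep each with its
    sampling probability, stopping once batch_size examples are kept.

    Returns the kept indices and their importance weights 1 / p.
    """
    p = entropy_sampling_probs(probs)
    chosen, weights = [], []
    for i in rng.permutation(len(p)):
        if rng.random() < p[i]:
            chosen.append(int(i))
            weights.append(1.0 / p[i])  # importance weight for this example
        if len(chosen) == batch_size:
            break
    return np.array(chosen, dtype=int), np.array(weights)
```

In a full loop, the model would be retrained on all examples selected so far, with each example's loss scaled by its importance weight, and the updated model's predictions would supply `probs` for the next batch. The model state changes only between batches, which matches the batch constraint described above.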