We refine and generalize what is known about coresets for classification problems via the sensitivity sampling framework. Such coresets seek the smallest possible subsets of the input data such that optimizing a loss function on the coreset ensures approximation guarantees with respect to the original data. Our analysis provides the first no-dimensional coresets, meaning that the coreset size does not depend on the dimension. Moreover, our results are general: they apply to distributional input and can use i.i.d. samples, thereby providing sample complexity bounds, and they work for a variety of loss functions. A key tool we develop is a Rademacher complexity version of the main sensitivity sampling approach, which may be of independent interest.