We study the problem of learning a hierarchical tree representation of data from labeled samples drawn from an arbitrary (and possibly adversarial) distribution. Consider a collection of data tuples labeled according to their hierarchical structure. The smallest number of such tuples needed to accurately label subsequent tuples is of interest for data collection in machine learning. We present optimal sample complexity bounds for this problem in several learning settings, including (agnostic) PAC learning and online learning. Our results are based on tight bounds on the Natarajan and Littlestone dimensions of the associated problem. The corresponding tree classifiers can be constructed in near-linear time.