Open-vocabulary learning requires modeling the data distribution in open environments, which contain both seen-class and unseen-class data. Existing methods estimate this distribution using seen-class data alone, so the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial for bounding the distribution-estimation error. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method that generates unseen-class data to estimate the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The generation pipeline produces unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior probability to enhance generalization in open-vocabulary learning. Extensive experiments on $11$ datasets show that our method outperforms baseline approaches by up to $14\%$, demonstrating its effectiveness and superiority.