Learning from Limited and Imperfect Data

The datasets used to train deep neural networks (e.g., ImageNet and MSCOCO) are often manually balanced across categories (classes) so that all categories can be learned. This curation is expensive and requires discarding valuable annotated data to equalize class frequencies, because the distribution of data in the real world (e.g., on the internet) differs significantly from that of well-curated datasets and is dominated by samples from common categories. Algorithms designed for well-curated datasets perform suboptimally when trained on imperfect datasets with long-tailed imbalance and distribution shift. For deep models to be widely used, it is necessary to do away with the costly curation process by developing robust algorithms that can learn from real-world data distributions. Toward this goal, we develop practical algorithms for deep neural networks that can learn from the limited and imperfect data present in the real world. This work is divided into four parts, each covering one scenario of learning from limited or imperfect data. The first part focuses on Learning Generative Models for Long-Tail Data, where we mitigate mode collapse for tail (minority) classes and enable image generation for them that is as diverse and aesthetically pleasing as for head (majority) classes. In the second part, we enable effective generalization on tail classes through Inductive Regularization schemes, which allow tail classes to generalize as well as head classes without requiring explicit generation of images. In the third part, we develop algorithms for Optimizing Relevant Metrics, rather than average accuracy, when learning from long-tailed data with limited annotation (the semi-supervised setting). The fourth part focuses on effective domain adaptation of the model to various domains with zero to very few labeled samples.
