Learning with limited data is one of the biggest problems in machine learning. Current approaches to this issue consist of learning general representations from huge amounts of data before fine-tuning the model on a small dataset of interest. While this technique, known as transfer learning, is very effective in domains such as computer vision and natural language processing, it does not yet solve common problems of deep learning such as model interpretability or the overall need for data. This thesis explores a different answer to the problem of learning expressive models in data-constrained settings: instead of relying on big datasets to learn neural networks, we replace some modules with known functions that reflect the structure of the data. Very often, these functions are drawn from the rich literature on kernel methods. Indeed, many kernels can reflect the underlying structure of the data, which reduces the number of parameters that must be learned. Our approach falls under the umbrella of "inductive biases", which can be defined as hypotheses about the data at hand that restrict the space of models to explore during learning. We demonstrate the effectiveness of this approach in the context of sequences, such as sentences in natural language or protein sequences, and graphs, such as molecules. We also highlight the relationship between our work and recent advances in deep learning. Additionally, we study convex machine learning models. Here, rather than proposing new models, we ask what proportion of the samples in a dataset is really needed to learn a "good" model. More precisely, we study the problem of safe sample screening, i.e., executing simple tests to discard uninformative samples from a dataset even before fitting a machine learning model, without affecting the optimal model. Such techniques can be used to prune datasets or mine for rare samples.