We propose two frameworks for problem settings in which both structured and unstructured data are available. Structured data problems are best solved by traditional machine learning models such as boosting and other tree-based algorithms, whereas deep learning has been widely applied to images, text, audio, and other unstructured data sources. When both kinds of data are accessible, however, it is not obvious which modeling approach best exploits the two sources simultaneously. Our proposed frameworks enable joint learning on both kinds of data by integrating the paradigms of boosting models and deep neural networks. The first framework, the boosted-feature-vector deep learning network, learns features from the structured data using gradient boosting and combines them with embeddings from the unstructured data via a two-branch deep neural network. The second framework, two-weak-learner boosting, extends the boosting paradigm to the setting with two input data sources. We present and compare first- and second-order methods for this second framework. Experimental results on both public and real-world datasets show that the frameworks achieve performance gains of 0.1% to 4.7% over selected baselines.
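To make the two-weak-learner idea concrete, the following is a minimal first-order sketch and not the authors' implementation. The abstract does not specify how the two weak learners are fitted or combined, so several things here are assumptions: the loss is squared error (so the negative gradient is the residual), each round fits one weak learner per data source in sequence, and both weak learners are regression stumps purely to keep the example dependency-free. A fixed embedding vector stands in for the unstructured input. All function names (`fit_stump`, `fit_two_source_boosting`, `make_toy_data`, and so on) are hypothetical.

```python
import random


def fit_stump(X, r):
    """Fit a one-split regression stump to targets r.

    Returns (feature, threshold, left_value, right_value).
    """
    n, d = len(X), len(X[0])
    mean_r = sum(r) / n
    # Fallback is a constant predictor: "x <= -inf" is never true,
    # so every row receives the right-hand value (the mean).
    best = (float("inf"), 0, float("-inf"), mean_r, mean_r)
    for j in range(d):
        # Dropping the largest value guarantees both sides are non-empty.
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
            if sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]


def predict_stump(stump, x):
    j, t, lv, rv = stump
    return lv if x[j] <= t else rv


def fit_two_source_boosting(Xs, Xu, y, rounds=20, lr=0.5):
    """Assumed scheme: each round adds one weak learner per data source."""
    n = len(y)
    f0 = sum(y) / n
    F = [f0] * n  # current ensemble prediction for every training row
    stages = []
    for _ in range(rounds):
        pair = []
        for X in (Xs, Xu):  # structured source first, then unstructured
            # Negative gradient of 0.5 * (y - F)^2 with respect to F.
            residual = [y[i] - F[i] for i in range(n)]
            stump = fit_stump(X, residual)
            F = [F[i] + lr * predict_stump(stump, X[i]) for i in range(n)]
            pair.append(stump)
        stages.append(tuple(pair))
    return {"f0": f0, "lr": lr, "rounds": stages}


def predict(model, xs, xu):
    out = model["f0"]
    for stump_s, stump_u in model["rounds"]:
        out += model["lr"] * predict_stump(stump_s, xs)
        out += model["lr"] * predict_stump(stump_u, xu)
    return out


def mse(model, Xs, Xu, y):
    return sum((predict(model, a, b) - v) ** 2 for a, b, v in zip(Xs, Xu, y)) / len(y)


def make_toy_data(n, seed=0):
    """Synthetic data: the target depends on one feature from each source."""
    rng = random.Random(seed)
    Xs = [[rng.random(), rng.random()] for _ in range(n)]            # "structured"
    Xu = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]      # stand-in embedding
    y = [2.0 * (a[0] > 0.5) + b[1] + rng.gauss(0, 0.1) for a, b in zip(Xs, Xu)]
    return Xs, Xu, y


Xs, Xu, y = make_toy_data(60)
model = fit_two_source_boosting(Xs, Xu, y)
print("training MSE:", mse(model, Xs, Xu, y))
```

A second-order variant would additionally use the loss curvature when fitting each weak learner. For squared error the second derivative is constant, so the two variants would coincide on this toy loss; a loss such as logistic loss would be needed to see a difference.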