We propose a simple approach which combines the strengths of probabilistic graphical models and deep learning architectures for solving the multi-label classification task, focusing specifically on image and video data. First, we show that the performance of previous approaches that combine Markov Random Fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, integer linear programming, and $\ell_1$ regularization-based structure learning. Then we propose a new modeling framework called deep dependency networks, which augments a dependency network, a model that is easy to train and learns more accurate dependencies but is limited to Gibbs sampling for inference, to the output layer of a neural network. We show that despite its simplicity, jointly learning this new architecture yields significant improvements in performance over the baseline neural network. In particular, our experimental evaluation on three video activity classification datasets: Charades, Textually Annotated Cooking Scenes (TACoS), and Wetlab, and three multi-label image classification datasets: MS-COCO, PASCAL VOC, and NUS-WIDE show that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.
翻译:我们提出了一种简单的方法,结合概率图模型与深度学习架构的优势来解决多标签分类任务,特别针对图像和视频数据。首先,我们证明,通过利用更强大的方法(如迭代联合图传播、整数线性规划和基于$\ell_1$正则化的结构学习),可以适度提升先前将马尔可夫随机场与神经网络结合的方法的性能。随后,我们提出了一种名为深度依赖网络的新建模框架,该框架将依赖网络(一种易于训练且能学习更精确依赖关系,但推理仅限于吉布斯采样的模型)增强至神经网络的输出层。研究表明,尽管架构简洁,但联合学习这一新架构能在基线神经网络的基础上显著提升性能。具体而言,我们在三个视频活动分类数据集(Charades、文本注释烹饪场景(TACoS)和Wetlab)以及三个多标签图像分类数据集(MS-COCO、PASCAL VOC和NUS-WIDE)上的实验评估表明,深度依赖网络几乎总是优于未使用依赖网络的纯神经架构。