We propose a simple approach that combines the strengths of probabilistic graphical models and deep learning architectures for multi-label classification, focusing on image and video data. First, we show that the performance of previous approaches that combine Markov Random Fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, integer linear programming, and $\ell_1$ regularization-based structure learning. We then propose a new modeling framework called deep dependency networks, which augments the output layer of a neural network with a dependency network, a model that is easy to train and learns more accurate dependencies but is limited to Gibbs sampling for inference. We show that, despite its simplicity, jointly learning this new architecture yields significant improvements in performance over the baseline neural network. In particular, our experimental evaluation on three video activity classification datasets (Charades, Textually Annotated Cooking Scenes (TACoS), and Wetlab) and three multi-label image classification datasets (MS-COCO, PASCAL VOC, and NUS-WIDE) shows that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.
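To make the inference step concrete, the following is a minimal sketch of Gibbs sampling in a dependency network placed on top of a feature vector produced by a neural network. It is an illustration under stated assumptions, not the implementation evaluated here: it assumes each label's conditional distribution is a logistic regression over the network's features and the other labels, and the names `gibbs_marginals`, `weights_x`, `weights_y`, and `biases` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gibbs_marginals(features, weights_x, weights_y, biases,
                    num_sweeps=500, burn_in=100, seed=0):
    """Estimate P(y_i = 1 | features) for every label by Gibbs sampling.

    Assumption (for illustration only): label i has the logistic conditional
        P(y_i = 1 | x, y_-i) = sigmoid(b_i + w_x[i] . x + sum_{j != i} w_y[i][j] * y_j)
    where x is the feature vector emitted by the neural network.
    """
    rng = random.Random(seed)
    n = len(biases)
    # The contribution of the network features is fixed during sampling,
    # so compute it once per label.
    evidence = [biases[i] + sum(w * f for w, f in zip(weights_x[i], features))
                for i in range(n)]
    # Initialise the labels from the feature-only conditionals.
    y = [1 if rng.random() < sigmoid(evidence[i]) else 0 for i in range(n)]
    totals = [0.0] * n
    for sweep in range(num_sweeps):
        for i in range(n):
            # Resample label i given the current values of all other labels.
            z = evidence[i] + sum(weights_y[i][j] * y[j]
                                  for j in range(n) if j != i)
            p = sigmoid(z)
            y[i] = 1 if rng.random() < p else 0
            if sweep >= burn_in:
                # Average the conditional probabilities instead of the
                # sampled 0/1 values to reduce the variance of the estimate.
                totals[i] += p
    kept = num_sweeps - burn_in
    return [t / kept for t in totals]


# Hypothetical example: two network features, three labels.
features = [1.5, -0.5]
weights_x = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights_y = [[0.0, 0.0, 0.0],
             [2.0, 0.0, 0.0],   # label 1 is encouraged when label 0 is on
             [0.0, -1.0, 0.0]]  # label 2 is discouraged when label 1 is on
biases = [0.0, 0.0, 0.0]
print(gibbs_marginals(features, weights_x, weights_y, biases))
```

In this sketch the label-to-label weights carry the dependencies that a purely neural output layer would ignore; setting `weights_y` to all zeros reduces the model to independent per-label sigmoid outputs.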