Deep learning approaches to natural language processing have made great strides in recent years. Although these models produce symbols that convey vast amounts of diverse knowledge, it remains unclear how such symbols are grounded in data from the world. In this paper, we explore the development of a private language for visual data representation by training emergent language (EL) encoders/decoders in i) a traditional referential game environment and ii) a contrastive learning environment that uses a within-class matching training paradigm. An additional classification layer, based on neural machine translation and random forest classification, transforms the symbolic representations (sequences of integer symbols) into class labels. We apply these methods in two experiments, one on object recognition and one on action recognition. For object recognition, we use a set of sketches produced by human participants from real imagery (Sketchy dataset); for action recognition, we use 2D trajectories generated from 3D motion capture systems (MOVI dataset). To interpret the symbols produced for the data in each experiment, we use gradient-weighted class activation mapping (Grad-CAM) to identify pixel regions indicating semantic features that contribute evidence towards symbols in the learned languages. Additionally, we use t-distributed stochastic neighbor embedding (t-SNE) to investigate the embeddings learned by the CNN feature extractors.
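To make the two training environments concrete, the following is a minimal toy sketch of a single game round, not the method used in this work. It assumes that features lie in [0, 1), replaces the learned EL encoder and decoder with a fixed quantizer and a symbol-agreement score, and reads "within-class matching" as showing the receiver a different exemplar of the target's class; the abstract does not spell out these details. All names (`sender`, `receiver`, `play_round`, `VOCAB_SIZE`, `MSG_LEN`) are hypothetical.

```python
import random

VOCAB_SIZE = 8  # hypothetical number of distinct integer symbols
MSG_LEN = 4     # hypothetical message length


def sender(features):
    """Toy stand-in for a learned EL encoder (assumption: features in [0, 1)).

    Quantizes the first MSG_LEN features into integer symbols.
    """
    return [min(int(x * VOCAB_SIZE), VOCAB_SIZE - 1) for x in features[:MSG_LEN]]


def receiver(message, candidates):
    """Toy stand-in for a learned decoder.

    Returns the index of the candidate whose own encoding agrees with the
    message in the most positions (first candidate wins ties).
    """
    def agreement(features):
        return sum(a == b for a, b in zip(message, sender(features)))

    scores = [agreement(c) for c in candidates]
    return scores.index(max(scores))


def play_round(target, match, distractors, rng):
    """Play one round and report the message and whether the receiver succeeded.

    Referential game: pass match=target, so the receiver sees the target itself.
    Within-class matching (as assumed here): pass a different exemplar of the
    target's class as match.
    """
    candidates = distractors + [match]  # new list, inputs are not mutated
    rng.shuffle(candidates)
    correct_index = candidates.index(match)
    message = sender(target)
    return message, receiver(message, candidates) == correct_index


rng = random.Random(0)
target = [0.10, 0.50, 0.90, 0.30]
same_class = [0.12, 0.55, 0.95, 0.35]  # hypothetical second exemplar of the class
distractors = [[0.9, 0.1, 0.2, 0.7], [0.6, 0.6, 0.1, 0.1]]

ref_message, ref_ok = play_round(target, target, distractors, rng)
wc_message, wc_ok = play_round(target, same_class, distractors, rng)
print(ref_message, ref_ok, wc_ok)
```

In a trained system, both functions would be neural networks optimized on the success signal; the integer message returned here corresponds to the symbolic representation that the classification layer would later map to class labels.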