We present a large-scale empirical investigation of zero-shot learning in a specific category of recognizing textual entailment (RTE) tasks: the automated mining of leaderboards for empirical AI research. Previously reported state-of-the-art models for leaderboard extraction, formulated as an RTE task in a non-zero-shot setting, are promising, with reported performance above 90%. However, a central research question remains unexamined: did the models actually learn entailment? In the experiments in this paper, we therefore test two previously reported state-of-the-art models out of the box for their ability to generalize, or their capacity for entailment, given leaderboard labels that were unseen during training. We hypothesize that if the models learned entailment, their zero-shot performance should also be moderately high, or, concretely, at least better than chance. As a result of this work, we create a zero-shot labeled dataset, built via distant labeling, that formulates leaderboard extraction as an RTE task.