Modeling text-based time series to predict a future event or outcome is an important task with a wide range of applications. The standard approach is to train and test the model using the same input window, but this neglects the text collected between the prediction time and the final outcome, which is often available during training as part of a longer input window. In this study, we propose to treat this neglected text as privileged information available during training and use it to enhance early prediction modeling through knowledge distillation, an approach we call Learning using Privileged tIme-sEries Text (LuPIET). We evaluate the method on clinical and social media text, with four clinical prediction tasks based on clinical notes and two mental health prediction tasks based on social media posts. Our results show that LuPIET is effective in enhancing text-based early predictions, although achieving optimal performance may require choosing an appropriate text representation and an appropriate window for the privileged text. Compared with two other methods, one using transfer learning and one using mixed training, LuPIET offers more stable improvements over the standard-training baseline. To the best of our knowledge, this is the first study to examine learning using privileged information for time series in the NLP context.
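To make the distillation idea concrete, the following is a minimal sketch of a training objective in which a student that sees only the early input window is guided by a teacher that saw the longer, privileged window. It assumes a conventional temperature-scaled distillation loss; the function names, the `alpha` weighting, and the `temperature` value are illustrative assumptions and are not taken from this study.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Hypothetical combined loss (an assumption, not the paper's exact form).

    student_logits: output of the model that sees only the early window.
    teacher_logits: output of a model trained on the longer, privileged window.
    alpha: weight on the hard-label term; (1 - alpha) weights the soft term.
    """
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The temperature**2 factor is the usual rescaling of the soft-target term.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Example: the teacher (longer window) is confident, the student is not.
loss = distillation_loss(student_logits=[0.0, 0.0],
                         teacher_logits=[2.0, -2.0],
                         label=0)
```

At test time only the student is used, so no privileged text is needed for prediction; the teacher influences the student solely through the soft-target term during training.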