Experience and Prediction: A Metric of Hardness for a Novel Litmus Test

Over the last decade, the Winograd Schema Challenge (WSC) has become a focal point for the research community as a novel litmus test. Consequently, the WSC has spurred research interest because it can be seen as a means to understand human behavior. In this regard, the development of new techniques has enabled the use of Winograd schemas in various fields, such as the design of novel forms of CAPTCHAs. Prior work that established a baseline for human adult performance on the WSC has shown that not all schemas are equally difficult, meaning that they could potentially be categorized according to their perceived hardness for humans. Such a \textit{hardness-metric} could be used in future challenges or in the WSC CAPTCHA service to differentiate between Winograd schemas. Our recent work has shown that this can be achieved with an automated system that outputs the hardness-indexes of Winograd schemas, albeit with limitations on the number of schemas to which it can be applied. This paper adds to previous research by presenting a new system based on Machine Learning (ML) that is able to output the hardness of any Winograd schema faster and more accurately than any previously used method. Our system, which is implemented with two different approaches, namely random forest and deep learning (LSTM-based), is ready to be used as an extension of any other system that aims to differentiate between Winograd schemas according to their perceived hardness for humans. In addition, we extend previous work by presenting the results of a large-scale experiment that shows how human performance varies across Winograd schemas.
