Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and the F1 score, are limited: they offer only a quantitative view of a model's performance, without considering the complexity of the data or the quality of each correct prediction. To overcome these limitations, recent research has introduced psychometric methods such as Item Response Theory (IRT), which enable assessment at the level of the latent characteristics of instances. This work investigates how IRT concepts can enrich the confusion matrix in order to identify which model is the most appropriate among options with similar performance. In the study carried out, IRT does not replace classical metrics but complements them, offering an additional layer of evaluation and allowing observation of the fine-grained behavior of models on specific instances. It was also observed, with 97% confidence, that the IRT-based score contributes information distinct from that of 66% of the classical metrics analyzed.
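To make the IRT framing concrete, the sketch below shows the standard two-parameter logistic (2PL) IRT model, under which classifiers play the role of respondents with a latent ability and instances play the role of items with a difficulty and a discrimination. The parameter values are purely illustrative and are not taken from the study.

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT model: probability that a respondent with latent ability
    `theta` answers correctly an item with discrimination `a` and
    difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Illustrative only: two models with different estimated abilities
# evaluated on the same moderately hard instance.
p_strong = irt_2pl(theta=1.2, a=1.0, b=0.8)
p_weak = irt_2pl(theta=0.4, a=1.0, b=0.8)
assert p_strong > p_weak  # higher ability -> higher hit probability
```

In this view, two models with identical accuracy can still differ in which items they get right: correctly classifying hard, discriminative instances raises the estimated ability more than hitting easy ones, which is the extra layer of evaluation the abstract refers to.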