Towards Handling Uncertainty-at-Source in AI -- A Review and Next Steps for Interval Regression

Most of statistics and AI draw insights through modelling discord or variance between sources of information (i.e., inter-source uncertainty). Increasingly, however, research is focusing upon uncertainty arising at the level of individual measurements (i.e., within- or intra-source), such as for a given sensor output or human response. Here, adopting intervals rather than numbers as the fundamental data-type provides an efficient, powerful, yet challenging way forward -- offering systematic capture of uncertainty-at-source, increasing informational capacity, and ultimately potential for insight. Following recent progress in the capture of interval-valued data, including from human participants, conducting machine learning directly upon intervals is a crucial next step. This paper focuses on linear regression for interval-valued data as a recent growth area, providing an essential foundation for broader use of intervals in AI. We conduct an in-depth analysis of state-of-the-art methods, elucidating their behaviour, advantages, and pitfalls when applied to datasets with different properties. Specific emphasis is given to the challenge of preserving mathematical coherence -- i.e., ensuring that models maintain fundamental mathematical properties of intervals throughout -- and the paper puts forward extensions to an existing approach to guarantee this. Carefully designed experiments, using both synthetic and real-world data, are conducted -- with findings presented alongside novel visualisations for interval-valued regression outputs, designed to maximise model interpretability. Finally, the paper makes recommendations concerning method suitability for datasets with specific properties and highlights remaining challenges and important next steps for developing AI with the capacity to handle uncertainty-at-source.
