Finding errors in machine learning applications requires a thorough exploration of their behavior over data. Existing approaches used by practitioners are often ad hoc and lack the abstractions needed to scale this process. We present TorchQL, a programming framework to evaluate and improve the correctness of machine learning applications. TorchQL allows users to write queries to specify and check integrity constraints over machine learning models and datasets. It seamlessly integrates relational algebra with functional programming to allow for highly expressive queries using only eight intuitive operators. We evaluate TorchQL on diverse use cases, including finding critical temporal inconsistencies in objects detected across video frames in autonomous driving, finding data imputation errors in time-series medical records, finding data labeling errors in real-world images, and evaluating biases and constraining outputs of language models. Our experiments show that TorchQL enables up to 13x faster query execution than baselines like Pandas and MongoDB, and up to 40% shorter queries than native Python. We also conduct a user study and find that TorchQL is natural enough for developers familiar with Python to specify complex integrity constraints.
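To make the idea of an integrity constraint expressed as a query concrete, the following is a minimal sketch in plain Python. It does not use the TorchQL API: the `Table` class, its `join`, `filter`, and `project` methods, and the toy `detections` data are all hypothetical names introduced only for illustration. The sketch combines relational-style operators with user-supplied functions to flag a temporal inconsistency, namely a tracked object whose predicted label changes between consecutive video frames.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, List, Tuple

Row = Tuple[Any, ...]


class Table:
    """Hypothetical in-memory table of tuples (NOT the TorchQL API)."""

    def __init__(self, rows: Iterable[Row]):
        self.rows: List[Row] = list(rows)

    def filter(self, pred: Callable[..., bool]) -> "Table":
        # Keep only the rows for which the user-supplied predicate holds.
        return Table(r for r in self.rows if pred(*r))

    def project(self, fn: Callable[..., Row]) -> "Table":
        # Map every row to a new row with a user-supplied function.
        return Table(fn(*r) for r in self.rows)

    def join(
        self,
        other: "Table",
        key: Callable[..., Hashable],
        other_key: Callable[..., Hashable],
    ) -> "Table":
        # Hash join: index the right side by its key, then probe with the left.
        index: Dict[Hashable, List[Row]] = {}
        for r in other.rows:
            index.setdefault(other_key(*r), []).append(r)
        return Table(
            left + right
            for left in self.rows
            for right in index.get(key(*left), [])
        )


# Toy detections: (frame, track_id, predicted_label). Assumed data.
detections = Table([
    (0, "a", "car"),
    (1, "a", "car"),
    (2, "a", "truck"),
    (0, "b", "pedestrian"),
    (1, "b", "pedestrian"),
])

# Constraint: the same track should keep its label in consecutive frames.
# Pair each detection with the same track in the next frame, then keep
# the pairs whose labels disagree.
violations = (
    detections
    .join(
        detections,
        key=lambda frame, track, label: (frame + 1, track),
        other_key=lambda frame, track, label: (frame, track),
    )
    .filter(lambda f1, t1, l1, f2, t2, l2: l1 != l2)
    .project(lambda f1, t1, l1, f2, t2, l2: (t1, f1, f2, l1, l2))
)

print(violations.rows)
```

In this sketch the constraint itself is the pair of lambdas passed to `join` and `filter`; the relational operators only supply the plumbing. The actual TorchQL operators, their names, and their semantics may differ from this illustration.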