Benchmarking and evaluating deep learning models and systems requires a meticulous approach to ensure a comprehensive assessment. In practical applications, both inference quality and inference time must be considered, particularly in critical contexts, where stringent requirements demand that both metrics be satisfied simultaneously. Neglecting either one can have severe and irreversible consequences, including loss of human life and property damage. Unfortunately, many studies do not consider these metrics together and are often conducted under ideal or permissive conditions, which leads to incomplete or non-intuitive evaluation methodologies. This study reveals that deep learning inference quality fluctuates, which further complicates benchmarking and evaluation. To better characterize this phenomenon, this paper introduces the concept of "tail quality", which denotes the quality at the tail of the distribution. "Tail quality" can offer a more objective evaluation, overcoming the limitations of conventional inference quality and inference time metrics, which do not capture quality fluctuation. To capture the phenomenon, this paper also proposes a novel evaluation framework for comprehensively assessing and analyzing the factors that affect inference time and inference quality. The framework makes it possible to anticipate the potential distributions of inference time and inference quality, and thus to capture "tail quality" before deep learning is applied in practice. The effectiveness of the framework is validated through experiments on deep learning models for three different tasks across four systems. Using the framework, the experiments also provide a preliminary analysis of several factors that influence inference quality and inference time.
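To make the idea of quality "at the tail of the distribution" concrete, the following is a minimal sketch of one possible reading, not the paper's definition or framework. It assumes that a prediction arriving after a deadline counts as wrong, that latencies fluctuate between measurement rounds, and that tail quality is taken as a low percentile of per-round accuracy. The function names (`round_quality`, `tail_quality`, `simulate_rounds`), the deadline, the lognormal latency model, and the 5th percentile are all illustrative assumptions, and the data is synthetic.

```python
import math
import random


def round_quality(correct, latencies_ms, deadline_ms):
    """Accuracy of one round when late answers are counted as wrong (assumption)."""
    hits = sum(1 for ok, t in zip(correct, latencies_ms) if ok and t <= deadline_ms)
    return hits / len(correct)


def tail_quality(qualities, percentile=5):
    """Nearest-rank low percentile of per-round quality (assumed definition)."""
    ordered = sorted(qualities)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def simulate_rounds(n_rounds=200, n_samples=500, deadline_ms=30.0, seed=0):
    """Synthetic data only: fixed correctness, lognormal latencies redrawn each round."""
    rng = random.Random(seed)
    # Whether the model's answer is right does not change between rounds.
    correct = [rng.random() < 0.9 for _ in range(n_samples)]
    qualities = []
    for _ in range(n_rounds):
        # Only the timing fluctuates, which makes the effective quality fluctuate.
        latencies = [rng.lognormvariate(math.log(20.0), 0.3) for _ in range(n_samples)]
        qualities.append(round_quality(correct, latencies, deadline_ms))
    return qualities


if __name__ == "__main__":
    q = simulate_rounds()
    print("mean quality:", sum(q) / len(q))
    print("tail quality (5th percentile):", tail_quality(q, 5))
```

In this sketch the model's answers never change; the per-round quality varies only because some answers miss the deadline in some rounds. Reporting a single average would hide that variation, whereas a low percentile exposes the worst rounds.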