Online A/B testing has been widely used by software companies to evaluate the impact of new technologies by offering them to a group of users and comparing the results against the unmodified product. However, running an online A/B test requires not only effort in design, implementation, and obtaining stakeholders' approval before it can be served in production, but also several weeks of iterative data collection. To address these issues, an emerging topic called \textit{offline A/B testing} is attracting increasing attention. Its goal is to evaluate a new technology offline by estimating its impact from historical logged data. This approach is promising because of its lower implementation effort, faster turnaround time, and absence of potential harm to users. However, several limitations must be addressed before it can be effectively prioritized as a requirement in practice, including its discrepancy with online A/B test results and the lack of systematic updates as new data arrive. In response, in this vision paper we introduce AutoOffAB, an idea to automatically run variants of offline A/B testing against recent logs and update the offline evaluation results, which can then be used to make decisions on requirements more reliably and systematically.