The predominant method for computing confidence intervals (CIs) in few-shot learning (FSL) samples tasks with replacement, i.e.\ it allows the same samples to appear in multiple tasks. The resulting CIs are misleading: they account for the randomness of the sampler but not of the data itself. To quantify the extent of this problem, we conduct a comparative analysis of CIs computed with and without replacement, which reveals a notable underestimation by the predominant method. This observation calls for a reevaluation of how confidence intervals, and the conclusions drawn from them, are interpreted in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the size of the CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again
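The contrast between the two sampling schemes can be sketched as follows. This is a minimal illustration on synthetic scores, not the paper's actual evaluation pipeline: the `pool` array, the task size, and the number of tasks are all assumptions, and the naive normal-approximation CI stands in for whatever interval estimator a given study uses. The key structural point is that sampling with replacement lets tasks share samples (and lets one draw arbitrarily many tasks), whereas disjoint tasks are bounded by the data and expose its own randomness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-sample scores; in a real FSL study these
# would come from evaluating a model on sampled tasks (assumption).
pool = rng.normal(loc=0.7, scale=0.1, size=10_000)

def ci95_halfwidth(task_scores):
    """Naive 95% CI half-width over task-level scores (normal approx.),
    treating tasks as independent -- the assumption that with-replacement
    sampling silently violates when tasks share samples."""
    task_scores = np.asarray(task_scores)
    return 1.96 * task_scores.std(ddof=1) / np.sqrt(len(task_scores))

task_size = 25
n_tasks = 200

# WITH replacement: the same samples may appear in many tasks, and the
# number of tasks (hence the CI width) is not limited by the data.
with_repl = [pool[rng.integers(0, len(pool), task_size)].mean()
             for _ in range(n_tasks)]

# WITHOUT replacement: tasks are disjoint slices of a permutation, so the
# data's finiteness caps how many tasks exist.
perm = rng.permutation(len(pool))
without_repl = [pool[perm[i:i + task_size]].mean()
                for i in range(0, n_tasks * task_size, task_size)]

print(f"with replacement:    CI half-width = {ci95_halfwidth(with_repl):.4f}")
print(f"without replacement: CI half-width = {ci95_halfwidth(without_repl):.4f}")
```

Note that the naive formula reports a similar half-width in both cases here; the underestimation the abstract refers to arises because, with replacement, the task scores are correlated through shared samples, so treating them as independent understates the true uncertainty over the data.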