For many data analysis tasks, only the explanatory variables are available, and evaluating the response values is expensive. When obtaining the responses of all units is impractical or too costly, a natural remedy is to judiciously select a good sample of units for which the responses are to be evaluated. In this paper, we adopt classical criteria from the design of experiments to quantify the information that a given sample carries for parameter estimation. We then provide a theoretical justification for approximating the optimal sample problem by a continuous problem, for which fast algorithms with guaranteed global convergence can be developed. Our results have the following novelties: (i) the statistical efficiency of any candidate sample can be evaluated without knowing the exact optimal sample; (ii) the approach applies to a very wide class of statistical models; (iii) it can be integrated with a broad class of information criteria; (iv) it is much faster than existing algorithms; (v) a geometric interpretation is used to theoretically justify relaxing the original combinatorial problem to a continuous optimization problem.
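As a rough illustration of the relaxation idea and of point (i), the sketch below treats the simplest case only: a linear model with the D-optimality criterion. It is not the algorithm of this paper. It uses the classical multiplicative weight update for the continuous problem, and a lower bound on D-efficiency that follows from concavity of the log-determinant. The function names and the choice of a linear model are assumptions made for this example.

```python
import numpy as np

def information_matrix(X, w):
    """Weighted information matrix M(w) = sum_i w_i x_i x_i^T (linear model, assumed)."""
    return X.T @ (w[:, None] * X)

def variance_function(X, w):
    """d_i(w) = x_i^T M(w)^{-1} x_i for every candidate unit i."""
    M_inv = np.linalg.inv(information_matrix(X, w))
    return np.einsum("ij,jk,ik->i", X, M_inv, X)

def multiplicative_d_optimal(X, n_iter=500):
    """Continuous relaxation: weights on the simplex instead of a 0/1 selection.

    Classical multiplicative update w_i <- w_i * d_i(w) / p. Since
    sum_i w_i d_i(w) = p, the weights keep summing to one.
    """
    N, p = X.shape
    w = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        w = w * variance_function(X, w) / p
    return w

def d_efficiency_lower_bound(X, w):
    """Lower bound on the D-efficiency of w without knowing the optimum.

    Concavity of log det gives
    log det M(w*) <= log det M(w) + max_i d_i(w) - p,
    hence efficiency >= exp(1 - max_i d_i(w) / p).
    """
    p = X.shape[1]
    return float(np.exp(1.0 - variance_function(X, w).max() / p))
```

To assess a candidate sample `S` of size `n`, one can set `w` to `1/n` on the indices in `S` and zero elsewhere, provided the resulting information matrix is nonsingular. Because the continuous optimum is at least as informative as the best exact sample, the same bound also holds relative to the unknown optimal sample.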