The German Tank Problem with Multiple Factories

During the Second World War, the Allies urgently needed estimates of the number of tanks Germany had deployed. They used two methods to obtain them: espionage and statistical analysis. The latter was far more successful and works as follows: assuming the tanks are numbered sequentially starting from 1, if we observe $k$ serial numbers from an unknown total of $N$ tanks, and the highest observed number is $M$, then the best linear unbiased estimator for $N$ is $M(1+1/k)-1$. This is now known as the German Tank Problem. Suppose one wishes to estimate the productivity of a rival by inspecting captured or destroyed tanks, each with a unique serial number. In many situations the original German Tank Problem is insufficient, since there are typically $l>1$ factories, and tanks produced by different factories may have serial numbers in disjoint, often widely separated ranges, so the serial numbers taken together are neither contiguous nor guaranteed to start from 1. We wish to estimate the total tank production $N$ across all of the factories. We construct an efficient procedure to estimate this total and prove that it estimates $N$ effectively when $\log l/\log k$ is sufficiently small, and that it is robust to both large and small gaps between factories. In the final section, we show that, given information about the gaps, we can construct a far better estimator that remains effective even with a small number of samples. When the number of samples is small compared to the number of gaps, the mean squared error of this new estimator is several orders of magnitude smaller than that of the estimator that assumes no such information. This quantifies the importance of hiding such information if one wishes to conceal one's productivity from a rival.
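
As a minimal illustration of the single-factory estimator $M(1+1/k)-1$ quoted above, and of why it breaks down when serial ranges are disjoint, consider the following Python sketch. The function name `german_tank_estimate` and the sample serial numbers are illustrative assumptions, not taken from the paper, and the code does not implement the multi-factory procedure the paper constructs.

```python
def german_tank_estimate(serials):
    """Classical single-factory estimate M * (1 + 1/k) - 1.

    `serials` is a collection of distinct observed serial numbers,
    assumed to come from one factory numbering its tanks 1, 2, ..., N.
    """
    k = len(serials)
    if k == 0:
        raise ValueError("at least one serial number is required")
    m = max(serials)  # highest observed serial number M
    return m * (1 + 1 / k) - 1


# Single factory: four observed serials, highest is 60.
single = german_tank_estimate([19, 40, 42, 60])
print(single)  # 74.0

# Hypothetical two-factory case: factory A uses serials 1..100 and
# factory B uses 1001..1100, so the true total is 200 tanks.
# Applying the single-factory formula to the pooled sample counts
# the gap between the two ranges as if it were production.
pooled = german_tank_estimate([12, 57, 1033, 1090])
print(pooled)  # 1361.5
```

In the hypothetical second case the naive estimate is 1361.5 against a true total of 200, which is the kind of failure that motivates treating the gaps between factories explicitly.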
