Article

Optimal Identity Testing with High Probability

Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price.
(2017). arXiv:1708.02728.

Abstract

We study the problem of testing identity against a given distribution, with a focus on the high-confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $0 < \varepsilon, \delta < 1$, we wish to distinguish, with probability at least $1-\delta$, whether the distributions are identical versus $\varepsilon$-far in total variation distance. Most prior work focused on the case that $\delta = \Omega(1)$, for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\varepsilon^2)$. Given such an algorithm, one can achieve arbitrarily small values of $\delta$ via black-box amplification, which multiplies the required number of samples by $\Theta(\log(1/\delta))$. We show that black-box amplification is suboptimal for any $\delta = o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is $$\Theta\left( \frac{1}{\varepsilon^2}\left( \sqrt{n \log(1/\delta)} + \log(1/\delta) \right)\right)$$ for any $n$, $\varepsilon$, and $\delta$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $d_{\mathrm{TV}}(p, U_n) \geq \varepsilon$, we simply threshold $d_{\mathrm{TV}}(\widehat{p}, U_n)$, where $\widehat{p}$ is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant-$\delta$ case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of $\varepsilon$ and $\delta$.
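To make the plug-in uniformity tester described above concrete, here is a minimal Python sketch: it forms the empirical distribution $\widehat{p}$ from the samples, computes $d_{\mathrm{TV}}(\widehat{p}, U_n)$, and compares it to a cutoff. The function name and the choice to pass the cutoff in as a parameter are illustrative assumptions, not taken from the paper; the paper's analysis (not reproduced here) determines how the cutoff and the number of samples should be set as functions of $n$, $\varepsilon$, and $\delta$.

```python
import numpy as np

def test_uniformity(samples, n, threshold):
    """Plug-in uniformity tester (illustrative sketch).

    Accepts "p = U_n" iff the total variation distance between the
    empirical distribution of `samples` and the uniform distribution
    U_n is at most `threshold`. How to choose `threshold` and the
    sample size as functions of n, epsilon, and delta is exactly what
    the paper's analysis establishes; it is taken as given here.
    """
    counts = np.bincount(np.asarray(samples), minlength=n)  # frequency of each symbol
    p_hat = counts / counts.sum()             # empirical distribution \hat{p}
    tv = 0.5 * np.abs(p_hat - 1.0 / n).sum()  # d_TV(\hat{p}, U_n)
    return tv <= threshold

# Example call on samples drawn from the uniform distribution itself:
# test_uniformity(np.random.randint(0, 1000, size=50_000), n=1000, threshold=0.1)
```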

Users

  • @kirk86
