can now generate all pairs $i,j$ for which $x_i^\pi$ is present in both their sketches. From these we can compute, for each pair $i,j$ with non-zero sketch overlap, a count of the number of $x_i^\pi$ values they have in common. By applying a preset threshold, we know which pairs $i,j$ have heavily overlapping sketches. For instance, if the threshold were 80% and each sketch held 200 values, we would need the count to be at least 160 for any $i,j$. As we identify such pairs, we run the union-find algorithm to group documents into near-duplicate ``syntactic clusters''. This is essentially a variant of the single-link clustering algorithm introduced in Section 17.2.
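The counting and grouping steps can be illustrated with a minimal Python sketch. It is not the implementation used in the text: the function name `cluster_near_duplicates`, the dictionary layout of `sketches` (document id mapped to a set of sketch values), and the `threshold_count` parameter are all assumptions made for illustration, and the pair counts are held in memory rather than produced by sorting.

```python
from collections import defaultdict
from itertools import combinations


def cluster_near_duplicates(sketches, threshold_count):
    """Group documents whose sketches share at least threshold_count values.

    sketches: hypothetical dict mapping a document id to its set of sketch values.
    threshold_count: minimum number of shared values (e.g. 160 out of 200).
    """
    # Invert the sketches: sketch value -> documents whose sketch contains it.
    postings = defaultdict(list)
    for doc_id, sketch in sketches.items():
        for value in set(sketch):
            postings[value].append(doc_id)

    # Every pair sharing a value gets its overlap count incremented,
    # so only pairs with non-zero overlap are ever touched.
    overlap = defaultdict(int)
    for docs in postings.values():
        for i, j in combinations(sorted(docs), 2):
            overlap[(i, j)] += 1

    # Union-find over document ids, with path compression in find().
    parent = {doc_id: doc_id for doc_id in sketches}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Merge the clusters of every pair whose overlap meets the threshold.
    for (i, j), count in overlap.items():
        if count >= threshold_count:
            union(i, j)

    # Collect documents by their union-find representative.
    clusters = defaultdict(set)
    for doc_id in sketches:
        clusters[find(doc_id)].add(doc_id)
    return sorted(sorted(members) for members in clusters.values())
```

Because merging is transitive, two documents can land in the same cluster without themselves meeting the threshold, as long as a chain of qualifying pairs connects them. This is the single-link behaviour noted above.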