The Stanford WebBase project has been collecting topic focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. For example, we collected pages from 350 sites every day for several weeks after the Katrina hurricane disaster. We also collect pages from government Web sites on a regular basis.
The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster).