Webstemmer is a web crawler and HTML layout analyzer that automatically extracts the main text of a news site, without banners, ads, or navigation links mixed in.
Y. Li, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Jagadish. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 21--30. Honolulu, Hawaii, Association for Computational Linguistics, (October 2008)