Web-Harvest is an open-source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery, and regular expressions. Web-Harvest focuses mainly on HTML/XML-based web sites, which still make up the vast majority of Web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.
ANTLR (ANother Tool for Language Recognition) is a parser and translator generator tool that lets one define language grammars in either ANTLR syntax (which is similar to YACC and EBNF, Extended Backus-Naur Form) or a special AST (Abstract Syntax Tree) syntax. ANTLR can create lexers, parsers, and ASTs. ANTLR is more than just a grammar definition language, however: the tools provided allow one to implement the ANTLR-defined grammar by automatically generating lexers and parsers (and tree parsers) in either Java (http://java.sun.com/), C++ (http://anubis.dkuug.dk/jtc1/sc22/wg21/), or Sather (http://www.icsi.berkeley.edu/~sather/).
John Geraci is a guest blogger and heads up the DIY City movement. He will be speaking about DIY City at Where 2.0 in San Jose on 5/20. Since early last Friday, when I got a tip about swine flu in Mexico City from a health researcher, the team that does SickCity has been working to make the system something...
B. Macek and M. Atzmueller. Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM '13), pp. 1477-1478. ACM, New York, (2013)
Z. Ma, A. Sun, and G. Cong. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1173-1174. New York, NY, USA, ACM, (2012)