PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.
Features
* PDF to text extraction
* Merge PDF Documents
* PDF Document Encryption/Decryption
* Lucene Search Engine Integration
* Fill in form data FDF and XFDF
* Create a PDF from a text file
* Create images from PDF pages
* Print a PDF
The LIRE (Lucene Image REtrieval) library a simple way to create a Lucene index of image features for content based image retrieval (CBIR). The used features are taken from the MPEG-7 Standard: ScalableColor, ColorLayout and EdgeHistogram. Furthermore met
Katta is a scalable, failure tolerant, distributed, data storage for real time access.
Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and Hadoop mapfiles.
* Makes serving large or high load indices easy
* Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
* Replicate shards on different servers for performance and fault-tolerance
* Supports pluggable network topologies
* Master fail-over
* Fast, lightweight, easy to integrate
* Plays well with Hadoop clusters
* Apache Version 2 License
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.
Solr is a standalone enterprise search server with a web-services like API. You put documents in it (called "indexing") via XML over HTTP. You query it via HTTP GET and receive XML results.