Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
Katta is a scalable, failure tolerant, distributed, data storage for real time access.
Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and Hadoop mapfiles.
* Makes serving large or high load indices easy
* Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
* Replicate shards on different servers for performance and fault-tolerance
* Supports pluggable network topologies
* Master fail-over
* Fast, lightweight, easy to integrate
* Plays well with Hadoop clusters
* Apache Version 2 License