InstaSearch is an Eclipse plug-in for fast text search in the workspace. The search runs instantly as you type, and the resulting files are displayed in an Eclipse view. It is a lightweight plug-in based on the Apache Lucene search engine. Each file can then be previewed using a few of its most relevant matching lines. Double-clicking a match jumps to the matching line in the file. Main features:
* Instantly shows search results
* Shows suggestions using auto-completion
Solr builds on the well-known Lucene search engine library to create an enterprise search server with a simple HTTP/XML interface. Using Solr, large collections of documents can be indexed.
Relevance ranking of search results is becoming increasingly important in library catalogs. This article describes various ways to optimize result ranking, using the Lucene-based OPAC of Heidelberg University Library as an example. To determine relevance, the contents of individual data fields can be analyzed and weighted; criteria such as a title's popularity, availability, or user ratings, as well as user profiles, can also be taken into account. The article presents various weighting options and approaches for further criteria.
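The field weighting described above can be sketched as a linear combination of per-field match scores plus a popularity boost. This is a minimal illustration of the general technique, not the actual Heidelberg OPAC implementation; all field names and weights are hypothetical:

```python
# Minimal sketch of field-weighted relevance scoring (hypothetical
# field names and weights -- not the actual Heidelberg OPAC logic).

FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "subject": 1.5, "fulltext": 1.0}

def term_matches(term, text):
    """Count case-insensitive whole-word occurrences of a query term."""
    return text.lower().split().count(term.lower())

def score(query, record, popularity_weight=0.5):
    """Combine weighted per-field term matches with a popularity boost."""
    base = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = record.get(field, "")
        for term in query.split():
            base += weight * term_matches(term, text)
    # Popularity (e.g. a loan count) contributes a smaller, secondary boost.
    return base + popularity_weight * record.get("popularity", 0.0)

record = {"title": "Lucene in Action", "author": "Gospodnetic Hatcher",
          "subject": "information retrieval", "popularity": 2.0}
print(score("lucene", record))
```

A real system would normalize field lengths and term frequencies (as Lucene's scoring does); the point here is only how field weights and non-textual criteria can be combined into one score.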
Imagine you can see 160 years of history, all on one screen. You can zoom and pan, you can look at a particular day, you can even do a search. And when you do, the results come up not as a list, but as a heat map that shows where in history that topic appears, and how often.
Katta is a scalable, fault-tolerant, distributed data store for real-time access.
Katta serves large, replicated indices as shards to handle high loads and very large data sets. These indices can be of different types; implementations are currently available for Lucene indices and Hadoop map files.
* Makes serving large or high-load indices easy
* Serves very large Lucene or Hadoop MapFile indices as index shards on many servers
* Replicates shards across different servers for performance and fault tolerance
* Supports pluggable network topologies
* Master fail-over
* Fast, lightweight, easy to integrate
* Plays well with Hadoop clusters
* Apache Version 2 License
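The shard replication idea in the list above can be sketched as a round-robin assignment of each shard's replicas to distinct servers. This is a simplified illustration of the general technique; Katta's actual master uses its own distribution policy:

```python
# Simplified sketch of distributing shard replicas across servers
# (illustrative only -- not Katta's actual assignment policy).

def assign_shards(shards, servers, replication=2):
    """Round-robin each shard's replicas onto distinct servers."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    assignment = {server: [] for server in servers}
    i = 0
    for shard in shards:
        for _ in range(replication):
            # Consecutive slots land on different servers, so the
            # replicas of one shard never share a node.
            assignment[servers[i % len(servers)]].append(shard)
            i += 1
    return assignment

plan = assign_shards(["shard-0", "shard-1", "shard-2"],
                     ["node-a", "node-b", "node-c"], replication=2)
print(plan)
```

With replication of 2, losing any single node still leaves one live copy of every shard, which is the fault-tolerance property the feature list refers to.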
Welcome to TuQS! Turnguard's QuadStore is a first-draft QuadStore implementation with a main focus on data-retrieval speed. TuQS can be queried and updated using openrdf's SAIL API.
* Features
  o Accessible via the SAIL API
  o True QuadStore with graph support
  o High-speed regex SPARQL filters
  o User rights at the triple level
  o Extendable to a QuintStore (or, more generally, to an n-Store)
  o Cacheable SPARQL queries for further speed improvements
  o Clusterable
  o Federatable
  o Full-text searchable
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot - this text is used in the
  User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this address
  (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
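A filled-in version of these properties might look like the following. All values are hypothetical placeholders for an example organization, shown only to illustrate the expected shape of each setting:

```xml
<!-- Hypothetical example values; replace with your organization's details. -->
<property>
  <name>http.agent.name</name>
  <value>examplebot</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>Example Org research crawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.org/crawler.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler at example dot org</value>
</property>
```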
While Lucene has long offered these capabilities, its native capabilities are not intended for large semi-structured document collections (or documents with very different schemas). For this reason we developed SIREn - Semantic Information Retrieval Engine - a Lucene plugin to overcome these shortcomings and efficiently index and query RDF, as well as any textual document with an arbitrary amount of metadata fields.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.
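As a sketch of the HTTP API mentioned above, a search request is an HTTP GET against the select handler with query parameters. The host, port, and field names below are hypothetical; `q`, `rows`, and the `facet` parameters are standard Solr query parameters:

```python
from urllib.parse import urlencode

def solr_select_url(base, params):
    """Build a Solr /select query URL from a base URL and parameters.

    The host/port below are hypothetical defaults for a local instance.
    """
    return base + "/select?" + urlencode(params)

url = solr_select_url("http://localhost:8983/solr",
                      {"q": "lucene", "rows": 10,
                       "facet": "true", "facet.field": "category"})
print(url)
```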
From March 9, 2009
The last time I wrote about integrating Apache Nutch with Apache Solr (about two years ago), it was quite difficult to integrate the two components: you had to apply patches, hunt down required components from various places, and so on. Now there is an easier way.
N. Ferro and D. Harman. Multilingual Information Access Evaluation I: Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, 2010.
D. Hiemstra and C. Hauff. Multilingual and Multimodal Information Access Evaluation, volume 6360 of Lecture Notes in Computer Science, pages 64-69. Springer, Berlin, 2010.
U. Schindler and I. Drost. Java Magazin, 2010. Additional interesting points mentioned in the article:
1) The frequency of individual search queries usually follows a Zipf distribution.
2) Distance calculation for geodata using the haversine formula.
3) Cartesian tiers.
4) The scientific information system PANGAEA.
5) Google's KML Regions documentation.
6) Geohashes.
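The haversine distance mentioned in point 2 can be sketched as follows. This is the standard great-circle distance formula using a mean Earth radius of 6371 km, shown as an illustration rather than the article's own code:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + \
        math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hamburg (53.55, 9.99) to Munich (48.14, 11.58): roughly 600 km.
print(haversine_km(53.55, 9.99, 48.14, 11.58))
```

Lucene's spatial support uses this kind of distance function to filter and sort hits by proximity to a query point.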