Omnifont OCR for recognizing Fraktur type. ABBYY FineReader XIX is a special version of the multiple-award-winning OCR software FineReader for recognizing texts printed in Fraktur between 1800 and 1938.
Newspaper collections are the subject of an increasing number of large-scale digitisation projects. In Papers Past (http://paperspast.natlib.govt.nz), a collection of over a million newspaper pages, the introduction of full-text search has made a wealth of information findable that was previously hidden. The search feature is dependent on text extracted from the newspaper page images with Optical Character Recognition (OCR), so any improvement in OCR accuracy will add value to the collection by improving our users' chances of finding useful information.
This article details the work undertaken by the National Library of Australia Newspaper Digitisation Program on identifying and testing solutions to improve OCR accuracy in large-scale newspaper digitisation programs. In 2007 and 2008 several different solutions were identified, applied and tested on digitised material now available in the Australian Newspapers Digitisation Program beta service <http://ndpbeta.nla.gov.au/ndp/del/home>. This article gives a state-of-the-art overview of how OCR software works on newspapers, factors that affect OCR accuracy, methods of measuring accuracy, methods of improving accuracy, and testing methods and results for specific solutions that were considered viable for large-scale text digitisation projects.
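One common way to measure OCR accuracy is character accuracy based on edit distance between a hand-corrected ground-truth transcription and the OCR output. The sketch below illustrates that general idea only; it is not necessarily the measure used in the article, and the function names `levenshtein` and `character_accuracy` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length),
    clamped to the range [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

For example, `character_accuracy("abcd", "abxd")` counts one substitution against four ground-truth characters. Word-level accuracy can be computed the same way by passing lists of words instead of strings to a variant of `levenshtein`.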
Large quantities of historical newspapers are being digitized and OCR'd. We describe a framework for processing the OCR'd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate their automatic indexing. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.
digitally breaks electronic, paper, microfilm, or microfiche documents down into their components and creates searchable content while simultaneously
The DjVuLibre XML Tools provide for editing the metadata, hyperlinks and hidden text associated with DjVu files. Unlike djvused(1), the DjVuLibre XML Tools rely on XML technology and can take advantage of XML editors and verifiers.
OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
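Because hOCR is ordinary HTML, its OCR information can be read with a generic HTML parser. The following minimal sketch uses Python's standard `html.parser` and assumes words are marked with `class="ocrx_word"` and carry `bbox` and `x_wconf` properties in the `title` attribute, as is common in hOCR output; the helper names `parse_title` and `extract_words` are hypothetical, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 10 20 120 50; x_wconf 93'
    into a dict mapping property name to its list of values."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class HocrWordParser(HTMLParser):
    """Collects the text, bounding box and confidence of each ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any
        self._tag = None      # tag name that opened the current word
        self._depth = 0       # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._current is None and "ocrx_word" in classes:
            props = parse_title(attrs.get("title") or "")
            bbox = props.get("bbox", [])
            conf = props.get("x_wconf", [])
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox) if len(bbox) == 4 else None,
                "conf": float(conf[0]) if conf else None,
            }
            self._tag, self._depth = tag, 1
        elif self._current is not None and tag == self._tag:
            self._depth += 1

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(self._current)
                self._current = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Invented sample markup for illustration only.
SAMPLE = (
    "<span class='ocr_line' title='bbox 10 20 300 50'>"
    "<span class='ocrx_word' title='bbox 10 20 120 50; x_wconf 93'>Papers</span> "
    "<span class='ocrx_word' title='bbox 130 20 300 50; x_wconf 71'>Past</span>"
    "</span>"
)
words = extract_words(SAMPLE)
```

Since the layout data lives in attributes and not in the visible text, the same file still renders as a normal web page, which is the property the description above emphasizes.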
OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. This server allows you to use the system through your web browser.