Text Extraction via an Edge-bounded Averaging and a Parametric Character Model
J. Fan. HPL-2002-294. Hewlett Packard Laboratories (2002)
Abstract
We present a text extraction algorithm that is deterministic and parametric.
The algorithm relies on three basic assumptions about text: color/luminance
uniformity of the interior region, closed boundaries of sharp edges
and the consistency of local contrast. The algorithm is basically
independent of the character alphabet, text layout, font size and
orientation. The heart of this algorithm is an edge-bounded averaging
for the classification of smooth regions that enhances robustness
against noise without sacrificing boundary accuracy. We have also
developed a verification process to clean up the residue of incoherent
segmentation. Our framework provides a symmetric treatment for both
regular and inverse text. We have proposed three heuristics for identifying
the type of text from a cluster consisting of two types of pixel
aggregates. Finally, we have demonstrated the advantages of the proposed
algorithm over adaptive thresholding and block-based clustering methods
in terms of boundary accuracy, segmentation coherency, and capability
to identify inverse text and separate characters from background
patches.
Notes: To be published in and presented at Electronic Imaging (SPIE) 2003, 23 January 2003, San Jose, CA
%0 Report
%1 Fan2002
%A Fan, Jian
%D 2002
%K character compound document extraction; image optical processing; recognition segmentation; text
%N HPL-2002-294
%P 20
%T Text Extraction via an Edge-bounded Averaging and a Parametric Character Model
%U http://www.hpl.hp.com/techreports/2002/HPL-2002-294.html; http://www.hpl.hp.com/techreports/2002/HPL-2002-294.pdf
%X We present a text extraction algorithm that is deterministic and parametric.
The algorithm relies on three basic assumptions about text: color/luminance
uniformity of the interior region, closed boundaries of sharp edges
and the consistency of local contrast. The algorithm is basically
independent of the character alphabet, text layout, font size and
orientation. The heart of this algorithm is an edge-bounded averaging
for the classification of smooth regions that enhances robustness
against noise without sacrificing boundary accuracy. We have also
developed a verification process to clean up the residue of incoherent
segmentation. Our framework provides a symmetric treatment for both
regular and inverse text. We have proposed three heuristics for identifying
the type of text from a cluster consisting of two types of pixel
aggregates. Finally, we have demonstrated the advantages of the proposed
algorithm over adaptive thresholding and block-based clustering methods
in terms of boundary accuracy, segmentation coherency, and capability
to identify inverse text and separate characters from background
patches.
%Z To be published in and presented at Electronic Imaging (SPIE) 2003, 23 January 2003, San Jose, CA
@techreport{Fan2002,
abstract = {We present a text extraction algorithm that is deterministic and parametric.
The algorithm relies on three basic assumptions about text: color/luminance
uniformity of the interior region, closed boundaries of sharp edges
and the consistency of local contrast. The algorithm is basically
independent of the character alphabet, text layout, font size and
orientation. The heart of this algorithm is an edge-bounded averaging
for the classification of smooth regions that enhances robustness
against noise without sacrificing boundary accuracy. We have also
developed a verification process to clean up the residue of incoherent
segmentation. Our framework provides a symmetric treatment for both
regular and inverse text. We have proposed three heuristics for identifying
the type of text from a cluster consisting of two types of pixel
aggregates. Finally, we have demonstrated the advantages of the proposed
algorithm over adaptive thresholding and block-based clustering methods
in terms of boundary accuracy, segmentation coherency, and capability
to identify inverse text and separate characters from background
patches.},
note = {To be published in and presented at Electronic Imaging (SPIE) 2003, 23 January 2003, San Jose, CA},
added-at = {2011-03-27T19:35:34.000+0200},
author = {Fan, Jian},
biburl = {https://www.bibsonomy.org/bibtex/23be6dd9c1d2981f25515b4290d599f91/cocus},
file = {:./HPL-2002-294.pdf:PDF},
institution = {Hewlett Packard Laboratories},
interhash = {9da93834e0dd92852dd5bd428bb52186},
intrahash = {3be6dd9c1d2981f25515b4290d599f91},
keywords = {character compound document extraction; image optical processing; recognition segmentation; text},
number = {HPL-2002-294},
pages = 20,
timestamp = {2011-03-27T19:35:39.000+0200},
title = {Text Extraction via an Edge-bounded Averaging and a Parametric Character Model},
url = {http://www.hpl.hp.com/techreports/2002/HPL-2002-294.html; http://www.hpl.hp.com/techreports/2002/HPL-2002-294.pdf},
year = 2002
}