Survey of Text Line Segmentation Methods of Historical Documents

Article written by Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet (2006). Presenting: Erez Lefel and Koby Israel.

1. Introduction

Text line extraction is generally seen as a preprocessing step for tasks such as:
Document structure extraction
Printed character or handwriting recognition

Text line extraction is most commonly needed for ancient and historical documents, printed or handwritten.

2. Characteristics and representation of text lines

Some definitions:
Baseline: fictitious line which follows and joins the lower part of the character bodies in a text line.
Median line: fictitious line which follows and joins the upper part of the character bodies in a text line.
Upper line: fictitious line which joins the top of the ascenders.
Lower line: fictitious line which joins the bottom of the descenders.
Overlapping components: descenders and ascenders located in the region of an adjacent line.
Touching components: ascenders and descenders belonging to consecutive lines which are thus connected.
Text line segmentation: labeling process which consists in assigning the same label to spatially aligned units.

Influence of author style:
Baseline fluctuation: the baseline may vary due to writer movement. It may be straight, straight by segments, or curved.
Line spacing: lines that are widely spaced are easy to find; the problem starts when the line spacing is very small, if it exists at all.
Insertions: words or short text lines may appear between the principal text lines or in the margins.

Influence of poor image quality:
Imperfect preprocessing: smudges, the presence of seeping ink or variable background intensity make image preprocessing difficult and produce binarization errors.
Stroke fragmentation and merging: dots and broken strokes due to low-quality images and/or binarization may produce many connected components.

Figure: Three main axes of document complexity for text line segmentation.

3. Text Line Segmentation

3.1 Preprocessing

In an ideal situation, text line extraction would be performed on a clean document: without background noise or non-textual elements, with well-contrasted writing and as little fragmentation as possible.

In reality, preprocessing is often necessary. The preprocessing methods have to be tailored to each document. Background noise and non-textual elements have to be removed before using any text line extraction method.

Non-textual elements such as:
book bindings
book sides
thumb marks from someone holding the book open
can be removed using criteria such as position and intensity level.

Other non-textual elements such as:
Stamps
Seals
Ornamentation
Decorated initials
can be removed using knowledge about the shape, the color or the position of these elements.

Extracting text from figures can also be performed using texture or morphological filters.

Linear graphical elements such as big crosses (called St Andrew's crosses) appear in some of Flaubert's manuscripts. Removing these elements is performed through a GUI by Kalman filtering.

Kalman filter (from Wikipedia, the free encyclopedia)

In statistics, the Kalman filter is a mathematical method named after Rudolf E. Kalman. Its purpose is to use measurements observed over time, which contain noise (random variations) and other inaccuracies, to produce values that tend to be closer to the true values of the measurements and their associated calculated values. The Kalman filter has many applications in technology, and is an essential part of the development of space and military technology.

Textual but unwanted elements such as bleed-through text can be removed by:
Filtering
Combining the back side image with the front side image

Binarization using global thresholding usually does not work with historical documents, because the background is not uniform.
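To see why a single page-wide threshold struggles on a non-uniform background, the sketch below compares it with a threshold taken from each pixel's own neighbourhood. It is only an illustration on a toy grey-level page stored as nested lists; the function names and the `radius` and `offset` parameters are assumptions made for this example, not part of the surveyed methods.

```python
def global_threshold(gray, t):
    """Mark a pixel as ink (1) when it is darker than one page-wide threshold t."""
    return [[1 if v < t else 0 for v in row] for row in gray]


def local_mean_threshold(gray, radius=1, offset=20):
    """Mark a pixel as ink when it is clearly darker than its neighbourhood mean.

    radius and offset are illustrative parameters, not values from the survey.
    """
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] < mean - offset else 0
    return out


# Toy page: bright background on the left, stained (darker) background on the
# right, with one vertical ink stroke in each half (columns 1 and 5).
page = [
    [200, 120, 200, 200, 100, 40, 100, 100],
    [200, 120, 200, 200, 100, 40, 100, 100],
    [200, 120, 200, 200, 100, 40, 100, 100],
]

ink_global = global_threshold(page, 130)  # the stained background is marked as ink
ink_local = local_mean_threshold(page)    # only columns 1 and 5 are marked as ink
```

On this toy page no single value of `t` separates both strokes from both backgrounds, while the neighbourhood-based rule recovers the two strokes.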

Binarization using local thresholding determines the threshold value from the local properties of the image, e.g. pixel by pixel or region by region.

Writing may be faint, so that over-segmentation or under-segmentation may occur.

3.2 Projection-based methods

Projection profiles are commonly used for printed document segmentation. The vertical projection profile is obtained by summing pixel values along the horizontal axis for each y value. The gaps between the text lines in the vertical direction can then be observed.
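The vertical projection profile described above can be sketched in a few lines. This is a minimal illustration on a binary image stored as nested lists (1 = ink, 0 = background); the function names are hypothetical, and reading text lines off as runs of non-empty rows is the simplest possible interpretation of "gaps", assumed here for clarity.

```python
def vertical_projection_profile(binary):
    """Profile(y): number of ink pixels in row y of a binary image (1 = ink)."""
    return [sum(row) for row in binary]


def text_line_bands(profile):
    """Return (first_row, last_row) for every run of rows with a non-zero profile."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > 0 and start is None:
            start = y                      # a text band begins
        elif value == 0 and start is not None:
            bands.append((start, y - 1))   # a gap closes the band
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


# Toy page with two text lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

profile = vertical_projection_profile(page)  # [0, 4, 4, 0, 3, 5, 0]
bands = text_line_bands(profile)             # [(1, 2), (4, 5)]
```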

Profile(y) = \sum_{x} f(x, y), where f(x, y) is the pixel value at column x and row y.

The vertical profile is not sensitive to writing fragmentation.

Other ways of obtaining a profile curve:
Counting connected components
Projecting black/white transitions

The profile curve can be smoothed by a median filter or a Gaussian filter to eliminate spurious local maxima. The profile curve is then analyzed to find its maxima and minima. Cuts are made at significant minima.

Drawbacks:
Short lines will produce low peaks.
Narrow lines will not produce significant peaks.
In its naive form, the method cannot handle skew in the text.
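The smoothing and cutting steps can be sketched as follows. This is an illustration only: the slides do not say what makes a minimum "significant", so the sketch assumes a valley counts when the smoothed profile drops to `max_valley` or below, and it places one cut in the middle of each such valley. Both function names and both parameters are hypothetical.

```python
def median_smooth(profile, radius=1):
    """Replace each value by the median of its window of up to 2 * radius + 1 values."""
    n = len(profile)
    smoothed = []
    for i in range(n):
        window = sorted(profile[max(0, i - radius): min(n, i + radius + 1)])
        smoothed.append(window[len(window) // 2])
    return smoothed


def cut_rows(profile, max_valley):
    """Return one cut row per valley (run of rows with value <= max_valley).

    Treating "value <= max_valley" as a significant minimum is an assumption
    of this sketch. Valleys touching the top or bottom border are ignored.
    """
    cuts, run_start = [], None
    n = len(profile)
    for y in range(n + 1):
        low = y < n and profile[y] <= max_valley
        if low and run_start is None:
            run_start = y
        elif not low and run_start is not None:
            if run_start > 0 and y < n:
                cuts.append((run_start + y - 1) // 2)  # middle row of the valley
            run_start = None
    return cuts


# Row 4 is a one-row dropout inside the first text line (e.g. a broken stroke).
raw = [0, 0, 8, 9, 0, 9, 8, 1, 0, 1, 7, 9, 8, 0, 0]

cuts_raw = cut_rows(raw, max_valley=1)                    # [4, 8]: line 1 is split
cuts_smooth = cut_rows(median_smooth(raw), max_valley=1)  # [8]: only the real gap
```

The example shows the trade-off mentioned in the drawbacks: smoothing removes the spurious cut, but a very narrow genuine gap would be smoothed away in the same manner.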

In Shapiro's work, the global orientation (skew angle) of a handwritten page is first searched by applying a Hough transform on the entire image. Once this skew angle is obtained, projections are computed along this angle.

3.3 Smearing methods

For printed and binarized documents, smearing methods can be applied. Consecutive black pixels along the horizontal direction are smeared: the white space between them is filled with black pixels if their distance is within a predefined threshold. The bounding boxes of the connected components in the smeared image enclose text lines.
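A minimal sketch of horizontal smearing followed by bounding-box extraction is given below. It assumes a binary image stored as nested lists (1 = black, 0 = white); the names `smear_rows`, `component_boxes` and `max_gap`, and the choice of 8-connectivity, are assumptions of this example rather than details from the survey.

```python
from collections import deque


def smear_rows(binary, max_gap):
    """Fill white runs of length <= max_gap that lie between two black pixels of a row."""
    smeared = [row[:] for row in binary]
    for src, dst in zip(binary, smeared):
        last_black = None
        for x, v in enumerate(src):
            if v == 1:
                if last_black is not None and 0 < x - last_black - 1 <= max_gap:
                    for i in range(last_black + 1, x):
                        dst[i] = 1
                last_black = x
    return smeared


def component_boxes(binary):
    """Bounding boxes (x0, y0, x1, y1) of 8-connected black components, in scan order."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one component
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


# Two text lines, each made of three separate "characters".
page = [
    [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
]

line_boxes = component_boxes(smear_rows(page, max_gap=3))
# [(0, 0, 7, 1), (1, 3, 9, 4)]: one box per text line
```

Before smearing, the same page yields six components (one per character); after smearing, each text line collapses into a single component.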

3.4 Grouping methods

These methods build alignments by aggregating units in a bottom-up strategy. The units may be pixels or higher-level units, such as connected components, blocks etc. Units are joined together to form alignments. The joining scheme relies on both local and global criteria.
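As a rough illustration of the bottom-up idea, the sketch below chains connected-component bounding boxes into alignments from left to right. It is deliberately simplistic and is not one of the surveyed methods: the joining rule (small horizontal gap plus sufficient vertical overlap with the last unit of an alignment) is an assumed local criterion, no global criterion is applied, and the names `group_into_alignments`, `max_gap` and `min_overlap` are hypothetical.

```python
def group_into_alignments(boxes, max_gap, min_overlap=0.5):
    """Greedily chain boxes (x0, y0, x1, y1) left to right into alignments.

    A box joins the alignment whose last box is closest horizontally, provided
    the gap is at most max_gap and the two boxes share at least min_overlap of
    the shorter box's height. Otherwise the box seeds a new alignment.
    """
    alignments = []
    for box in sorted(boxes):  # left-to-right by x0
        best = None
        for alignment in alignments:
            last = alignment[-1]
            gap = box[0] - last[2]
            overlap = min(box[3], last[3]) - max(box[1], last[1]) + 1
            shorter = min(box[3] - box[1], last[3] - last[1]) + 1
            if gap <= max_gap and overlap >= min_overlap * shorter:
                if best is None or gap < best[0]:
                    best = (gap, alignment)
        if best is None:
            alignments.append([box])  # no suitable neighbour: new seed
        else:
            best[1].append(box)
    return alignments


# Three components on a first line (y around 0..9), two on a second (y around 20..29).
boxes = [
    (0, 0, 8, 9), (12, 1, 20, 9), (24, 0, 30, 8),
    (2, 20, 9, 29), (13, 21, 22, 29),
]

lines = group_into_alignments(boxes, max_gap=6)
# lines[0] holds the three boxes of the first line, lines[1] the two of the second
```

A rule this local works on well-separated lines only; with overlapping or touching components a unit can satisfy the test for two alignments, and a conflict-resolution step is then needed.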

Every method has to face the following:
Initiating alignments: one or several seeds for each alignment.
Defining a unit's neighborhood for reaching the next unit (it is generally a rectangular or angular area).
Solving conflicts: as one unit may belong to several alignments under construction, a choice has to be made: discard one alignment or keep both alignments.

Contrary to printed documents, a simple nearest-neighbor joining scheme would often fail to group complex handwritten units, as the nearest neighbor often belongs to another line.

When a conflict occurs, a choice has to be made. The decision can be made using given alignment quality measures, or by comparing the quality measures of the competing units in the neighborhood in the next iteration. Quality measures generally include the strength of the alignment (number of units included). Other quality elements may concern component size, component spacing etc.

Figure: Example of text lines extracted on church registers.

Likforman-Sulem and Faure have developed an iterative method based on perceptual grouping for forming alignments, which has been applied to handwritten pages. Anchors are detected by selecting connected components elongated in specific directions (0°, 45°, 90°, 135°). Each of these anchors becomes the seed of an alignment. First each anchor, then each alignment, is extended to the left and to the right according to given rules. A penalty is given when the alignment includes anchors of different directions.

3.5 Methods based on the Hough transform

The Hough transform is a very popular technique for finding straight lines in images. This method can extract oriented text lines and sloped annotations under the assumption that such lines are almost straight.

The centroids of the connected components are the units for the Hough transform. A set of aligned units in the image along a line with parameters (ρ, θ) is included in the corresponding cell (ρ, θ) of the Hough domain.
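The voting step can be sketched as below, assuming the usual normal-form line parameters ρ = x cos θ + y sin θ, so that θ = 90° corresponds to a horizontal line. The function names, the quantization steps and the restricted angle range are assumptions of this example; the sketch covers voting and picking the fullest cell, not the validation of alignments.

```python
import math
from collections import defaultdict


def hough_cells(points, rho_step=2.0, theta_range=(85, 95), theta_step_deg=1):
    """Vote each centroid into the cells (rho_bin, theta_deg) of the lines through it."""
    cells = defaultdict(list)
    for (x, y) in points:
        for theta_deg in range(theta_range[0], theta_range[1] + 1, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = x * math.cos(theta) + y * math.sin(theta)
            cells[(round(rho / rho_step), theta_deg)].append((x, y))
    return cells


def strongest_cell(cells):
    """Return (cell, units) for the cell holding the most aligned units."""
    return max(cells.items(), key=lambda item: len(item[1]))


# Centroids of components on two horizontal text lines (y = 10 and y = 30).
centroids = [(10, 10), (30, 10), (50, 10), (70, 10), (90, 10),
             (20, 30), (40, 30), (60, 30), (80, 30)]

cells = hough_cells(centroids)
cell, units = strongest_cell(cells)
# The fullest cell gathers the five centroids of the first line;
# the cell (15, 90) gathers the four centroids of the second line.
```

Restricting θ to a narrow range around 90° reflects the assumption that text lines are close to horizontal; sloped annotations would need a wider range.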

3.6 Repulsive-Attractive network method

This method is based on repulsive-attractive forces. It works directly on grey-level images and consists in iteratively adapting the y-position of a predefined number of baseline units. This method has been applied to ancient Ottoman document archives and Latin texts.

How it works:
Baselines are constructed one by one from the top of the image to the bottom.
Pixels of the image act as attractive forces for baselines.
Already extracted baselines act as repulsive forces.
The baseline to be extracted is initialized just under the previously examined one, in order to be repelled by it and attracted by the pixels of the line below.
The lines must have similar lengths.
The result is a set of baselines, each one passing through word bodies.

Figure: Pseudo baselines extracted by a Repulsive-Attractive network on ancient Ottoman text.

3.7 Processing of overlapping and touching components

Overlapping and touching components are the main challenges for text line extraction, since no white space is left between lines.

Detection of ambiguous components can be done in several ways:
Component size.
The component belongs to several alignments.
The component belongs to no alignment.

Once a component is detected as ambiguous, it must be classified into one of two categories:
The component is an overlapping component (it belongs to the upper or lower alignment).
The component is a touching component.

In grouping methods (seen in 3.4) it is common to use the component ambiguity attribute in order to decide whether or not to add the component to the group.

In the Likforman-Sulem method, touching and overlapping components are detected after the text line extraction process described in 3.5 (Methods based on the Hough transform). These components are those which are intersected by at least two different lines (ρ, θ) corresponding to primary cells of validated alignments.

Zahour's method for detecting touching and overlapping components:

Cut the text into 8 columns.
A projection profile is computed on each column.
In each histogram, two consecutive minima delimit a text block.
Classify the text blocks into 3 categories: small, average and big (using the k-means algorithm).
Overlapping components necessarily belong to big physical blocks.
Use the average text block size from the average and small groups to decide into how many pieces the big text blocks should be cut.

3.8 Non-Latin documents

The inter-line space in Latin documents is filled with single dots, ascenders and descenders.
The Arabic script is connected and cursive. Ancient Arabic documents include diacritical points.
Ancient Hebrew documents can include decorated words.
In the alphabets of some Indian scripts, many basic characters have a horizontal line (the head line) in the upper part.

3.8.1 Ancient Arabic documents

The writing in these documents is very dense, and the line spacing is quite small. The method developed in Zahour et al. begins with the detection of overlapping and touching components presented in 3.7.

3.8.2 Ancient Hebrew documents

The manuscripts studied in Likforman-Sulem et al. are written in Hebrew, using Dfus letters, in which most characters are made of horizontal and vertical strokes. The scrolls, intended to be used in the synagogue, do not include diacritics, so there is no separation between words or sentences.

Cases of overlapping components occur with characters such as Lamed, Kaf, and final letters. Since the majority of characters are composed of one connected component, it is more convenient to perform text line segmentation from connected component units (see 3.5, Hough transform).

Summary

Analysis of historical document images is a relatively new domain. These methods have been developed within several projects which perform transcript mapping, authentication, word mapping or word recognition. As the need for recognition and mapping of handwritten material increases, text line segmentation will be used more and more.

Contrary to printed modern documents, a historical document has unique characteristics due to style, artistic effect and writer skills. There is no universal segmentation method which can fit all these documents.

Questions?
