Text line extraction for historical document images (Academic Article)

abstract

  • In this paper we present a language-independent global method for automatic text line extraction. The proposed approach computes an energy map of a text image and determines the seams that pass across and between text lines. We have developed two algorithms based on this novel idea, one for binary images and the other for grayscale images. The first algorithm works on binary document images and assumes that the components along text lines can be extracted. A seam passes through the middle of a text line l, along its length, and marks the components that make up the letters and words of l. The algorithm then assigns each unmarked component to the closest text line. The second algorithm works directly on grayscale document images. It computes the distance transform directly from the grayscale image and generates two types of seams: medial seams and separating seams. The medial seams determine the text lines, and the separating seams define the upper and lower boundaries of these text lines. Moreover, we present a new benchmark dataset of historical document images with various types of challenges. The dataset contains ground truth for text line extraction and includes samples in different languages, such as Arabic, English, and Spanish. A binary dataset is used to test the binary algorithm. We ran experiments with both algorithms on these datasets and report segmentation accuracy. We also compare our algorithms with state-of-the-art text line segmentation methods.
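The seam idea in the abstract can be illustrated with the generic dynamic-programming search used in seam carving. The following is a minimal sketch, not the authors' implementation: the function name `min_horizontal_seam`, the toy energy map (1 for ink, 0 for background), and the restriction to moves between adjacent rows are all assumptions made for illustration. The abstract does not specify how the energy map is built, and the grayscale algorithm derives it from a distance transform rather than from raw ink values.

```python
def min_horizontal_seam(energy):
    """Return one row index per column for the cheapest left-to-right seam.

    `energy` is a list of rows (lists of numbers). The seam may move at most
    one row up or down between neighbouring columns (an assumed constraint).
    """
    rows, cols = len(energy), len(energy[0])
    # cost[r][c]: minimal cumulative energy of any seam ending at (r, c)
    cost = [[0.0] * cols for _ in range(rows)]
    # back[r][c]: row of the predecessor pixel in column c - 1
    back = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        cost[r][0] = energy[r][0]
    for c in range(1, cols):
        for r in range(rows):
            candidates = [p for p in (r - 1, r, r + 1) if 0 <= p < rows]
            best = min(candidates, key=lambda p: cost[p][c - 1])
            cost[r][c] = energy[r][c] + cost[best][c - 1]
            back[r][c] = best
    # Start from the cheapest end point in the last column and walk back.
    r = min(range(rows), key=lambda i: cost[i][cols - 1])
    seam = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        seam.append(r)
    seam.reverse()
    return seam


# Toy energy map (hypothetical): 1 = ink, 0 = background. The only blank
# gap sits in row 2 for the first three columns and in row 1 afterwards.
toy_energy = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
print(min_horizontal_seam(toy_energy))  # → [2, 2, 2, 1, 1, 1]
```

With ink as high energy, the minimum-cost seam follows the gap between lines, which loosely corresponds to a separating seam. Inverting the energy so that the cheapest path runs through the text would give a path closer in spirit to a medial seam.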

publication date

  • January 1, 2014