Monday, April 13, 2015

Binarising manuscript images

The first step in trying to recognise words on a page is to obtain a good black and white representation of the colour original. Typically, when you apply the standard binarisation techniques (like Sauvola's) to manuscript images, thin pen-strokes disappear and the writing breaks up. This makes recognition of lines and words very difficult. Here's part of the Brewster journal (Biodiversity Heritage Library) rendered using the Ocropus toolset:

As you can see, the thin pen-strokes have broken up and the text is difficult to read. But then I realised that the information about these thin strokes was still present in the original greyscale image. By comparing the broken characters with the greyscale at the same location, I should be able to extend them wherever the greyscale pixels were darker than the local average pixel value. For this to work I had to create several copies of the image (a code sketch follows the list):

  1. A Gaussian-blurred version of the original greyscale, with a blur radius of 1/80th of the image height.
  2. The greyscale image itself.
  3. The regular binarised image.
  4. A mask, generated by blurring the binarised image and rendering every pixel the blur touches as pure black.
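
Here's a rough sketch of that preparation in Python with numpy, scipy and scikit-image (my choice of libraries for illustration, not necessarily what I used; the Sauvola window size is a guess, a Gaussian sigma stands in for the blur radius, and the mask's 1/200th radius is the figure given at the end of the post):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.filters import threshold_sauvola

def prepare_images(grey):
    """Build the four working images from a 2-D greyscale page
    (float array, 0 = black, 255 = white)."""
    h = grey.shape[0]

    # 1. Heavily blurred copy: each pixel holds the local average value.
    #    (Using sigma to approximate the post's "blur radius" is an assumption.)
    local_avg = gaussian_filter(grey, sigma=h / 80.0)

    # 3. Standard Sauvola binarisation: True where a pixel is ink.
    #    (window_size=25 is a typical choice, not a figure from the post.)
    binary = grey < threshold_sauvola(grey, window_size=25)

    # 4. Mask: blur the binarised image and render every pixel the blur
    #    touches as black, giving a halo around the found text.
    halo = gaussian_filter(binary.astype(float), sigma=h / 200.0)
    mask = halo > 1e-3

    # 2. is the greyscale image itself, returned unchanged.
    return local_avg, grey, binary, mask
```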

I then examined each pixel in image 3: if it was black, I recomputed the "blob" of connected pixels directly from the greyscale image at the same coordinates. To decide whether each pixel should be black or white I used the value at the same coordinates in image 1, which is effectively the local average pixel value. Since both text and background are included in the calculation, the average at each point is likely to be somewhat darker than the background alone, so any pixels at least that dark in the vicinity of the originally recognised text are very likely to be the missing fragments of letters.
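
One way to realise this step is a vectorised flood fill: take every greyscale pixel at least as dark as the local average, group those candidates into connected blobs, and keep only the blobs that touch the original binarisation. The code below is my reconstruction along those lines, not the original implementation (the mask from step 4 is applied here too; more on it below):

```python
import numpy as np
from scipy.ndimage import label

def extend_strokes(local_avg, grey, binary, mask):
    """Re-grow broken strokes: a pixel is a candidate if it is at least
    as dark as the local average and lies inside the mask; a candidate
    blob is kept only if it touches the original binarisation."""
    candidates = (grey <= local_avg) & mask

    # Label the connected candidate blobs, then find every blob that
    # overlaps at least one originally-black pixel (the "seeds").
    labels, _ = label(candidates)
    seeds = np.unique(labels[binary & candidates])
    seeds = seeds[seeds != 0]

    # Keep the seeded blobs, plus the original binarisation itself.
    return np.isin(labels, seeds) | binary
```

Judge for yourself: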

To stop this extension from bleeding into the surrounding parts of the greyscale image I used the mask to restrict how far the local extensions could go; the blur radius I used for the mask was 1/200th of the image height. I've tried the method on printed books, typescripts and manuscripts, and it seems to work quite well. The beauty of it is how simple it is: no machine learning, no fancy transformations. The small lines under the words could be removed with a blue filter on the original colour image, but since I'm only interested in word-recognition, not recognition of individual letters, these lines mostly do no damage.
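
Putting the two sketches together on a single page looks like this (the filenames are placeholders):

```python
from PIL import Image
import numpy as np

# Hypothetical input file; any page image that converts to greyscale will do.
page = np.asarray(Image.open('brewster_page.png').convert('L'), dtype=float)

local_avg, grey, binary, mask = prepare_images(page)
result = extend_strokes(local_avg, grey, binary, mask)

# Write the result out with ink as black (0) and background as white (255).
Image.fromarray(np.where(result, 0, 255).astype(np.uint8)).save('binarised.png')
```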