Tuesday, October 14, 2014

Improved line recognition

I managed to improve line-recognition, whose weakness has been plaguing the effectiveness of the later stages. Until you can recognise lines in manuscripts, where they are not horizontal, evenly spaced or straight, you have no chance of even recognising words reliably. My previous method, which first divided the page into small rectangles, could detect the same line two or three times over: for example, once each for the ascenders, the descenders and the main body of the line.

The new method first blurs the image, then subdivides it into narrow vertical strips. Each strip is then reduced to a single pixel in width by averaging its pixels horizontally. This produces a graph of the rise and fall of blackness down the strip. Since the data has many small peaks and troughs that aren't really interesting, I apply a smoothing function before trying to detect the main peaks of blackness. These will very likely correspond to the lines of type or writing. The final step is to join up the lines detected in the strips by horizontally aligning them as before.

The result is very good line-recognition on most of the examples. Here's how the De Roberto manuscript, which is fairly average in difficulty, looks with the lines recognised on top of the blurred image:

A side-effect of this approach is that it should improve word-recognition, not only by helping to locate words, but also by joining up word-fragments through blurring. However, I'm running out of time now as the deadline of November 3 looms.
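
For anyone who wants to experiment, the strip-and-peak part of the method can be sketched roughly as follows. This is only an illustration in Python with NumPy, not the code used in the project. The function names, the strip width, the moving-average smoother and the peak threshold are all assumptions made for the example. It also assumes the page has already been blurred and converted to an array in which larger values mean blacker pixels, and it leaves out the final step of joining the per-strip peaks into lines.

```python
import numpy as np

def strip_profiles(darkness, strip_width=32):
    """Cut the page into vertical strips and average each one horizontally.

    `darkness` is a 2-D array (rows x columns); larger values are blacker.
    Returns one 1-D profile per strip, indexed by row.
    """
    width = darkness.shape[1]
    profiles = []
    for x in range(0, width, strip_width):
        strip = darkness[:, x:x + strip_width]
        profiles.append(strip.mean(axis=1))  # strip becomes one pixel wide
    return profiles

def smooth(profile, window=5):
    """Moving average (an assumed choice of smoothing function)."""
    kernel = np.ones(window) / window
    return np.convolve(profile, kernel, mode="same")

def find_peaks(profile, min_height):
    """Row indices of local maxima that are at least `min_height` high.

    Uses >= on one side so a flat-topped peak is still reported,
    once, at its first row.
    """
    peaks = []
    for y in range(1, len(profile) - 1):
        is_high = profile[y] >= min_height
        is_max = profile[y] > profile[y - 1] and profile[y] >= profile[y + 1]
        if is_high and is_max:
            peaks.append(y)
    return peaks

def detect_line_centres(darkness, strip_width=32, window=5, min_height=0.2):
    """For each strip, the rows where a line of writing probably sits."""
    return [
        find_peaks(smooth(profile, window), min_height)
        for profile in strip_profiles(darkness, strip_width)
    ]

# Synthetic page: two horizontal black bands on a white background.
page = np.zeros((60, 64))
page[10:15, :] = 1.0
page[40:45, :] = 1.0
centres = detect_line_centres(page)
```

On a real manuscript the peaks in neighbouring strips will sit at slightly different heights, which is what lets the joined-up lines follow sloping or curved writing. The threshold and window would need tuning for each kind of source.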