Description
Basic Information
tesseract 5.3.0-19-ga3b9ac, compiled with --disable-legacy
Operating System
macOS 13 Ventura
Compiler
clang 14.0
Current Behavior
When Tesseract is compiled with --disable-legacy, hOCR output reports each line as being upside-down (textangle 180) and omits baseline information.
Steps to reproduce:
./configure --disable-legacy
./tesseract some-image.jpg output hocr
In the generated output.hocr file, ocr_line entries look like this:
<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; textangle 180; x_size 34; x_descenders 8; x_ascenders 9">
Expected Behavior
If orientation information isn't available I'd expect the image to always be treated as if it were page-up. So entries should look like this:
<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; baseline 0 -8; x_size 34; x_descenders 8; x_ascenders 9">
Suggested Fix
Internally, it looks like the issue is that:
- ColumnFinder::text_rotation_ is initialized to a null vector. When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called, and so this vector remains null.
- This null vector gets propagated to PageIterator::Orientation, which does not handle this case correctly, as it converts this null vector to ORIENTATION_PAGE_DOWN:
  tesseract/src/ccmain/pageiterator.cpp, line 585 in a3b9acf
  if (up_in_image.y() > 0.0F) {
- The hOCR renderer then maps this orientation value to textangle 180 and omits baseline info.
Some fixes I tested locally were to change the initialization of ColumnFinder::text_rotation_ to be the same as the norotation value in ColumnFinder::CorrectOrientation, or to change the logic in PageIterator::Orientation to handle null rotation vectors by mapping them to ORIENTATION_PAGE_UP. I'm happy to submit a PR but I'm not sure the preferred way to go.
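To make the second option concrete, here is a minimal, self-contained sketch of the failure mode and of a guard against it. It is not Tesseract code: Vec2, PageOrientation, Rotate, ClassifyUnguarded and ClassifyGuarded are hypothetical stand-ins I made up for illustration, and the rotation is assumed to be stored as a (cos, sin) pair. The only point is that rotating the "up" vector by a null vector yields (0, 0), which fails a "y > 0" test and so lands in the page-down branch.

```cpp
// Simplified stand-ins for illustration only; these are NOT Tesseract's
// FCOORD or Orientation types.
struct Vec2 {
  float x;
  float y;
};

enum class PageOrientation { Up, Right, Down, Left };

// Rotate v by the rotation encoded in r as (cos, sin). Assumed convention.
Vec2 Rotate(Vec2 v, Vec2 r) {
  return {v.x * r.x - v.y * r.y, v.y * r.x + v.x * r.y};
}

// Classifies without checking for a null rotation: a (0, 0) rotation
// collapses "up" to (0, 0), and the y > 0 test then picks Down.
PageOrientation ClassifyUnguarded(Vec2 rotation) {
  const Vec2 up = Rotate({0.0F, 1.0F}, rotation);
  if (up.x == 0.0F) {
    return up.y > 0.0F ? PageOrientation::Up : PageOrientation::Down;
  }
  return up.x > 0.0F ? PageOrientation::Right : PageOrientation::Left;
}

// Same classification, but a null rotation is treated as "no rotation",
// i.e. page-up.
PageOrientation ClassifyGuarded(Vec2 rotation) {
  if (rotation.x == 0.0F && rotation.y == 0.0F) {
    return PageOrientation::Up;
  }
  return ClassifyUnguarded(rotation);
}
```

With these definitions, ClassifyUnguarded({0, 0}) returns Down while ClassifyGuarded({0, 0}) returns Up, and both return Up for the identity rotation {1, 0}. The first option (initializing the vector to the identity rotation instead of null) would make the guard unnecessary, at the cost of no longer being able to tell "never computed" apart from "computed as upright".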