HOCR output always sets textangle 180 and omits baseline info if Tesseract is compiled with --disable-legacy  #4010

@robertknight

Basic Information

tesseract 5.3.0-19-ga3b9ac, compiled with --disable-legacy

Operating System

macOS 13 Ventura

Compiler

clang 14.0

Current Behavior

When Tesseract is compiled with --disable-legacy, hOCR output reports each line as being upside-down (textangle 180) and omits baseline information.

Steps to reproduce:

./configure --disable-legacy
./tesseract some-image.jpg output hocr

In the generated output.hocr file, ocr_line entries look like this:

<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; textangle 180; x_size 34; x_descenders 8; x_ascenders 9">

Expected Behavior

If orientation information isn't available, I'd expect the image to be treated as page-up, so entries should look like this:

<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; baseline 0 -8; x_size 34; x_descenders 8; x_ascenders 9">
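A quick way to check existing output for this symptom is to inspect the title attribute of each ocr_line. The sketch below is only illustrative: ParseHocrTitle and LooksAffected are hypothetical helper names, not part of Tesseract, and the check assumes the properties are separated by semicolons as in the examples above. Note that a page that really is upside-down would also match, so this only flags candidates.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: split an hOCR title attribute such as
// "bbox 1 2 3 4; textangle 180" into a map from property name to value text.
std::map<std::string, std::string> ParseHocrTitle(const std::string& title) {
  std::map<std::string, std::string> props;
  std::istringstream fields(title);
  std::string field;
  while (std::getline(fields, field, ';')) {
    std::istringstream words(field);
    std::string name;
    if (!(words >> name)) continue;  // skip empty fields
    std::string value;
    std::getline(words >> std::ws, value);  // rest of the field, may be empty
    props[name] = value;
  }
  return props;
}

// Hypothetical check: true when the line reports textangle 180 and carries
// no baseline property, which is the pattern described in this issue.
bool LooksAffected(const std::string& title) {
  const auto props = ParseHocrTitle(title);
  const auto angle = props.find("textangle");
  return angle != props.end() && angle->second == "180" &&
         props.count("baseline") == 0;
}
```

Passing the title from the Current Behavior example should be flagged, while the title from the Expected Behavior example should not.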

Suggested Fix

Internally, it looks like the issue is that:

  1. ColumnFinder::text_rotation_ is initialized to a null vector. When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called, and so this vector remains null.
  2. This null vector is propagated to PageIterator::Orientation, which does not handle this case correctly: the check below fails when the y component is 0, so a null vector is converted to ORIENTATION_PAGE_DOWN -
    if (up_in_image.y() > 0.0F) {
  3. The hOCR renderer then maps this orientation value to textangle 180 and omits baseline info.

Two fixes I tested locally: changing the initialization of ColumnFinder::text_rotation_ to match the norotation value in ColumnFinder::CorrectOrientation, or changing the logic in PageIterator::Orientation to handle null rotation vectors by mapping them to ORIENTATION_PAGE_UP. I'm happy to submit a PR, but I'm not sure which approach is preferred.
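To make the failure mode and the two candidate fixes concrete, here is a simplified, self-contained model. It is not the Tesseract source: Vec2, Orientation, Rotate, ClassifyOriginal and ClassifyWithNullGuard are hypothetical stand-ins, the branch structure beyond the quoted y() > 0.0F check is an assumption, and the no-rotation value is assumed to be the unit vector (1, 0).

```cpp
// Hypothetical stand-in for a 2D float vector such as Tesseract's FCOORD.
struct Vec2 {
  float x;
  float y;
};

// Hypothetical stand-in for the page orientation enum.
enum class Orientation { kPageUp, kPageRight, kPageDown, kPageLeft };

// Rotate v by the direction vector rot = (cos, sin), i.e. a complex
// multiplication. A null rot collapses every input to (0, 0).
Vec2 Rotate(Vec2 v, Vec2 rot) {
  return {v.x * rot.x - v.y * rot.y, v.y * rot.x + v.x * rot.y};
}

// Assumed shape of the current mapping: only y > 0 counts as page-up,
// so a (0, 0) vector falls through to page-down.
Orientation ClassifyOriginal(Vec2 up) {
  if (up.x == 0.0F) {
    return up.y > 0.0F ? Orientation::kPageUp : Orientation::kPageDown;
  }
  return up.x > 0.0F ? Orientation::kPageRight : Orientation::kPageLeft;
}

// Sketch of the second fix: treat a null vector as page-up explicitly.
Orientation ClassifyWithNullGuard(Vec2 up) {
  if (up.x == 0.0F && up.y == 0.0F) {
    return Orientation::kPageUp;
  }
  return ClassifyOriginal(up);
}
```

In this model, Rotate({0, 1}, {0, 0}) yields a null vector that ClassifyOriginal reports as page-down. Initializing the rotation to (1, 0) instead (the first fix) leaves the up vector at (0, 1), which classifies as page-up, and ClassifyWithNullGuard (the second fix) returns page-up for the null vector directly.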
