PSM_AUTO with jpn_vert gives garbage if the legacy engine is disabled at compile time #3997

@danpla

Basic Information

Tesseract 5.3.0

Operating System

No response

Other Operating System

Windows 7

uname -a

MINGW32_NT-6.1-7601 PC 3.3.6-341.x86_64 2022-11-20 15:12 UTC x86_64 Msys

Compiler

GCC 12.2.0

Virtualization / Containers

No response

CPU

Intel Core i7 Q720

Current Behavior

If tesseract is built without the legacy engine (--disable-legacy), recognizing vertical Japanese text with jpn_vert (from tessdata_fast) and PSM_AUTO (--psm 3) gives garbage. Here is an example image:
[image: 1]
Here is the output of tesseract 1.png stdout -l jpn_vert --psm 3:

…4

/
\

09$2pY
コ べ
ほり メー14

Here is the correct (albeit with some OCR errors) output that I get either without --disable-legacy, or when using PSM_SINGLE_BLOCK_VERT_TEXT (--psm 5) explicitly regardless of whether the legacy engine is enabled:

ケー タイ は
カバ パ バン に
入れ て ある し
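
For anyone who wants to compare the two modes side by side, here is a minimal sketch that wraps the command line quoted above. It assumes the tesseract binary is on PATH and that jpn_vert from tessdata_fast is installed; build_command and run_ocr are hypothetical helper names, not part of Tesseract.

```python
import subprocess

# Page segmentation modes as numbered by the tesseract CLI (--psm N).
PSM_AUTO = 3                      # fully automatic page segmentation
PSM_SINGLE_BLOCK_VERT_TEXT = 5    # single uniform block of vertical text


def build_command(image, lang="jpn_vert", psm=PSM_AUTO):
    """Return the argv list for: tesseract IMAGE stdout -l LANG --psm PSM."""
    return ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image, psm):
    """Run tesseract on `image` with the given PSM and return the recognized text.

    Hypothetical helper: assumes `tesseract` is on PATH. Raises
    subprocess.CalledProcessError if tesseract exits with an error.
    """
    result = subprocess.run(
        build_command(image, psm=psm),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Calling run_ocr("1.png", PSM_AUTO) and run_ocr("1.png", PSM_SINGLE_BLOCK_VERT_TEXT) on a --disable-legacy build should reproduce the two outputs shown above. Passing --psm 5 explicitly is the only workaround reported here.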

Expected Behavior

No response

Suggested Fix

If this is a kind of feature rather than a bug, it should probably be documented somewhere.

Other Information

No response
