Regression in extracting text from Excel TIF image #4014

@jwmepiq

Environment

ExcelTest_Bug.zip
ExcelTest_text_TesseractV4.txt

Tesseract Version: 5.2.0 vs. 4.1.1.-rc2-37-gcla5
Ubuntu 20.04.3 LTS

Current Behavior:

With the attached TIF image of an Excel file (in the zip), Tesseract 5.2.0 extracts only a single line of text ("hiding rows 15 through 20"). Prior versions, specifically the 4.1.1 build noted above and likely others, extract far more from the same image: multiple lines, approximately 1K of text across multiple pages. The V4.x output is attached as a separate text file.
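A minimal sketch of how the difference could be measured, assuming both Tesseract builds are installed side by side. The script name, the image file name, and the binary path in the usage comment are hypothetical (the name of the TIF inside the zip is not given here); the only CLI behavior relied on is `tesseract <image> stdout`, which prints the recognized text to standard output.

```python
import subprocess
import sys


def build_command(image_path, binary="tesseract"):
    """Return the CLI invocation that prints recognized text to stdout."""
    return [binary, image_path, "stdout"]


def summarize(text):
    """Count characters and non-empty lines in OCR output."""
    lines = [line for line in text.splitlines() if line.strip()]
    return {"chars": len(text), "non_empty_lines": len(lines)}


def run_ocr(image_path, binary="tesseract"):
    """Run Tesseract on the image and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, binary),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python compare_ocr.py ExcelTest.tif /opt/tess4/bin/tesseract
    image = sys.argv[1]
    binary = sys.argv[2] if len(sys.argv) > 2 else "tesseract"
    print(summarize(run_ocr(image, binary)))
```

Running this once per installed binary gives a character and line count for each version, which makes the size of the regression easy to state in numbers.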

Expected Behavior:

Tesseract 5.2 and later should extract at least as much text from this sample TIF as prior versions did.
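One way to make "at least replicate" checkable is to measure how many lines of the V4 output also appear in the V5 output. The sketch below is an assumption about how such a check might look, not part of Tesseract; the function name `line_coverage` is hypothetical, and the two strings would come from the attached V4 text file and a fresh 5.2.0 run.

```python
def line_coverage(baseline_text, candidate_text):
    """Fraction of non-empty baseline lines that also appear in the candidate.

    Lines are compared after stripping surrounding whitespace.
    An empty baseline counts as fully covered.
    """
    baseline = [l.strip() for l in baseline_text.splitlines() if l.strip()]
    candidate = {l.strip() for l in candidate_text.splitlines() if l.strip()}
    if not baseline:
        return 1.0
    found = sum(1 for line in baseline if line in candidate)
    return found / len(baseline)
```

Exact line matching is deliberately strict, so small recognition differences between versions will lower the score; a value near zero, as expected for the single line reported above, still indicates that most of the earlier output is missing.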

Suggested Fix:

Correct the text extraction so that it matches the output from previous Tesseract versions. The broader concern is a regression in Tesseract's ability to extract text from images of Excel files.