Description
Environment
ExcelTest_Bug.zip
ExcelTest_text_TesseractV4.txt
Tesseract Version: 5.2.0 vs. 4.1.1.-rc2-37-gcla5
Ubuntu 20.04.3 LTS
Current Behavior:
With the attached TIF image of an Excel file (in the zip), Tesseract 5.2.0 extracts only a single line of text ("hiding rows 15 through 20"). Earlier versions, including the 4.1.1 build noted above and likely others, extract significantly more from the same image: multiple lines, approximately 1 KB of text over multiple pages. The V4.x output is attached as a separate text file.
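For anyone trying to reproduce this, the comparison can be sketched roughly as below. This is only a sketch under assumptions: the image file name `ExcelTest.tif` is hypothetical (use whatever the zip actually contains), the 50% threshold is arbitrary, and it assumes the `tesseract` binary on `PATH` is the 5.2.0 build while the attached text file holds the 4.x baseline.

```python
import shutil
import subprocess
from pathlib import Path


def run_tesseract(image_path, tesseract_bin="tesseract"):
    """Run the tesseract CLI on an image and return the recognized text."""
    result = subprocess.run(
        [tesseract_bin, str(image_path), "stdout"],  # "stdout" as output base prints the text
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def summarize(text):
    """Return (non-empty line count, character count) for an OCR output."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines), len(text)


def looks_like_regression(old_text, new_text, ratio=0.5):
    """Flag a regression when the new output is much shorter than the old one.

    The ratio is an arbitrary illustrative threshold, not a Tesseract setting.
    """
    _, old_chars = summarize(old_text)
    _, new_chars = summarize(new_text)
    return old_chars > 0 and new_chars < ratio * old_chars


if __name__ == "__main__":
    image = Path("ExcelTest.tif")  # hypothetical name; use the TIF from the zip
    baseline = Path("ExcelTest_text_TesseractV4.txt")  # attached V4.x output
    if shutil.which("tesseract") and image.exists() and baseline.exists():
        old_text = baseline.read_text(errors="replace")
        new_text = run_tesseract(image)
        print("baseline (lines, chars):", summarize(old_text))
        print("current  (lines, chars):", summarize(new_text))
        print("regression suspected:", looks_like_regression(old_text, new_text))
```

Comparing character counts rather than exact text keeps the check tolerant of small recognition differences between versions while still catching a drop from roughly 1 KB to a single line.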
Expected Behavior:
Tesseract 5.2+ should extract at least as much text from this sample TIF as prior versions did.
Suggested Fix:
Correct the text extraction so that it matches the output of previous Tesseract versions. I am concerned about a regression in Tesseract's ability to extract text from Excel files.