Skip to content

Conversation

EnodoGH
Copy link
Contributor

@EnodoGH EnodoGH commented Mar 27, 2025

punct_stripped() may currently produce invalid empty intervals (where end < start) for all-punctuation words. As a result, EqualIgnoringCaseAndTerminalPunct() crashes in this loop due to an unsigned integer underflow in w1end - w1start:

for (unsigned i = 0; i < w1end - w1start; i++) {
if (uchset->to_lower(word1.unichar_id(w1start + i)) !=
uchset->to_lower(word2.unichar_id(w2start + i))) {
return false;
}
}

Bug was introduced with the following commit:


Revision: 97048fe
Author: Stefan Weil sw@weilnetz.de
Date: 09/10/2021 20:50:39
Message:
ccstruct: Fix some signed/unsigned compiler warnings

Remove also a local buffer in function REJMAP::print.


There was no unsigned integer underflow before, because the interval was signed.

@egorpugin egorpugin merged commit d39177e into tesseract-ocr:main Mar 27, 2025
1 check passed
@stweil
Copy link
Member

stweil commented Mar 27, 2025

@egorpugin, please wait a little bit longer before merging non-trivial pull requests. Otherwise there is not enough time for others to review the changes.

I am not sure that this change is the right solution. punct_stripped always could produce intervals where end was smaller than start. Dict::valid_bigram has special handling for start >= end. Now start > end no longer can happen, but those cases now return start == end, so the tests still work.

Therefore the fix is okay.

@stweil
Copy link
Member

stweil commented Mar 27, 2025

@EnodoGH, can you provide a test image which triggers the crash?

@EnodoGH
Copy link
Contributor Author

EnodoGH commented Mar 27, 2025

~ punct_stripped always could produce intervals where end was smaller than start. Dict::valid_bigram has special handling
for start >= end. Now start > end no longer can happen, but those cases now return start == end, so the tests still work.

Therefore the fix is okay.

Yes, I think this is also the right place to do it, because otherwise you continue having different kinds of empty intervals for no good reason. Neither EqualIgnoringCaseAndTerminalPunct() nor Dict::valid_bigram() needs them semantically, which is why it feels a bit artificial to change EqualIgnoringCaseAndTerminalPunct() to expect them.

@EnodoGH
Copy link
Contributor Author

EnodoGH commented Mar 27, 2025

@EnodoGH, can you provide a test image which triggers the crash?

I probably could, but it would be highly non-trivial. It's one of slightly fewer than one million images, which our software takes a few days to process. There's a good chance it's a customer image, for which I'll need to get permission before passing it on to you. In that image, there are multiple calls to the Tesseract library, and I'd have to isolate the exact call and the exact data that was passed to it.

I'll try to do all that if it's too hard for you to reproduce the crash from scratch. For now, all I can tell you is that we reproduced the crash in EqualIgnoringCaseAndTerminalPunct() on two different machines. On mine, I managed to identify punct_stripped() as the reason the crash happened, fixed it, and the crash was gone for good on both machines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants