GetAvailableLanguagesAsVector() can list unrelated files due to invalid extension handling #4416

@danpla

Description

Current Behavior

The file extension handling code in

auto extPos = path.rfind(".traineddata");
if (extPos != std::string::npos) {
  langs->push_back(path.substr(0, extPos));
}
only checks that the file name contains a ".traineddata" substring rather than that it strictly ends with it. As a result, the GetAvailableLanguagesAsVector() method can treat unrelated files (in my case, *.traineddata.sha256) as languages.
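
To make the failure mode concrete, here is a minimal standalone sketch of the same check. The helper name LangFromSubstringMatch is hypothetical and exists only for illustration; it wraps the rfind logic quoted above and returns the extracted name instead of pushing it into a vector.

```cpp
#include <optional>
#include <string>

// Hypothetical helper mirroring the snippet quoted above: it accepts any
// name that contains ".traineddata", not only names that end with it.
std::optional<std::string> LangFromSubstringMatch(const std::string &path) {
  auto extPos = path.rfind(".traineddata");
  if (extPos != std::string::npos) {
    // Everything before the last ".traineddata" is taken as the language,
    // even when more characters follow the match.
    return path.substr(0, extPos);
  }
  return std::nullopt;
}
```

With this logic, both "eng.traineddata" and "eng.traineddata.sha256" yield "eng", so a checksum file sitting next to the model would be reported as a language (and could duplicate the real entry).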

Expected Behavior

No response

Suggested Fix

Use std::filesystem::path::extension().
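
A rough sketch of what that could look like, assuming C++17 is available. The helper name LangFromExtension is hypothetical and this is not Tesseract's actual code; it only illustrates comparing the final extension instead of searching for a substring.

```cpp
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical helper, not Tesseract's actual code: returns the language
// name only when the final extension is exactly ".traineddata".
std::optional<std::string> LangFromExtension(const std::string &path) {
  std::filesystem::path p(path);
  if (p.extension() != ".traineddata") {
    return std::nullopt;  // e.g. "eng.traineddata.sha256" has extension ".sha256"
  }
  // Drop the extension but keep any leading directory component of the input.
  p.replace_extension();
  return p.generic_string();
}
```

Here "eng.traineddata" gives "eng", while "eng.traineddata.sha256" is rejected because extension() only looks at the last dot-separated component. An alternative that avoids std::filesystem would be to compare the tail of the string against ".traineddata" directly.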

tesseract -v

5.5.0

Operating System

No response

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

No response
