Description
Current Behavior
I used tesseract 5.4.1 in WSL/Win10 and tesseract 5.0.1 in GImagereader/Win10 on different image files (a Fraktur newspaper and a Latin-script LibreOffice document, 2 columns, all images in German), and had both tesseract versions create OCR PDF and hOCR output. The OCR PDFs were fine: they were searchable and were displayed in PDF viewers without errors.
After repeatedly failing to create a searchable OCR PDF with hocr-pdf from hocr-tools, I checked the syntax of the generated hOCR files with hocr-check and hocr-spec. Numerous syntax errors were reported, which would explain the failure of hocr-pdf. The PDF produced by hocr-pdf displayed only the image and contained no text layer (pdftotext produced empty files). The PDF viewer also warned that the PDF structure is corrupted (missing streams or streams ending prematurely).
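For anyone trying to narrow this down, a minimal sketch of an independent sanity check on the hOCR output, using only the Python standard library. It is not a replacement for hocr-check: it only tests whether the file is well-formed XML and counts elements per hOCR class. The function name `summarize_hocr` and the file name in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_hocr(text):
    """Parse hOCR text as XML and count elements by hOCR class.

    Hypothetical helper: it checks XML well-formedness only,
    not conformance to the hOCR specification.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        # The file is not well-formed XML; report where parsing stopped.
        return {"well_formed": False, "error": str(exc), "classes": {}}

    counts = Counter()
    for elem in root.iter():
        # hOCR marks its elements with class names such as
        # ocr_page, ocr_line or ocrx_word.
        for cls in elem.get("class", "").split():
            if cls.startswith("ocr"):
                counts[cls] += 1
    return {"well_formed": True, "error": None, "classes": dict(counts)}
```

It could be applied to a file with something like `summarize_hocr(open("out.hocr", encoding="utf-8").read())`. If the result reports well-formed XML with a plausible number of `ocr_page` and `ocrx_word` elements, the errors reported by hocr-check are more likely about hOCR-specific properties than about broken markup.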
Expected Behavior
The generated hOCR files should have correct syntax, so that hocr-pdf can create searchable OCR PDFs from them.
Suggested Fix
None known.
tesseract -v
tesseract -v (instance in WSL)
tesseract 5.4.1
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.18
Operating System
Windows 10
Other Operating System
See above: two different tesseract versions on the same PC:
tesseract 5.4.1 in WSL/Win10 and tesseract 5.0.1 in GImagereader/Win10
uname -a
No response
Compiler
No response
CPU
i5-3570 @3.4 GHz - 8 GB Ram
Virtualization / Containers
WSL/Win10
Other Information
The hocr-check results lead me to suspect that the hOCR output from tesseract has syntax errors, which might be responsible for the failure of hocr-pdf.
Sample hOCR files can be provided.