fix: reuse easyocr models from docling cache dir #743

smoya · 2025-05-15T11:36:20Z

This PR configures the docling parser to reuse the EasyOCR models that are already downloaded (and included in the Docker image) when running OCR on image files.
Without this fix, the models were downloaded a second time into the ~/.EasyOCR/model dir whenever docling needed to run OCR on image files (note that we don't enable OCR for PDF files).

Models downloaded by docling:

❯ ls -la ~/.cache/docling/models/EasyOcr
.rw-r--r--@ 83M smoya 15 May 12:22 craft_mlt_25k.pth
.rw-r--r--@ 15M smoya 15 May 12:22 english_g2.pth
.rw-r--r--@ 15M smoya 15 May 12:22 latin_g2.pth

Models being duplicated by EasyOCR prior to this fix:

❯ ls -la ~/.EasyOCR/model
.rw-r--r--@ 83M smoya 15 May 12:23 craft_mlt_25k.pth
.rw-r--r--@ 15M smoya 15 May 12:23 latin_g2.pth
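
For anyone who wants to check their own machine, the comparison between the two listings above can be scripted. This is only a minimal sketch using the standard library; the helper name `duplicated_models` is made up for illustration, and it assumes the models are plain `.pth` files sitting directly in each directory, as in the listings.

```python
from pathlib import Path


def duplicated_models(cache_dir: Path, easyocr_dir: Path) -> list[str]:
    """Return the names of .pth files present in both directories.

    Hypothetical helper for illustration; it only compares file names,
    not file contents.
    """
    # A missing directory means nothing can be duplicated.
    if not cache_dir.is_dir() or not easyocr_dir.is_dir():
        return []
    cached = {p.name for p in cache_dir.glob("*.pth")}
    downloaded = {p.name for p in easyocr_dir.glob("*.pth")}
    return sorted(cached & downloaded)
```

Calling it with `Path.home() / ".cache/docling/models/EasyOcr"` and `Path.home() / ".EasyOCR/model"` should return an empty list once the fix is in place.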

I wonder whether we also need to include models for other languages in advance, as I can see we only download the English and Latin ones.

smoya · 2025-05-15T11:48:23Z

projects/pgai/tests/vectorizer/cli/test_vectorizer_document.py

+        assert "Use cases include providing chatbot" in chunks_str
+
+        # electromagnetic_radiation.docx
+        assert "All forms of EMR travel at the speed of light in a vacuum" in chunks_str


We could remove this test and save some KB, but I added it to prove that no extra models are downloaded during parsing.
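
The "no extra models downloaded" property could also be asserted directly rather than inferred from a passing parse. The following is a rough sketch, not what the test in this PR does: `list_files` and `assert_no_new_files` are hypothetical helpers, and the directory to watch (for example ~/.EasyOCR/model) is an assumption.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


def list_files(directory: Path) -> set[str]:
    """Return relative paths of all files under directory (empty if it is missing)."""
    if not directory.is_dir():
        return set()
    return {str(p.relative_to(directory)) for p in directory.rglob("*") if p.is_file()}


@contextmanager
def assert_no_new_files(directory: Path) -> Iterator[None]:
    """Fail if the wrapped block creates any file under directory."""
    before = list_files(directory)
    yield
    created = list_files(directory) - before
    assert not created, f"unexpected files in {directory}: {sorted(created)}"
```

A test would wrap the parsing call in `with assert_no_new_files(Path.home() / ".EasyOCR/model"):` so that a fresh download shows up as a test failure.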

Askir · 2025-05-15T11:50:25Z

projects/pgai/pgai/vectorizer/parsing.py

+        basic_pipeline_options = PdfPipelineOptions(
+            do_ocr=False,  # we do not want to do OCR in PDF (yet)
+            artifacts_path=self.cache_dir if os.path.isdir(self.cache_dir) else None,
+        )  # pyright: ignore[reportCallIssue]
+
+        with_ocr_pipeline_options = basic_pipeline_options
+        with_ocr_pipeline_options.do_ocr = True


Should we actually enable OCR in PDFs? I thought it was enabled, tbh.

smoya

We decided to leave this option disabled because it makes PDF parsing much slower, and it is not clear that the user intends to run OCR on the images inside a PDF.
I think we should make it configurable from the parser when creating a vectorizer. Whether it is disabled or enabled by default is a discussion we can pick up again.
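
A side note on the hunk above: in Python, `with_ocr_pipeline_options = basic_pipeline_options` binds a second name to the same object, so setting `do_ocr = True` through one name is visible through the other. Whether that matters here depends on how the two variables are used in code not shown in this hunk. A minimal sketch of keeping the two option sets independent follows; it uses a hypothetical `PipelineOptions` dataclass as a stand-in, not docling's `PdfPipelineOptions`, and `build_options` is an illustrative name.

```python
import os
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class PipelineOptions:
    """Hypothetical stand-in for an options object; not docling's class."""

    do_ocr: bool = False
    artifacts_path: Optional[str] = None


def build_options(cache_dir: str) -> tuple[PipelineOptions, PipelineOptions]:
    """Build separate option objects for the non-OCR and OCR pipelines."""
    basic = PipelineOptions(
        do_ocr=False,
        # Only point at the cache dir if it exists, mirroring the hunk above.
        artifacts_path=cache_dir if os.path.isdir(cache_dir) else None,
    )
    # replace() returns a new object, so basic.do_ocr stays False.
    with_ocr = replace(basic, do_ocr=True)
    return basic, with_ocr
```

For an options class that is not a dataclass, `copy.deepcopy` on the basic options before flipping `do_ocr` would serve the same purpose.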

JamesGuthrie

I'd suggest renaming this commit. Based on the actual changes, it looks like it should be something like: feat: use ocr on images?

JamesGuthrie

Ignore my previous comment, LGTM.

fix: reuse easyocr models from docling cache dir

6f3f787

smoya requested a review from a team as a code owner May 15, 2025 11:36

smoya temporarily deployed to internal-contributors May 15, 2025 11:36 — with GitHub Actions Inactive

smoya mentioned this pull request May 15, 2025

test: add image ocr test case #733

Closed

smoya commented May 15, 2025

Askir reviewed May 15, 2025

JamesGuthrie approved these changes May 15, 2025

test: ensure no extra models were downloaded during document parsing

54719d1

smoya temporarily deployed to internal-contributors May 15, 2025 12:29 — with GitHub Actions Inactive

smoya merged commit 647985e into main May 15, 2025
14 checks passed

smoya deleted the sergio/reuse-ocr-models branch May 15, 2025 12:45

github-actions bot mentioned this pull request May 15, 2025

chore(main): release pgai 0.10.4 #744

Merged
