Skip to content

Conversation

smoya
Copy link
Contributor

@smoya smoya commented May 15, 2025

This PR configures docling parser for reusing the EasyOCR models downloaded (included in the docker image) when running OCR in image files.
Without this fix, models were being downloaded again (duplicated) in the ~/.EasyOCR/model dir whenever docling needed to run OCR on image files (note that we don't enable OCR in PDF files).

Models downloaded by docling:

❯ ls -la ~/.cache/docling/models/EasyOcr
.rw-r--r--@ 83M smoya 15 May 12:22 craft_mlt_25k.pth
.rw-r--r--@ 15M smoya 15 May 12:22 english_g2.pth
.rw-r--r--@ 15M smoya 15 May 12:22 latin_g2.pth

Models being duplicated by EasyOCR prior to this fix:

❯ ls -la ~/.EasyOCR/model
.rw-r--r--@ 83M smoya 15 May 12:23 craft_mlt_25k.pth
.rw-r--r--@ 15M smoya 15 May 12:23 latin_g2.pth

Wondering if we would need to also include models for other languages in advance, as I can see we only download english and latin.

@smoya smoya requested a review from a team as a code owner May 15, 2025 11:36
@smoya smoya temporarily deployed to internal-contributors May 15, 2025 11:36 — with GitHub Actions Inactive
assert "Use cases include providing chatbot" in chunks_str

# electromagnetic_radiation.docx
assert "All forms of EMR travel at the speed of light in a vacuum" in chunks_str
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this test and save some KB but I added it in order to prove no extra models were being downloaded during the parsing.

Comment on lines +121 to +127
basic_pipeline_options = PdfPipelineOptions(
do_ocr=False, # we do not want to do OCR in PDF (yet)
artifacts_path=self.cache_dir if os.path.isdir(self.cache_dir) else None,
) # pyright: ignore[reportCallIssue]

with_ocr_pipeline_options = basic_pipeline_options
with_ocr_pipeline_options.do_ocr = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we actually enable ocr in pdfs? I thought it was enabled tbh

Copy link
Contributor Author

@smoya smoya May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to leave this option disabled due to the fact it makes PDF parsing process way longer and it is not very clear the user intent is to run ocr on the images inside the PDF.
I think we should make it configurable from the parser when creating a vectorizer. Either keep it disabled by default or enabled, still can be a discussion we want to retake.

Copy link
Member

@JamesGuthrie JamesGuthrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to rename the title of this commit. Based on the actual changes it looks like it should be something like: feat: use ocr on images?

Copy link
Member

@JamesGuthrie JamesGuthrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore my previous comment, LGTM.

@smoya smoya temporarily deployed to internal-contributors May 15, 2025 12:29 — with GitHub Actions Inactive
@smoya smoya merged commit 647985e into main May 15, 2025
14 checks passed
@smoya smoya deleted the sergio/reuse-ocr-models branch May 15, 2025 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants