-
Notifications
You must be signed in to change notification settings - Fork 274
fix: reuse easyocr models from docling cache dir #743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
assert "Use cases include providing chatbot" in chunks_str | ||
|
||
# electromagnetic_radiation.docx | ||
assert "All forms of EMR travel at the speed of light in a vacuum" in chunks_str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this test and save some KB but I added it in order to prove no extra models were being downloaded during the parsing.
basic_pipeline_options = PdfPipelineOptions( | ||
do_ocr=False, # we do not want to do OCR in PDF (yet) | ||
artifacts_path=self.cache_dir if os.path.isdir(self.cache_dir) else None, | ||
) # pyright: ignore[reportCallIssue] | ||
|
||
with_ocr_pipeline_options = basic_pipeline_options | ||
with_ocr_pipeline_options.do_ocr = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we actually enable ocr in pdfs? I thought it was enabled tbh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to leave this option disabled due to the fact it makes PDF parsing process way longer and it is not very clear the user intent is to run ocr on the images inside the PDF.
I think we should make it configurable from the parser when creating a vectorizer. Either keep it disabled by default or enabled, still can be a discussion we want to retake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to rename the title of this commit. Based on the actual changes it looks like it should be something like: feat: use ocr on images
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore my previous comment, LGTM.
This PR configures docling parser for reusing the EasyOCR models downloaded (included in the docker image) when running OCR in image files.
Without this fix, models were being downloaded again (duplicated) in the
~/.EasyOCR/model
dir whenever docling needed to run OCR on image files (note that we don't enable OCR in PDF files).Models downloaded by docling:
Models being duplicated by EasyOCR prior to this fix:
❯ ls -la ~/.EasyOCR/model .rw-r--r--@ 83M smoya 15 May 12:23 craft_mlt_25k.pth .rw-r--r--@ 15M smoya 15 May 12:23 latin_g2.pth
Wondering if we would need to also include models for other languages in advance, as I can see we only download english and latin.