Description
Is your feature request related to a problem? Please describe.
It would be nice to be able to extract text from images embedded in PDFs.
Describe the solution you'd like
The ability to extract text from images in PDFs, for example when the PDF is a slide deck made up of images. This could be configured with a toggle switch or a list so that it does not run by default, since running OCR in addition to text extraction will likely be computationally expensive.
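To make the opt-in idea concrete, here is a minimal sketch of what such a setting could look like. Every name in it (`PdfExtractionOptions`, `ocr_embedded_images`, `ocr_languages`, `planned_steps`) is hypothetical and not taken from the existing code base; it only illustrates OCR being off unless explicitly enabled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PdfExtractionOptions:
    """Hypothetical settings object; these names are assumptions."""

    # Off by default, because OCR is expected to be expensive.
    ocr_embedded_images: bool = False
    # The "list" variant of the setting: which OCR languages to try.
    ocr_languages: List[str] = field(default_factory=lambda: ["eng"])


def planned_steps(options: PdfExtractionOptions) -> List[str]:
    """Return the extraction steps that would run for the given options."""
    steps = ["text_layer"]  # the existing text extraction always runs
    if options.ocr_embedded_images:
        steps.append("image_ocr")  # only added when the user opts in
    return steps
```

With the defaults, only the existing text extraction would run; passing `PdfExtractionOptions(ocr_embedded_images=True)` adds the OCR step.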
Describe alternatives you've considered
https://evermap.com/Tutorial_ABM_OCR.asp describes a way to OCR documents with Adobe Acrobat. I believe this can also be done with tools like Readiris, which OCR in multiple languages.
Additional context
Some PDFs contain diagrams or other images with text in them that would be useful to extract. We already have OCR support for images, so one approach would be to extract the images from the PDF, run OCR on them, and then combine the output with the existing text extraction results.
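A rough sketch of that combination step, assuming the PDF has already been split into per-page text and embedded image bytes. The function name `extract_with_ocr`, the `Page` shape, and the `ocr_image` callback are all hypothetical; `ocr_image` stands in for whatever the existing image OCR support exposes.

```python
from typing import Callable, Iterable, List, Tuple

# One page = (text from the PDF text layer, raw bytes of each embedded image).
# This shape is an assumption made for the sketch.
Page = Tuple[str, List[bytes]]


def extract_with_ocr(
    pages: Iterable[Page],
    ocr_image: Callable[[bytes], str],
    ocr_enabled: bool = False,
) -> str:
    """Merge text-layer output with OCR output, page by page."""
    parts: List[str] = []
    for text_layer, images in pages:
        # Keep the existing text extraction result for the page, if any.
        if text_layer.strip():
            parts.append(text_layer.strip())
        # Run OCR on embedded images only when the user has opted in.
        if ocr_enabled:
            for image in images:
                recognized = ocr_image(image).strip()
                if recognized:
                    parts.append(recognized)
    return "\n".join(parts)
```

Placing each image's OCR text directly after its page's text layer keeps the result in rough reading order. A slide deck made only of images would have empty text layers and would yield OCR text alone.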