[Feature request] Run OCR on images in PDFs to extract text #20

**Is your feature request related to a problem? Please describe.**
It would be nice to be able to extract text from images embedded in PDFs.

**Describe the solution you'd like**
Add the ability to extract text from images in PDFs, for example when the PDF is a slide deck made up of images. This could be configurable, via a toggle switch or a list, so that it does not run by default, since running both text extraction and OCR will likely be computationally expensive.

**Describe alternatives you've considered**
https://evermap.com/Tutorial_ABM_OCR.asp describes a way to OCR documents with Adobe Acrobat. I believe this can also be done with tools like Readiris, which can OCR in multiple languages.

**Additional context**
Some PDFs contain diagrams or other images with text in them that would be useful to extract. We already have OCR support for images, so one approach would be to extract the images from the PDF, run OCR on them, and then combine the results with the existing text extraction output.
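
A rough sketch of how the pieces could fit together, to make the request concrete. Every name here is hypothetical: `PdfExtractionOptions`, `extract_pdf_text`, and the three callables (`extract_text`, `extract_images`, `ocr_image`) are stand-ins for whatever the project's existing PDF text extractor, an image extraction step, and the existing image OCR actually look like. It only illustrates the opt-in toggle and how the results might be merged.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class PdfExtractionOptions:
    # Hypothetical toggle: off by default because OCR is likely expensive.
    ocr_images: bool = False


def extract_pdf_text(
    pdf_path: str,
    extract_text: Callable[[str], str],
    extract_images: Callable[[str], Iterable[bytes]],
    ocr_image: Callable[[bytes], str],
    options: Optional[PdfExtractionOptions] = None,
) -> str:
    """Combine regular PDF text extraction with optional OCR of embedded images.

    The three callables are assumptions standing in for the project's
    existing text extraction, an image extraction step, and image OCR.
    """
    options = options or PdfExtractionOptions()

    # Always start with the existing text extraction result.
    parts: List[str] = [extract_text(pdf_path)]

    # Only touch the images when the caller has opted in.
    if options.ocr_images:
        for image in extract_images(pdf_path):
            text = ocr_image(image).strip()
            if text:  # skip images that contain no recognizable text
                parts.append(text)

    # Join non-empty pieces, separated by a blank line.
    return "\n\n".join(part for part in parts if part)
```

With `ocr_images` left at its default, the image extraction and OCR callables are never invoked, so existing behaviour and cost stay unchanged unless the option is switched on.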

