Pages that failed to scan end up missing entirely from the index - should have rows with blank text instead #23

@simonw

Original title: s3-ocr index not catching every page - 84 out of 102

Spotted while working on:

This command runs against a bucket containing 102 PDFs, all of which have been OCRed:

s3-ocr index s3-ocr-many-pdfs /tmp/many.db

The resulting DB looks like this:

(s3-ocr) s3-ocr % sqlite-utils tables --counts /tmp/many.db 
[{"table": "pages", "count": 84},
 {"table": "pages_fts", "count": 84},
 {"table": "pages_fts_data", "count": 8},
 {"table": "pages_fts_idx", "count": 6},
 {"table": "pages_fts_docsize", "count": 84},
 {"table": "pages_fts_config", "count": 1},
 {"table": "ocr_jobs", "count": 102},
 {"table": "fetched_jobs", "count": 102}]

The pages table should have 102 records in it, not 84.
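One way to narrow this down is to list the keys that have an OCR job but no row in pages, then backfill blank rows for them, as the issue title suggests. The following is only a rough sketch against an in-memory database. The column names (`ocr_jobs.key`, `pages.path`, `pages.page`, `pages.text`) are assumptions about the schema, and both helper functions are hypothetical, not part of s3-ocr.

```python
import sqlite3


def find_keys_without_pages(conn):
    """Return keys that have an OCR job but no row at all in pages.

    Assumes ocr_jobs.key and pages.path hold the same S3 key.
    """
    rows = conn.execute(
        """
        select ocr_jobs.key
        from ocr_jobs
        left join pages on pages.path = ocr_jobs.key
        where pages.path is null
        order by ocr_jobs.key
        """
    ).fetchall()
    return [row[0] for row in rows]


def backfill_blank_pages(conn, keys):
    """Insert a page 1 row with empty text for each key (hypothetical fix)."""
    conn.executemany(
        "insert into pages (path, page, text) values (?, 1, '')",
        [(key,) for key in keys],
    )
    conn.commit()


# Tiny stand-in database using the assumed schema, not the real /tmp/many.db.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table ocr_jobs (key text primary key, job_id text);
    create table pages (
        path text,
        page integer,
        text text,
        primary key (path, page)
    );
    """
)
conn.executemany(
    "insert into ocr_jobs (key, job_id) values (?, ?)",
    [("a.pdf", "job-1"), ("b.pdf", "job-2"), ("c.pdf", "job-3")],
)
conn.executemany(
    "insert into pages (path, page, text) values (?, ?, ?)",
    [("a.pdf", 1, "hello"), ("c.pdf", 1, "world")],
)

missing = find_keys_without_pages(conn)  # b.pdf has a job but no page row
backfill_blank_pages(conn, missing)
```

If the real schema matches these assumptions, running the same query against /tmp/many.db should return the 18 keys that account for the gap between 102 and 84.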

Labels: bug