Manual reprocess required for PDF screenshots to work #1096

@vhsdream

Describe the Bug

I've just cloned Main to test updating to v0.23 for the Proxmox LXC version. I'm not sure whether this is a bug or I'm just being impatient, but when I add a PDF, the OCR (text extraction) job runs and the screenshot generation job does not. The PDFs look like this:
Image

Then I go into the Admin settings and trigger a reprocess, and the screenshot is generated:
Image

Steps to Reproduce

  1. Install/update Hoarder to latest, based on Main
  2. Add a PDF
  3. Wait and refresh the page
  4. Trigger a manual reprocess job from the Admin settings and see that the screenshot is now generated

Expected Behaviour

Unless I'm misunderstanding how it's supposed to work, I expected that adding a PDF would start multiple jobs: at least one for the OCR and another for the screenshot generation.
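To make the expectation concrete, here is a minimal TypeScript sketch of a preprocessing pass in which text extraction and screenshot generation are checked independently, so one pass over a fresh PDF performs both. Every name in it (`PdfAssetState`, `PreprocessSteps`, `preprocessPdf`) is hypothetical and chosen for illustration; this is not Hoarder's actual worker code.

```typescript
// Hypothetical model of what has been produced for a PDF so far.
// This is an illustration only, not Hoarder's real schema.
interface PdfAssetState {
  extractedText: string | null;
  screenshotAssetId: string | null;
}

// Hypothetical step implementations, injected so the sketch stays self-contained.
interface PreprocessSteps {
  extractText: () => string;
  generateScreenshot: () => string;
}

// Runs every step whose output is still missing and returns the names of the
// steps that actually ran. The two checks are independent: finishing text
// extraction does not end the pass before the screenshot check.
function preprocessPdf(state: PdfAssetState, steps: PreprocessSteps): string[] {
  const ran: string[] = [];
  if (state.extractedText === null) {
    state.extractedText = steps.extractText();
    ran.push("text");
  }
  if (state.screenshotAssetId === null) {
    state.screenshotAssetId = steps.generateScreenshot();
    ran.push("screenshot");
  }
  return ran;
}
```

Under this model, the first pass over a newly added PDF runs both steps, and a later reprocess skips both because their outputs already exist.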

Screenshots or Additional Context

I'm not using the Docker version, but a Proxmox LXC install using the script we created.

I've broken the log output into sections to mark which events follow each of my actions; nothing has been left out.

The new dependencies are installed:

root@hoarder-v023:~# dpkg -l | grep ghostscript
ii  ghostscript                     10.0.0~dfsg-11+deb12u6              amd64        interpreter for the PostScript language and for PDF
root@hoarder-v023:~# dpkg -l | grep graphicsmagick
ii  graphicsmagick                  1.4+really1.3.40-4                  amd64        collection of image processing tools
ii  libgraphicsmagick-q16-3         1.4+really1.3.40-4                  amd64        format-independent image processing - C shared library

Adding a PDF:

Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.350Z info: [Crawler] Connecting to existing browser instance: http://127.0.0.1:9222
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.350Z info: [Crawler] Successfully resolved IP address, new address: http://127.0.0.1:9222/
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting crawler worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting inference worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting search indexing worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting tidy assets worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting video worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.400Z info: Starting feed worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.400Z info: Starting asset preprocessing worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.400Z info: Starting webhook worker ...
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.583Z info: [Crawler][69] Will crawl "https://getsamplefiles.com/download/pdf/sample-1.pdf" for link with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.584Z info: [Crawler][69] Attempting to determine the content-type for the url https://getsamplefiles.com/download/pdf/sample-1.pdf
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.627Z info: [webhook][71] Starting a webhook job for bookmark with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.627Z info: [webhook][71] Completed successfully
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.634Z info: [search][70] Attempting to index bookmark with id df6uiumrmz6mluy1bq9r26zf ...
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.704Z info: [search][70] Completed successfully
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.710Z info: [Crawler][69] Content-type for the url https://getsamplefiles.com/download/pdf/sample-1.pdf is "application/pdf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.710Z info: [Crawler][69] Downloading pdf from "https://getsamplefiles.com/download/pdf/sample-1.pdf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.731Z info: [Crawler][69] Downloaded pdf as assetId: 3ff23b35-95c1-444c-b6a9-73146ce01a44
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.742Z info: [Crawler][69] Completed successfully
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.639Z info: [assetPreprocessing][72] Starting an asset preprocessing job for bookmark with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.642Z info: [assetPreprocessing][72] Attempting to extract text from pdf.
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: Warning: Setting up fake worker.
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.840Z info: [assetPreprocessing][72] Extracted 2212 characters from pdf.
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.850Z info: [assetPreprocessing][72] Completed successfully
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.629Z debug: [inference][73] No inference client configured, nothing to do now
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.629Z info: [inference][73] Completed successfully
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.789Z info: [search][74] Attempting to index bookmark with id df6uiumrmz6mluy1bq9r26zf ...
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.924Z info: [search][74] Completed successfully

Manually triggering a reprocessing job from the Admin console:

Mar 06 19:09:06 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:06.948Z info: [assetPreprocessing][75] Starting an asset preprocessing job for bookmark with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:09:06 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:06.952Z info: [assetPreprocessing][75] Skipping PDF text extraction as it's already been extracted.
Mar 06 19:09:06 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:06.952Z info: [assetPreprocessing][75] Attempting to generate PDF screenshot for bookmarkId: df6uiumrmz6mluy1bq9r26zf
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.293Z info: [assetPreprocessing][75] Successfully saved PDF screenshot to database
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.295Z info: [assetPreprocessing][75] Completed successfully
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.298Z info: [assetPreprocessing][76] Starting an asset preprocessing job for bookmark with id "jw045iecs73tcp72cs90xtwz"
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.299Z info: [assetPreprocessing][76] Skipping PDF text extraction as it's already been extracted.
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.299Z info: [assetPreprocessing][76] Skipping PDF screenshot generation as it's already been generated.
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.299Z info: [assetPreprocessing][76] Completed successfully
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.727Z debug: [inference][77] No inference client configured, nothing to do now
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.728Z info: [inference][77] Completed successfully
Mar 06 19:09:08 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:08.032Z info: [search][78] Attempting to index bookmark with id df6uiumrmz6mluy1bq9r26zf ...
Mar 06 19:09:08 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:08.106Z info: [search][78] Completed successfully
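For comparing the two log sections above, a small TypeScript helper can group the messages by job id and show which jobs mention the PDF screenshot at all. This is a rough sketch: the line pattern (`[worker][jobId] message`) is inferred from the log output in this report, and the function names are made up for illustration.

```typescript
interface JobSummary {
  worker: string;
  jobId: string;
  messages: string[];
}

// Groups log messages by job id. Only lines of the form
// "... [worker][jobId] message" are considered; startup lines such as
// "[Crawler] Connecting to ..." carry no job id and are skipped.
function summarizeJobs(log: string): Record<string, JobSummary> {
  const pattern = /\[([A-Za-z]+)\]\[(\d+)\] (.*)$/;
  const jobs: Record<string, JobSummary> = {};
  for (const line of log.split("\n")) {
    const match = pattern.exec(line);
    if (!match) continue;
    const jobId = match[2];
    let job: JobSummary | undefined = jobs[jobId];
    if (!job) {
      job = { worker: match[1], jobId, messages: [] };
      jobs[jobId] = job;
    }
    job.messages.push(match[3]);
  }
  return jobs;
}

// True if any message of the job refers to the PDF screenshot step,
// whether it was attempted, saved, or skipped.
function mentionsScreenshot(job: JobSummary): boolean {
  return job.messages.some((m) => m.indexOf("PDF screenshot") !== -1);
}

// Two lines copied from the logs in this report.
const sampleLog = [
  'Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.642Z info: [assetPreprocessing][72] Attempting to extract text from pdf.',
  'Mar 06 19:09:06 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:06.952Z info: [assetPreprocessing][75] Attempting to generate PDF screenshot for bookmarkId: df6uiumrmz6mluy1bq9r26zf',
].join("\n");

const sampleJobs = summarizeJobs(sampleLog);
console.log(mentionsScreenshot(sampleJobs["72"])); // false: job 72 only extracted text
console.log(mentionsScreenshot(sampleJobs["75"])); // true: job 75 attempted the screenshot
```

Run against the full logs, this would show that job 72 (the automatic one) never reaches the screenshot step, while jobs 75 and 76 (from the manual reprocess) both do.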

Device Details

Firefox (latest) on Arch Linux

Exact Hoarder Version

Pulled from Main

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

Metadata

Labels

  • bug: Something isn't working
  • pri/high: High priority issue
  • status/approved: This issue is ready to be implemented
