[BUG] Inference task that takes a really long time to pre-process (>10 minutes) caused by slow tokenization #1622

@pdc1

Description

Describe the Bug

I have a bookmark that triggers an inference task that never finishes: a node process stays stuck at 100% of one processor, and the task never appears to time out. To recover, I have to delete queue.db and restart the container.

The bookmark was originally imported in 0.24.1, which did not have this issue. My guess is that the error is in the new tokenization and that the exception is not properly caught by the processing thread, but I have not confirmed this.
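If slow tokenization of a very large page is the cause, one possible mitigation is to cap the text by character count before it reaches the tokenizer, so the tokenizer's cost is bounded regardless of page size. The sketch below is only an illustration: `capForTokenizer` and the 4-characters-per-token ratio are assumptions made for this example, not Karakeep's actual code or settings.

```typescript
// Hypothetical helper (not part of Karakeep): cap the input by characters
// before handing it to a potentially slow tokenizer.
function capForTokenizer(
  text: string,
  maxTokens: number,
  charsPerToken = 4, // assumed rough average; tune for the real tokenizer
): string {
  if (maxTokens <= 0) return "";
  const maxChars = Math.floor(maxTokens * charsPerToken);
  if (text.length <= maxChars) return text;

  // Do not cut between the two halves of a surrogate pair.
  let end = maxChars;
  const last = text.charCodeAt(end - 1);
  if (last >= 0xd800 && last <= 0xdbff) end -= 1;
  return text.slice(0, end);
}
```

The capped string can still be tokenized precisely afterwards; the cap only guarantees that the expensive step never sees hundreds of kilobytes of input.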

The logs look like this:

2025-06-16T17:24:48.341Z info: [Crawler][11] Will crawl "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/" for link with id "a6iubzxc6diiqp1a005jsgxe"
2025-06-16T17:24:48.342Z info: [Crawler][11] Attempting to determine the content-type for the url https://www.xda-developers.com/games-that-justify-ray-tracing-tax/
2025-06-16T17:24:48.635Z info: [Crawler][11] Content-type for the url https://www.xda-developers.com/games-that-justify-ray-tracing-tax/ is "text/html; charset=UTF-8"
2025-06-16T17:24:51.800Z info: [Crawler][11] Successfully navigated to "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/". Waiting for the page to load ...
2025-06-16T17:24:53.581Z info: [Crawler][11] Finished waiting for the page to load.
2025-06-16T17:24:53.891Z info: [Crawler][11] Successfully fetched the page content.
2025-06-16T17:24:54.665Z info: [Crawler][11] Finished capturing page content and a screenshot. FullPageScreenshot: false
2025-06-16T17:24:54.676Z info: [Crawler][11] Will attempt to extract metadata from page ...
2025-06-16T17:25:02.020Z info: [Crawler][11] Will attempt to extract readable content ...
Error: Could not parse CSS stylesheet
    at exports.createStylesheet (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/helpers/stylesheets.js:37:21)
    at HTMLStyleElementImpl._updateAStyleBlock (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:68:5)
    at HTMLStyleElementImpl._poppedOffStackOfOpenElements (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:42:10)
    at JSDOMParse5Adapter.onItemPop (/app/apps/workers/node_modules/.pnpm/jsdom@24.0.0/node_modules/jsdom/lib/jsdom/browser/parser/html.js:175:43)
    at Parser.onItemPop (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:158:90)
    at OpenElementStack.pop (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/open-element-stack.js:89:22)
    at endTagInText (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:2287:20)
    at Parser._endTagOutsideForeignContent (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:931:17)
    at Parser.onEndTag (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/parser/index.js:897:18)
    at Tokenizer.emitCurrentTagToken (/app/apps/workers/node_modules/.pnpm/parse5@7.1.2/node_modules/parse5/dist/cjs/tokenizer/index.js:402:26) 
            .responsive-img{position:relative;overflow:hidden}.responsive-img img{position:absolute;top:0;left:0;width:100%;height:100%}

[... 800KB of CSS data(!) omitted...]

 .swiper-slide-shadow-bottom,.swiper-container-flip .swiper-slide-shadow-left,.swiper-container-flip .swiper-slide-shadow-right,.swiper-container-flip .swiper-slide-shadow-top{z-index:0;backface-visibility:hidden}.swiper-container-coverflow .swiper-wrapper{-ms-perspective:1200px}
        
2025-06-16T17:25:07.227Z info: [Crawler][11] Done extracting readable content.
2025-06-16T17:25:07.285Z info: [Crawler][11] Stored the screenshot as assetId: cca04d75-bf3b-4d11-8287-860d0281c22e
2025-06-16T17:25:07.394Z info: [Crawler][11] Done extracting metadata from the page.
2025-06-16T17:25:07.394Z info: [Crawler][11] Downloading image from "https://static1.xdaimages.com/wordpress/wp-content/uploads/2024/05/awii_launch_049.png"
2025-06-16T17:25:08.275Z info: [Crawler][11] Downloaded image as assetId: f4e149a7-1e83-4d86-adb6-c77418e2b322
2025-06-16T17:25:08.408Z info: [Crawler][11] Completed successfully
2025-06-16T17:25:09.299Z info: [inference][12] Starting an inference job for bookmark with id "a6iubzxc6diiqp1a005jsgxe"

Steps to Reproduce

  1. Add a bookmark for "https://www.xda-developers.com/games-that-justify-ray-tracing-tax/"
  2. I think that's it?

Expected Behaviour

Content is processed or produces an error.
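To illustrate what "produces an error" could look like for CPU-bound work, here is a minimal sketch of a cooperative deadline: the job is split into chunks and the elapsed time is checked between chunks, so a runaway job fails with an error instead of spinning forever. All names here (`runWithDeadline`, `DeadlineExceededError`) are hypothetical and are not Karakeep's API.

```typescript
// Hypothetical sketch (not Karakeep's code): fail a long-running job with
// an error once it exceeds a time budget.
class DeadlineExceededError extends Error {
  constructor(elapsedMs: number, limitMs: number) {
    super(`job exceeded ${limitMs} ms (ran ${elapsedMs} ms)`);
    this.name = "DeadlineExceededError";
  }
}

function runWithDeadline<T, R>(
  chunks: readonly T[],
  handle: (chunk: T) => R,
  limitMs: number,
  now: () => number = Date.now, // injectable clock, handy for tests
): R[] {
  const start = now();
  const results: R[] = [];
  for (const chunk of chunks) {
    // The check runs between chunks, so each chunk must be small enough
    // to finish quickly on its own.
    const elapsed = now() - start;
    if (elapsed > limitMs) throw new DeadlineExceededError(elapsed, limitMs);
    results.push(handle(chunk));
  }
  return results;
}
```

This only helps when the work can be chunked. A single blocking call cannot be interrupted from the same thread in Node, so that case would need the work moved to a worker thread or a separate process that can be terminated.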

Screenshots or Additional Context

I can provide the full log file if it will help.

Device Details

No response

Exact Karakeep Version

0.25.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), pri/high (High priority issue), status/approved (This issue is ready to be implemented)
