
Conversation

@MohamedBassem
Collaborator

No description provided.

Deploying hoarder-docs with Cloudflare Pages

Latest commit: 1b52710
Status: ✅  Deploy successful!
Preview URL: https://5bd08d4e.hoarder.pages.dev
Branch Preview URL: https://bookmark-embeddings.hoarder.pages.dev

Deploying hoarder-landing with Cloudflare Pages

Latest commit: 1b52710
Status: ✅  Deploy successful!
Preview URL: https://7b4c917a.hoarder-landing.pages.dev
Branch Preview URL: https://bookmark-embeddings.hoarder-landing.pages.dev


@thiswillbeyourgithub
Contributor

Just linking #1315 here, as I think it might be better to make it part of the first release that contains embeddings; otherwise we'll have to deal with resetting all the embeddings when switching binary embeddings on or off.

@thiswillbeyourgithub
Contributor

I hope you won't mind, but I was wondering what the blockers are for this much-awaited feature: what is the state of the code, what questions are left to answer, what can the community do to help, etc.

Because it's a major change, I'm afraid no one other than the maintainer would want to tackle it, so I think communication here would go a long way toward letting us help you without duplicating too much effort.

@MohamedBassem
Collaborator Author

@thiswillbeyourgithub So generating the embeddings currently works. The biggest blocker for this (besides my time) is choosing which vector storage service to use. I was initially planning on sqlite-vec, but the project has been in pre-alpha for a while and it's not clear to me whether it'll exit that status. Meilisearch is another candidate; however, for a long period of time the Meilisearch version in karakeep was the one before the addition of vector search, so a lot of users will have to upgrade Meilisearch to make use of embeddings. I'll need to check if there's a backward-compatible way of doing this.

Another caveat is that a lot of Ollama users don't have an embeddings model set, so I'll need a way to communicate that as well.

Nothing major left, but a lot of tiny things for a smooth rollout of such a feature.
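
To give a rough idea of what the sqlite-vec route could look like from TypeScript (a minimal sketch assuming the sqlite-vec npm bindings with better-sqlite3; the table name, column name, and dimension count are made up for illustration, not what karakeep would necessarily use):

```ts
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";

// Hypothetical schema: one row per bookmark, 768-dim float vectors.
const db = new Database("karakeep.db");
sqliteVec.load(db); // load the sqlite-vec extension into this connection

db.exec(
  "CREATE VIRTUAL TABLE IF NOT EXISTS bookmark_embeddings USING vec0(embedding float[768])",
);

// Store an embedding, serialized as a raw float32 buffer.
const vec = new Float32Array(768); // ...filled with the model's output
db.prepare("INSERT INTO bookmark_embeddings(rowid, embedding) VALUES (?, ?)").run(
  42,
  Buffer.from(vec.buffer),
);

// KNN search: the 10 bookmarks closest to a query vector.
const rows = db
  .prepare(
    "SELECT rowid, distance FROM bookmark_embeddings WHERE embedding MATCH ? ORDER BY distance LIMIT 10",
  )
  .all(Buffer.from(vec.buffer));
```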

@thiswillbeyourgithub
Contributor

thiswillbeyourgithub commented May 25, 2025

I don't think the lack of an embedding model in most users' setups is an issue, because Ollama use is opt-in. As long as you haven't set an embedding model, you won't have embedding search, and that's fine.

FWIW: on Ollama I've been happy with Arctic Embed 2 with quantization (not enabled by default, but it ends up as a ~450 MB model) link. I would gladly contribute my Modelfile to help others set this up.
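
As an aside, here's roughly what the embedding call against a local Ollama instance looks like from TypeScript; the model tag and the dimension count are just assumptions from my own setup, so adjust them to whichever model you pulled:

```ts
// Rough sketch: requesting an embedding from Ollama's /api/embeddings endpoint.
// "snowflake-arctic-embed2" is an assumed model tag, not a recommendation baked
// into karakeep.
const res = await fetch("http://localhost:11434/api/embeddings", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "snowflake-arctic-embed2",
    prompt: "text of the bookmark to embed",
  }),
});
const { embedding } = (await res.json()) as { embedding: number[] };
console.log(embedding.length); // e.g. 1024 dimensions for Arctic Embed 2
```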

Personally, my money is on Meilisearch. If you don't mind (i.e. unless you tell me shortly not to), I'll take a look at their repo and maybe ask them directly for guidance.

Also, I'd say that for a smooth rollout it might be good to do a pre-release/alpha/beta for a few weeks or a month beforehand to iron out the mishaps.

Edit: I took a look at Meilisearch: what do you think of a Docker one-liner using Alpine to execute an rm command on the Meilisearch db, then asking the user to force reindexing the db in karakeep?

Edit: after reading issue 2570 of Meilisearch (not linking to avoid making a fuss), I kind of get the feeling that they don't want to make the upgrade easy because they view "easy upgrade" as their moat. Not very reassuring, but still a good product. I think the Alpine way is a good solution, or using a new volume entirely.

@thiswillbeyourgithub
Contributor

I was suddenly reminded of that prominent quote in the readme:

This app is under heavy development and it's far from stable.

So my opinion is now that, as long as there is no data loss (ever!), this disclaimer grants you the moral right to move faster in those rollout situations.

  • Ollama users will figure it out
  • we contributors will help with PRs left and right to iron out the time-consuming kinks (docs, guides, UI, perf, etc.)
  • this kind of major feature, touching both frontend and backend, seems (to me) best handled at this stage by the maintainer, saving coordination overhead
  • with this disclaimer, the goal is to move fast and quickly reach the "stable" state, as a lot of users are probably hesitant to join as long as the disclaimer is in place

Just my 2 cents on priorities of course. I'm aware you're very busy and am sure you'll do what's best for us all.

@MohamedBassem marked this pull request as draft on June 22, 2025 at 21:11
@paul-tharun

We are using the same OpenAI API key and base URL for both inference and embedding. I think having a separate config for each of them would be better, as it allows using different providers for each of these tasks. In addition, this would also remove the limitation of having to use a provider that supports both embeddings and chat.

@thiswillbeyourgithub
Contributor

thiswillbeyourgithub commented Jun 27, 2025

We are using the same OpenAI API key and base URL for both inference and embedding. I think having a separate config for each of them would be better, as it allows using different providers for each of these tasks. In addition, this would also remove the limitation of having to use a provider that supports both embeddings and chat.

I sort of disagree; for more customization you should instead run a LiteLLM instance.

@paul-tharun

We are using the same OpenAI API key and base URL for both inference and embedding. I think having a separate config for each of them would be better, as it allows using different providers for each of these tasks. In addition, this would also remove the limitation of having to use a provider that supports both embeddings and chat.

I sort of disagree; for more customization you should instead run a LiteLLM instance.

LiteLLM is one of the LLM gateway providers that support embeddings; there are others that do not support embeddings and focus solely on LLM inference (OpenRouter, Cloudflare). Adding this would be good separation of concerns IMO; if needed, we can default to the inference parameters when the embedding params are not set.
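
Something along these lines (a minimal sketch of the fallback idea; the env var names and the default model are invented for illustration and are not karakeep's actual configuration keys):

```ts
// Hypothetical embedding config that falls back to the inference (chat)
// config when not set. Env var names are made up for illustration.
interface OpenAICompatConfig {
  baseUrl: string;
  apiKey: string;
  model: string;
}

function loadEmbeddingConfig(inference: OpenAICompatConfig): OpenAICompatConfig {
  return {
    baseUrl: process.env.EMBEDDING_OPENAI_BASE_URL ?? inference.baseUrl,
    apiKey: process.env.EMBEDDING_OPENAI_API_KEY ?? inference.apiKey,
    // Embedding models differ from chat models, so this one gets its own
    // variable with a placeholder default rather than reusing the chat model.
    model: process.env.EMBEDDING_TEXT_MODEL ?? "text-embedding-3-small",
  };
}
```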
