-
Notifications
You must be signed in to change notification settings - Fork 593
feat: remove get_distance_from_collection_names and adapt search #766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: remove get_distance_from_collection_names and adapt search #766
Conversation
Please make sure all the checkboxes are checked:
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThis change refactors multiple vector database adapters and related modules to unify and modernize search and collection access logic. The deprecated Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant Adapter
participant DBClient
Caller->>Adapter: search(collection_name, query_text, query_vector, limit)
Adapter->>Adapter: get_collection(collection_name)
Adapter->>DBClient: retrieve collection
alt Collection exists
Adapter->>DBClient: perform search (limit, query_text, query_vector)
DBClient-->>Adapter: search results
Adapter-->>Caller: normalized results
else Collection missing
Adapter-->>Caller: raise CollectionNotFoundError
end
Possibly related PRs
Suggested labels
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🔭 Outside diff range comments (7)
cognee/modules/graph/cognee_graph/CogneeGraph.py (1)
124-135
: 🛠️ Refactor suggestion
⚠️ Potential issueUnused embedding computation & wrong argument – avoid double work
An embedding is generated (
query_vector
) but never passed tosearch
; instead,query_text
is supplied, forcing the adapter to re‑embed the same string. This doubles compute cost and latency.- query_vector = await vector_engine.embed_data([query]) - query_vector = query_vector[0] - if query_vector is None or len(query_vector) == 0: - raise ValueError("Failed to generate query embedding.") - - edge_distances = await vector_engine.search( - collection_name="EdgeType_relationship_name", - query_text=query, - limit=0, - ) + query_vector = (await vector_engine.embed_data([query]))[0] + if not query_vector: + raise ValueError("Failed to generate query embedding.") + + edge_distances = await vector_engine.search( + collection_name="EdgeType_relationship_name", + query_vector=query_vector, # reuse pre‑computed embedding + limit=0, + )Side note:
limit=0
is translated toNone
inside the adapter. Confirm that this yields the intended “unlimited results” behaviour with your backend.cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)
148-165
:⚠️ Potential issue
retrieve()
returns raw dicts, breaking the interface contract
VectorDBInterface.retrieve()
is expected to returnList[ScoredResult]
(or equivalent), but this implementation returns Milvus‑specific dictionaries. Down‑stream code expectingScoredResult
will break.
Convert the results:- return results + return [ + ScoredResult(id=parse_id(row["id"]), payload=row, score=0) + for row in results + ]cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
160-177
:⚠️ Potential issue
normalize_distances
gets raw dataframe rows, missing_distance
key
normalize_distances
expects an iterable of dicts containing_distance
; passingresult_values
as produced byto_dict()
means the function cannot find distances and will raise or return nonsense.
Map the results first:-result_values = list(results.to_dict("index").values()) +result_values = [ + { + "id": row["id"], + "payload": row["payload"], + "_distance": row["distance"], + } + for row in results.to_dict("index").values() +]cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)
80-88
: 🛠️ Refactor suggestionO(N²) vector lookup in
create_data_points
data_points.index(data_point)
inside a loop gives quadratic complexity. Useenumerate
to fetch the already‑computed vector in O(N).- def convert_to_weaviate_data_points(data_point: DataPoint): - vector = data_vectors[data_points.index(data_point)] + def convert_to_weaviate_data_points(idx: int, data_point: DataPoint): + vector = data_vectors[idx] ... -data_points = [convert_to_weaviate_data_points(data_point) for data_point in data_points] +data_points = [ + convert_to_weaviate_data_points(idx, dp) for idx, dp in enumerate(data_points) +]cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (3)
314-317
: 🛠️ Refactor suggestion
get_collection_names
should normalise to a list of stringsReturning the raw objects leaks implementation details and complicates every consumer (see previous bug).
Convert once here and keep the public contract simple.- return await client.list_collections() + # Chroma returns a list of Collection objects / dicts. + collections = await client.list_collections() + return [ + c.name if hasattr(c, "name") else c["name"] + for c in collections + ]After this change you can simplify
has_collection
to:return collection_name in await self.get_collection_names()
184-201
:⚠️ Potential issueParameter
normalized
is unused &limit = 0
may DOS the service
The
normalized
flag is accepted but ignored. Either honour it or drop the parameter.Converting
limit == 0
intoawait collection.count()
can explode to the full collection size, potentially overwhelming memory / network.
Prefer a hard cap or require the caller to pass an explicit positive integer.Example fix:
- if limit == 0: - limit = await collection.count() + if limit == 0: + raise InvalidValueError( + message="Limit must be greater than 0; requesting the whole " + "collection is not supported." + )and use
normalized
when post‑processing:if normalized: normalized_values = normalize_distances(vector_list) else: normalized_values = [1 - item["_distance"] for item in vector_list] # or raw
239-242
: 🛠️ Refactor suggestionCatching bare
Exception
hides programming errorsSwallowing every exception and returning an empty list makes debugging impossible and may produce silent data loss.
Catch only anticipated runtime errors (CollectionNotFoundError
, network time‑outs, etc.) and re‑raise the rest.- except Exception as e: - logger.error(f"Error in search: {str(e)}") - return [] + except CollectionNotFoundError: + raise + except RuntimeError as e: # adjust to concrete chromadb errors + logger.error(f"ChromaDB search failed: {e}") + return []
🧹 Nitpick comments (8)
cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)
145-147
:limit=0
fetches the entire collection – check scalabilityThe adapter interprets
limit=0
as no limit, potentially returning every vector in large collections.
For most ranking scenarios only the top results are needed:- vector_engine.search(collection_name=collection_name, query_text=query, limit=0) + vector_engine.search( + collection_name=collection_name, + query_text=query, + limit=top_k # or another reasonable cap + )Please verify the desired behaviour and adjust to prevent unnecessary load.
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1)
183-204
: Duplicateclient.close()
call – tidy up resource handling
await client.close()
is executed inside thetry
and thefinally
block. Double closing is redundant and can raise errors in future client versions.- await client.close() - return [ + return [ ScoredResult( ... ) ]Keep the single close in the
finally
block to centralise cleanup logic.cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
210-210
:with_vector
parameter is ignoredThe method declares
with_vector
but never includes the vector in the output when it isTrue
. Either honour it (e.g. add avector
field toScoredResult
) or remove the parameter to avoid a misleading API.cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2)
153-155
: Minor: shadowing the built‑inid
inside f‑string comprehensionWhile harmless, using
id
as the loop variable shadows Python’s built‑inid()
which can confuse readers.", ".join(f'"{dp_id}"' for dp_id in data_point_ids)
193-210
: Consider normalising Milvus distances for consistencyOther adapters map raw distances to a 0‑1 score via
normalize_distances
. Returning bare distances here makes score semantics adapter‑specific and complicates ranking across back‑ends.cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
165-167
:limit == 0
semantics rely on an extra DB callCalling
await collection.count_rows()
for every unlimited search adds latency. Consider treatinglimit <= 0
asNone
and letting LanceDB handle “no limit”, mirroring other adapters.cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (2)
121-127
: Avoid double round‑trip inget_collection
has_collection()
already hits the server. Immediately callingclient.get_collection()
makes a second request even when the collection is missing.
You can rely on the singleget_collection
call and catch the provider‑specific exception, mapping it toCollectionNotFoundError
, e.g.:- if not await self.has_collection(collection_name): - raise CollectionNotFoundError(f"Collection '{collection_name}' not found!") - - client = await self.get_connection() - return await client.get_collection(collection_name) + client = await self.get_connection() + try: + return await client.get_collection(collection_name) + except Exception: + # Map the provider error to the domain error + raise CollectionNotFoundError( + f"Collection '{collection_name}' not found!" + ) from NoneThis halves latency and network traffic for the common case.
253-261
: Keep defaultlimit
consistent withsearch
search
now defaults tolimit=15
whilebatch_search
still useslimit=5
.
Recommend aligning both to avoid surprising behaviour.- limit: int = 5, + limit: int = 15,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
cognee-mcp/pyproject.toml
is excluded by!**/*.toml
poetry.lock
is excluded by!**/*.lock
,!**/*.lock
pyproject.toml
is excluded by!**/*.toml
📒 Files selected for processing (9)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py
(8 hunks)cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
(7 hunks)cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py
(9 hunks)cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
(1 hunks)cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py
(6 hunks)cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py
(10 hunks)cognee/modules/graph/cognee_graph/CogneeGraph.py
(1 hunks)cognee/modules/retrieval/utils/brute_force_triplet_search.py
(1 hunks)cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py
(0 hunks)
💤 Files with no reviewable changes (1)
- cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (5)
cognee/shared/logging_utils.py (2)
get_logger
(137-158)error
(127-128)cognee/infrastructure/engine/utils/parse_id.py (1)
parse_id
(4-10)cognee/exceptions/exceptions.py (1)
InvalidValueError
(38-45)cognee/infrastructure/engine/models/DataPoint.py (1)
DataPoint
(16-96)cognee/infrastructure/databases/vector/exceptions/exceptions.py (1)
CollectionNotFoundError
(5-12)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Deletion Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: Basic Tests / Run Integration Tests
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)
170-173
:get_collection
may needawait
with async clientIf
weaviate-python
’s async variant makescollections.get()
a coroutine, omittingawait
will return a coroutine object and break later calls. Please confirm the API version and addawait
if required.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
Outdated
Show resolved
Hide resolved
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
Outdated
Show resolved
Hide resolved
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
Outdated
Show resolved
Hide resolved
…-get_distance_from_collection_names-and-search
…-get_distance_from_collection_names-and-search
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.