
Conversation

borisarzentar
Member

Description

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

@borisarzentar borisarzentar requested a review from dexters1 May 13, 2025 18:21
@borisarzentar borisarzentar self-assigned this May 13, 2025

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@borisarzentar borisarzentar changed the base branch from main to dev May 13, 2025 18:21
Contributor

coderabbitai bot commented May 13, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces significant architectural and functional changes across the codebase. Major highlights include: a multi-stage Docker build using a custom base image and a switch from Poetry to uv for dependency management; a unified, centralized observability setup; new support for Memgraph as a graph database provider; improved error handling and refactoring of vector database adapters; the addition of context propagation through pipeline tasks; and updated documentation and test coverage reflecting these core changes.

Changes

Files / Groups and their change summaries:

  • Dockerfile, entrypoint.sh: Refactored to use a multi-stage build with a custom base image and uv for dependency management; removed Poetry usage; updated entrypoint to use .venv/bin/ executables directly.
  • README.md, community/README.zh.md, assets/graph_visualization.html, notebooks/github_graph_visualization.html, notebooks/hr_demo.ipynb: Updated documentation: new multi-language sections, new UI screenshot, removed graph visualization HTML assets, added Chinese README, deleted HR demo notebook and visualization HTML.
  • cognee-mcp/src/server.py: Refactored to use FastMCP, split tool registration into individual async functions, added status tools, improved background task handling, and updated server startup logic.
  • cognee/modules/observability/get_observe.py, cognee/modules/observability/observers.py, cognee/base_config.py, cognee/shared/data_models.py, cognee/infrastructure/llm/openai/adapter.py, cognee/infrastructure/llm/gemini/adapter.py: Centralized observability setup: introduced Observer enum, get_observe utility, removed MonitoringTool, updated config and LLM adapters to use new observability pattern.
  • cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py, cognee/infrastructure/databases/graph/get_graph_engine.py, cognee/tests/test_memgraph.py: Added Memgraph graph database support: new async adapter, provider selection logic, and integration test.
  • cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py, cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py, cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py, cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py, cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py, cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py: Refactored vector adapters: removed deprecated get_distance_from_collection_elements, unified search method with improved error handling, updated method signatures and default limits, centralized collection existence checks.
  • cognee/modules/data/methods/get_unique_dataset_id.py, cognee/modules/data/methods/create_dataset.py, cognee/modules/data/methods/__init__.py, cognee/api/v1/cognify/code_graph_pipeline.py, cognee/modules/pipelines/operations/pipeline.py, cognee/tasks/ingestion/ingest_data.py: Introduced get_unique_dataset_id for dataset ID generation; updated dataset creation and pipeline logic to use user context and async ID generation.
  • cognee/modules/pipelines/operations/run_tasks.py, cognee/modules/pipelines/operations/run_tasks_base.py, cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py: Added context propagation through pipeline tasks: updated function signatures, injected context into tasks, and added new unit test for context-aware pipelines.
  • cognee/modules/retrieval/exceptions/exceptions.py, cognee/modules/retrieval/exceptions/__init__.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py, cognee/modules/retrieval/graph_completion_retriever.py, cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py: Removed CollectionDistancesNotFoundError, improved error handling in retrieval utilities, updated tests to reflect new exception handling.
  • cognee/exceptions/exceptions.py, cognee/infrastructure/databases/vector/exceptions/exceptions.py: Extended error classes with logging controls and log level parameters; updated constructor signatures for more granular logging.
  • cognee/infrastructure/databases/graph/networkx/adapter.py: Updated method signatures to use UUID for node identifiers instead of str.
  • cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py: Hardcoded schema list for table deletion, updated comments, and simplified logic.
  • cognee/modules/graph/cognee_graph/CogneeGraph.py: Replaced vector distance method with unified search call and explicit parameters.
  • cognee/shared/logging_utils.py: Broadened SQLAlchemy warning suppression to apply above DEBUG log level.
  • cognee-frontend/src/utils/fetch.ts: Changed fetch base URL from 127.0.0.1 to localhost.
  • examples/data/car_and_tech_companies.txt: Added new sample data file with company descriptions.
  • notebooks/cognee_demo.ipynb, notebooks/cognee_graphiti_demo.ipynb, notebooks/cognee_llama_index.ipynb, notebooks/cognee_simple_demo.ipynb, notebooks/graphrag_vs_rag.ipynb, notebooks/llama_index_cognee_integration.ipynb: Updated code samples for new APIs, user context, and improved metadata formatting; upgraded package versions.
  • examples/python/graphiti_example.py: Updated to fetch and pass default user to tasks.
  • cognee/tests/integration/run_toy_tasks/conftest.py: Deleted pytest fixture for copying test database.
  • cognee/tests/test_neo4j.py, cognee/tests/test_weaviate.py, cognee/tests/unit/modules/pipelines/run_tasks_test.py, cognee/tests/unit/modules/retrieval/chunks_retriever_test.py, cognee/tests/unit/modules/retrieval/graph_completion_retriever_test.py, cognee/tests/unit/modules/retrieval/summaries_retriever_test.py: Updated tests for new APIs, async patterns, and improved result assertions; removed or consolidated redundant tests.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Pipeline
    participant Task
    participant Context

    User->>Pipeline: run_tasks(tasks, data, user, context)
    Pipeline->>Task: for each task, handle_task(task, args, context)
    Task->>Task: if executable accepts context, pass context
    Task->>Context: use context in execution (optional)
    Task-->>Pipeline: return result
    Pipeline-->>User: yield result
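
In code, the context-injection step in the first diagram might look roughly like this (a minimal sketch, assuming handle_task receives the resolved task, its positional arguments, and the shared context; everything except the inspect.signature check is illustrative, not the actual pipeline internals):

import inspect
from typing import Any, Optional


async def handle_task(running_task, args: list, context: Optional[dict] = None) -> Any:
    # Pass `context` to the task's executable only if the executable declares it.
    executable = running_task.executable
    accepts_context = "context" in inspect.signature(executable).parameters

    kwargs = {"context": context} if accepts_context else {}
    return await executable(*args, **kwargs)
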
sequenceDiagram
    participant API
    participant MemgraphAdapter
    participant MemgraphDB

    API->>MemgraphAdapter: create_graph_engine(provider="memgraph", credentials)
    MemgraphAdapter->>MemgraphDB: connect using Neo4j async driver
    API->>MemgraphAdapter: add_node / add_edge / query / get_metrics
    MemgraphAdapter->>MemgraphDB: execute Cypher queries asynchronously
    MemgraphDB-->>MemgraphAdapter: return results
    MemgraphAdapter-->>API: return processed data

Possibly related PRs

  • #766: Refactors vector DB adapters to remove get_distance_from_collection_elements and unify search methods, directly related to the vector adapter changes here.
  • #788: Adds context propagation to run_tasks and related pipeline functions, matching the context-handling refactor in this PR.
  • #815: Modifies the Dockerfile for dependency management and build process, related to the multi-stage Dockerfile and uv migration in this update.

Poem

In the warren, code hops anew,
With Docker rebuilt and a Memgraph view.
Context now flows through each pipeline task,
Observers watch quietly, no need to ask.
Vector searches unified, errors now clear—
A garden of features for all bunnies here!
🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 14

🔭 Outside diff range comments (8)
cognee/exceptions/exceptions.py (1)

34-64: 🛠️ Refactor suggestion

Update subclasses to use new parameters

The subclasses (ServiceError, InvalidValueError, InvalidAttributeError) should be updated to pass the new parameters to the parent class.

Update the subclasses to include the new parameters:

 class ServiceError(CogneeApiError):
     """Failures in external services or APIs, like a database or a third-party service"""
 
     def __init__(
         self,
         message: str = "Service is unavailable.",
         name: str = "ServiceError",
         status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+        log=True,
+        log_level="ERROR",
     ):
-        super().__init__(message, name, status_code)
+        super().__init__(message, name, status_code, log, log_level)

Apply similar changes to InvalidValueError and InvalidAttributeError.

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2)

48-54: 🛠️ Refactor suggestion

Avoid synchronous DB calls inside the event-loop

client.has_collection() is a blocking, synchronous SDK call that is executed directly inside an async method.
Running blocking I/O in the event-loop will freeze other coroutines while the call is in progress.

-        future = asyncio.Future()
-        client = self.get_milvus_client()
-        future.set_result(client.has_collection(collection_name=collection_name))
-
-        return await future
+        loop = asyncio.get_running_loop()
+        client = self.get_milvus_client()
+        return await loop.run_in_executor(
+            None, lambda: client.has_collection(collection_name=collection_name)
+        )

99-115: 🛠️ Refactor suggestion

Potential event-loop blocking in create_data_points

client.insert() is another synchronous SDK call executed directly. For large batches this can stall the entire asyncio application.

Consider delegating the insertion to a thread-pool:

-            result = client.insert(collection_name=collection_name, data=insert_data)
+            loop = asyncio.get_running_loop()
+            result = await loop.run_in_executor(
+                None, lambda: client.insert(collection_name=collection_name, data=insert_data)
+            )
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (2)

100-126: ⚠️ Potential issue

upload_points is never awaited – points are not stored

AsyncQdrantClient.upload_points() returns a coroutine that must be awaited.
Without the await, the upload is silently skipped and the surrounding try
block will never raise the expected UnexpectedResponse.

-            client.upload_points(collection_name=collection_name, points=points)
+            await client.upload_points(collection_name=collection_name, points=points)

163-205: 🛠️ Refactor suggestion

Double client-close and unreachable finally

await client.close() appears both:

  1. inside the try block (line 183)
  2. again in the finally clause (line 204)

The second call will attempt to close an already-closed connection and can
raise RuntimeError: Event loop is closed under heavy load.

Remove the close inside the try block and keep the finally guard.

-            await client.close()
cognee/infrastructure/databases/graph/networkx/adapter.py (2)

108-113: 🛠️ Refactor suggestion

Edge/node ID inconsistency will create orphan edges

add_edges() converts UUIDs to strings before storage, yet every read-side API (get_edges, has_node, etc.) now expects raw UUID objects.
As a result:

  1. has_edge() will succeed (edge keys are strings) while has_node() uses UUIDs.
  2. get_edges(UUID) will likely return an empty list because the underlying graph stores the string version.

Store IDs in a single, canonical form (preferably UUID) everywhere or perform explicit conversion on both write and read paths.

Also applies to: 139-143
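
One way to keep the IDs canonical on both paths is to normalise at the adapter boundary; a minimal sketch (helper name illustrative), assuming the adapter standardises on UUID internally:

from uuid import UUID
from typing import Union


def as_uuid(node_id: Union[str, UUID]) -> UUID:
    # Normalise incoming identifiers so write and read paths use the same key type.
    return node_id if isinstance(node_id, UUID) else UUID(str(node_id))

# write path: self.graph.add_edge(as_uuid(source), as_uuid(target), key=edge_key, **properties)
# read path:  self.graph.has_node(as_uuid(node_id)); self.graph.edges(as_uuid(node_id))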


303-304: ⚠️ Potential issue

node_link_data extra argument is invalid

nx.readwrite.json_graph.node_link_data does not take an edges= parameter; unexpected kwargs raise TypeError on recent NetworkX versions.

-graph_data = nx.readwrite.json_graph.node_link_data(self.graph, edges="links")
+graph_data = nx.readwrite.json_graph.node_link_data(self.graph)
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)

90-97: 🛠️ Refactor suggestion

Quadratic look-up when pairing embeddings with datapoints

Using data_points.index(data_point) inside a list-comprehension yields O(n²) complexity and gives the wrong vector when duplicate instances appear.

-def convert_to_weaviate_data_points(data_point: DataPoint):
-    vector = data_vectors[data_points.index(data_point)]
+def convert_to_weaviate_data_points(pair):
+    data_point, vector = pair
@@
-data_points = [convert_to_weaviate_data_points(data_point) for data_point in data_points]
+data_points = [
+    convert_to_weaviate_data_points(p)
+    for p in zip(data_points, data_vectors)
+]
♻️ Duplicate comments (1)
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (1)

82-88: Same issue in bulk-insert query

n:node.label suffers from the same dynamic-label problem and will attempt to create the literal label node.label.

Ensure labels are interpolated before sending the Cypher and properly validated.
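
A sketch of the same whitelist-then-interpolate approach for the bulk path, assuming nodes are grouped per label before the query is built (the query shape is illustrative, not the adapter's exact Cypher):

import re

LABEL_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_bulk_merge_query(label: str) -> str:
    # Validate the label, then interpolate it; ids and properties remain query parameters.
    if not LABEL_PATTERN.fullmatch(label):
        raise ValueError(f"Invalid label: {label}")
    return (
        "UNWIND $nodes AS node "
        f"MERGE (n:`{label}` {{id: node.node_id}}) "
        "SET n += node.properties, n.updated_at = timestamp()"
    )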

🧹 Nitpick comments (16)
cognee/modules/retrieval/exceptions/__init__.py (1)

7-7: Add imported exceptions to __all__ or use them directly

The imported exceptions SearchTypeNotSupported and CypherSearchError are currently unused in this module.

Either add them to an __all__ list to make them available to users of this module, or use a redundant alias import pattern. For example:

from .exceptions import SearchTypeNotSupported, CypherSearchError

+ __all__ = ["SearchTypeNotSupported", "CypherSearchError"]
🧰 Tools
🪛 Ruff (0.8.2)

7-7: .exceptions.SearchTypeNotSupported imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


7-7: .exceptions.CypherSearchError imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

cognee/exceptions/exceptions.py (2)

15-16: Great addition of logging control parameters

Adding log and log_level parameters provides better flexibility for controlling error logging behavior.

Consider using an enum for log_level instead of string literals to prevent potential typos and provide better IDE support.


23-30: Well-implemented conditional logging

The conditional logging implementation correctly handles different log levels, providing granular control over how exceptions are logged.

A small optimization would be to use a dictionary mapping or match/case statement (if using Python 3.10+) instead of multiple if-elif conditions:

-        if log and (log_level == "ERROR"):
-            logger.error(f"{self.name}: {self.message} (Status code: {self.status_code})")
-        elif log and (log_level == "WARNING"):
-            logger.warning(f"{self.name}: {self.message} (Status code: {self.status_code})")
-        elif log and (log_level == "INFO"):
-            logger.info(f"{self.name}: {self.message} (Status code: {self.status_code})")
-        elif log and (log_level == "DEBUG"):
-            logger.debug(f"{self.name}: {self.message} (Status code: {self.status_code})")
+        if log:
+            log_msg = f"{self.name}: {self.message} (Status code: {self.status_code})"
+            if log_level == "ERROR":
+                logger.error(log_msg)
+            elif log_level == "WARNING":
+                logger.warning(log_msg)
+            elif log_level == "INFO":
+                logger.info(log_msg)
+            elif log_level == "DEBUG":
+                logger.debug(log_msg)
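
For reference, the dictionary-dispatch variant mentioned above could look roughly like this inside the same constructor (a sketch against the same fields, not the final implementation):

if log:
    log_message = f"{self.name}: {self.message} (Status code: {self.status_code})"
    log_method = {
        "ERROR": logger.error,
        "WARNING": logger.warning,
        "INFO": logger.info,
        "DEBUG": logger.debug,
    }.get(log_level, logger.error)  # fall back to error-level logging for unknown values
    log_method(log_message)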
cognee/modules/data/methods/__init__.py (1)

10-10: Address unused import warning.

The get_unique_dataset_id function is imported but not used in this file. Since this appears to be intentionally exposing the function through the module's public API, consider one of these approaches:

  1. Add it to __all__ to explicitly mark it as part of the public API
  2. Use a redundant alias to indicate intentional re-export

This will also resolve the F401 lint warning.

+# Define public API
+__all__ = [
+    "create_dataset",
+    "get_dataset", "get_datasets", "get_datasets_by_name", "get_dataset_data", "get_data", "get_unique_dataset_id",
+    "delete_dataset", "delete_data"
+]
🧰 Tools
🪛 Ruff (0.8.2)

10-10: .get_unique_dataset_id.get_unique_dataset_id imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

notebooks/llama_index_cognee_integration.ipynb (1)

6-8: Updated notebook string formatting for consistency.

The notebook has been updated to convert single-line string sources into single-element lists of strings, which standardizes the JSON structure. This is a non-functional change that improves consistency in the notebook format.

Also applies to: 62-64, 199-201, 219-221

cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)

332-334: Explicit schema specification improves predictability

Replacing dynamic schema discovery with a fixed list of schemas (["public", "public_staging"]) makes the database deletion operation more predictable and controlled.

However, be cautious about hardcoding schema names as it might miss any new schemas added in the future.

Consider adding a comment explaining why these specific schemas are targeted and what implications this might have for future schema additions.
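
For example, the hardcoded list could carry that context inline (variable name and comment are illustrative):

# Only these schemas are dropped when deleting the database; cognee currently creates
# tables in "public" and "public_staging". Extend this list if new schemas are introduced.
schemas = ["public", "public_staging"]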

examples/data/car_and_tech_companies.txt (1)

1-37: Sample text data suitable for demonstrations.

This new file provides structured, descriptive text examples that can be used for testing and demonstrating the system's capabilities.

However, there's a minor grammatical error:

-Each of these car manufacturer contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence.
+Each of these car manufacturers contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence.
🧰 Tools
🪛 LanguageTool

[duplication] ~2-~2: Possible typo: you repeated a word.
Context: text_1 = """ 1. Audi Audi is known for its modern designs and adv...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~5-~5: Possible typo: you repeated a word.
Context: ...ns to high-performance sports cars. 2. BMW BMW, short for Bayerische Motoren Werke, is...

(ENGLISH_WORD_REPEAT_RULE)


[style] ~6-~6: Consider using a more concise synonym.
Context: ... reflects that commitment. BMW produces a variety of cars that combine luxury with sporty pe...

(A_VARIETY_OF)


[duplication] ~8-~8: Possible typo: you repeated a word.
Context: ...ine luxury with sporty performance. 3. Mercedes-Benz Mercedes-Benz is synonymous with luxury and quality. ...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~11-~11: Possible typo: you repeated a word.
Context: ... catering to a wide range of needs. 4. Porsche Porsche is a name that stands for high-performa...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~14-~14: Possible typo: you repeated a word.
Context: ...o value both performance and style. 5. Volkswagen Volkswagen, which means "people's car" in German, ...

(ENGLISH_WORD_REPEAT_RULE)


[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...

(THIS_NNS)


[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...

(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)


[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...design excellence. """ text_2 = """ 1. Apple Apple is renowned for its innovative consumer...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~27-~27: Possible typo: you repeated a word.
Context: ... in shaping the internet landscape. 3. Microsoft Microsoft Corporation has been a dominant force i...

(ENGLISH_WORD_REPEAT_RULE)


[style] ~28-~28: Consider using a synonym to be more concise.
Context: ...n both business and personal computing. In recent years, Microsoft has expanded into cloud comp...

(IN_RECENT_STYLE)


[uncategorized] ~31-~31: You might be missing the article “the” here.
Context: ...or innovation continues to reshape both retail and technology sectors. 5. Meta Meta, ...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[duplication] ~33-~33: Possible typo: you repeated a word.
Context: ...both retail and technology sectors. 5. Meta Meta, originally known as Facebook, revoluti...

(ENGLISH_WORD_REPEAT_RULE)

cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)

66-73: Consider using contextlib.suppress for cleaner error handling.

The try-except-pass pattern can be more concisely expressed using contextlib.suppress.

-    try:
-        await memory_fragment.project_graph_from_db(
-            graph_engine,
-            node_properties_to_project=properties_to_project,
-            edge_properties_to_project=["relationship_name"],
-        )
-    except EntityNotFoundError:
-        pass
+    from contextlib import suppress
+    with suppress(EntityNotFoundError):
+        await memory_fragment.project_graph_from_db(
+            graph_engine,
+            node_properties_to_project=properties_to_project,
+            edge_properties_to_project=["relationship_name"],
+        )
🧰 Tools
🪛 Ruff (0.8.2)

66-73: Use contextlib.suppress(EntityNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(EntityNotFoundError)

(SIM105)

cognee/modules/pipelines/operations/run_tasks_base.py (1)

31-34: Simplify dictionary key check.

The check for "context" in function parameters can be simplified.

-    has_context = any(
-        [key == "context" for key in inspect.signature(running_task.executable).parameters.keys()]
-    )
+    has_context = "context" in inspect.signature(running_task.executable).parameters
🧰 Tools
🪛 Ruff (0.8.2)

32-32: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1)

165-168: Inconsistent error type

InvalidValueError inherits from CogneeApiError, which by default logs at
ERROR level. The invalid-argument condition here is not exceptional and
should normally return HTTP 422 without spamming error logs. Consider using
log=False when raising:

raise InvalidValueError(
    message="One of query_text or query_vector must be provided!",
    log=False
)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

250-252: Guard against empty search result earlier

A quick exit before normalisation avoids unnecessary CPU cycles:

-        if len(vector_list) == 0:
-            return []
+        if not vector_list:
+            return []
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (2)

76-82: Minor: avoid double round-trip in get_collection

await connection.open_table already throws if the table is missing; the preceding has_collection check incurs an extra network call.
Consider catching the library-specific “not found” error instead of pre-checking to reduce latency.


205-211: Batch deletes can be executed in a single predicate

Looping one DELETE per ID issues N commits:

for data_point_id in data_point_ids:
    await collection.delete(f"id = '{data_point_id}'")

Instead, delete once:

ids = "', '".join(data_point_ids)
await collection.delete(f"id IN ('{ids}')")

Reduces I/O and lock contention.

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)

48-50: Incorrect return type annotation for embeddings

embed_text returns a list of vectors (list[list[float]]), but the signature advertises List[float].
This misleads type-checkers and maintainers.

-async def embed_data(self, data: List[str]) -> List[float]:
+async def embed_data(self, data: List[str]) -> List[List[float]]:
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)

243-249: Keep default limits consistent with search

search() now defaults to limit=15, while batch_search() keeps limit=5. Consider aligning the defaults to avoid surprising behaviour for callers that switch between the two helpers.

-async def batch_search(
-    self,
-    collection_name: str,
-    query_texts: List[str],
-    limit: int = 5,
+async def batch_search(
+    self,
+    collection_name: str,
+    query_texts: List[str],
+    limit: int = 15,
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (1)

430-444: Mutable default argument detected

Using {} as a default is shared across invocations.

-def serialize_properties(self, properties=dict()):
+def serialize_properties(self, properties: Optional[dict] = None):
     serialized_properties = {}
+
+    if properties is None:
+        properties = {}

[B006]

🧰 Tools
🪛 Ruff (0.8.2)

430-430: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🛑 Comments failed to post (14)
cognee/modules/observability/get_observe.py (1)

5-11: 🛠️ Refactor suggestion

New centralized observability function lacks handling for all observer types

This new function centralizes the logic for obtaining the observe decorator, which is a good practice. However, it only handles the LANGFUSE observer type and doesn't provide implementations for other types like LLMLITE or LANGSMITH, nor does it include a default fallback.

Add handling for all observer types defined in the Observer enum and include a default fallback:

 def get_observe():
     monitoring = get_base_config().monitoring_tool
 
     if monitoring == Observer.LANGFUSE:
         from langfuse.decorators import observe
 
         return observe
+    elif monitoring == Observer.LLMLITE:
+        # Import and return the appropriate observe decorator
+        pass
+    elif monitoring == Observer.LANGSMITH:
+        # Import and return the appropriate observe decorator
+        pass
+    else:
+        # Return a no-op decorator as a safe fallback (supports both @observe and @observe(...))
+        def noop_observe(*args, **kwargs):
+            if len(args) == 1 and callable(args[0]) and not kwargs:
+                return args[0]
+
+            def decorator(func):
+                return func
+
+            return decorator
+
+        return noop_observe

This ensures the function works for all configured observer types and provides a safe fallback.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents (early access)
In cognee/modules/observability/get_observe.py around lines 5 to 11, the
get_observe function only handles the LANGFUSE observer type and lacks
implementations for other observer types like LLMLITE and LANGSMITH, as well as
a default fallback. Update the function to include conditional branches for each
observer type defined in the Observer enum, importing and returning the
appropriate observe decorator for each. Also, add a default fallback return
value or decorator to ensure the function safely handles unexpected or unset
observer types.
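
For context, a rough sketch of how callers typically consume the returned decorator (illustrative only; it also shows why the fallback must support both call forms):

from cognee.modules.observability.get_observe import get_observe

observe = get_observe()


@observe  # bare usage
async def acompletion(prompt: str) -> str:
    ...


@observe(as_type="generation")  # parameterised usage; `as_type` is a Langfuse-specific kwarg
async def acreate_structured_output(prompt: str) -> dict:
    ...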
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

227-236: ⚠️ Potential issue

Row unpacking bug – attributes accessed on wrong object

session.execute(query) returns Result rows shaped like
(<PGVectorDataPoint>, similarity).
Inside the loop vector is the row, not the model instance, therefore
vector.id and vector.payload are undefined.

-        for vector in closest_items.all():
-            vector_list.append(
-                {
-                    "id": parse_id(str(vector.id)),
-                    "payload": vector.payload,
-                    "_distance": vector.similarity,
-                }
-            )
+        for row in closest_items.all():
+            data_point, similarity = row
+            vector_list.append(
+                {
+                    "id": parse_id(str(data_point.id)),
+                    "payload": data_point.payload,
+                    "_distance": similarity,
+                }
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            query = select(
                PGVectorDataPoint,
                PGVectorDataPoint.c.vector.cosine_distance(query_vector).label("similarity"),
            ).order_by("similarity")

            if limit > 0:
                query = query.limit(limit)

            # Find closest vectors to query_vector
            closest_items = await session.execute(query)

            for row in closest_items.all():
                data_point, similarity = row
                vector_list.append(
                    {
                        "id": parse_id(str(data_point.id)),
                        "payload": data_point.payload,
                        "_distance": similarity,
                    }
                )
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py around
lines 227 to 236, the result of session.execute(query) returns rows as tuples of
(PGVectorDataPoint, similarity), but the code incorrectly treats each row as a
PGVectorDataPoint instance. To fix this, unpack each row into its components
(e.g., vector, similarity) before accessing attributes like id or payload on the
vector object.
cognee-mcp/src/server.py (3)

46-48: ⚠️ Potential issue

Chain the original exception when re-raising

Static analysis (B904) correctly points out that the root exception is discarded:

except Exception as e:
    ...
    raise ValueError(f"Failed to cognify: {str(e)}")

Using raise … from e preserves the traceback and simplifies debugging.

-            except Exception as e:
-                logger.error("Cognify process failed.")
-                raise ValueError(f"Failed to cognify: {str(e)}")
+            except Exception as e:
+                logger.error("Cognify process failed.")
+                raise ValueError(f"Failed to cognify: {e}") from e
🧰 Tools
🪛 Ruff (0.8.2)

48-48: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents (early access)
In cognee-mcp/src/server.py around lines 46 to 48, the exception is re-raised
without chaining the original exception, which discards the traceback. Modify
the raise statement to use "raise ValueError(...) from e" to preserve the
original exception context and traceback for better debugging.

111-120: ⚠️ Potential issue

Validate search_type to prevent KeyError crashes

SearchType[search_type.upper()] will throw a KeyError for any unexpected value, crashing the tool.

Add an explicit whitelist or catch the error:

-            search_results = await cognee.search(
-                query_type=SearchType[search_type.upper()], query_text=search_query
-            )
+            try:
+                query_type = SearchType[search_type.upper()]
+            except KeyError:
+                raise ValueError(
+                    f"Unsupported search_type '{search_type}'. "
+                    f"Allowed values: {[t.name.lower() for t in SearchType]}"
+                ) from None
+
+            search_results = await cognee.search(
+                query_type=query_type, query_text=search_query
+            )

Also applies to: 126-128

🤖 Prompt for AI Agents (early access)
In cognee-mcp/src/server.py around lines 111 to 120, the code uses
SearchType[search_type.upper()] without validating search_type, which can cause
a KeyError if the value is unexpected. To fix this, add explicit validation by
checking if search_type.upper() is in the allowed SearchType keys before using
it, or wrap the access in a try-except block to catch KeyError and handle it
gracefully. Apply the same validation or error handling to lines 126-128 where
SearchType is also accessed.

50-56: 🛠️ Refactor suggestion

Background tasks run “fire-and-forget” – add error handling to avoid silent failures

asyncio.create_task() detaches the pipeline tasks, but unhandled exceptions inside them are never surfaced.
Attach a done-callback or wrap the whole task in asyncio.create_task(_task, name=…) and inspect task.exception() to log failures.

Example:

task = asyncio.create_task(cognify_task(...), name="cognify_task")
task.add_done_callback(
    lambda t: logger.error("Cognify crashed: %s", t.exception()) if t.exception() else None
)

Also applies to: 88-94

🤖 Prompt for AI Agents (early access)
In cognee-mcp/src/server.py around lines 50 to 56, the asyncio.create_task call
runs the cognify_task as a fire-and-forget background task without error
handling, causing exceptions to be silently ignored. To fix this, assign the
created task to a variable, give it a name, and add a done-callback that checks
if the task raised an exception; if so, log the error using the logger. Apply
the same pattern to the similar code block at lines 88 to 94.
cognee/infrastructure/databases/graph/networkx/adapter.py (1)

45-46: ⚠️ Potential issue

Type-hint drift from the abstract interface

GraphDBInterface.has_node() still expects node_id: str, but the adapter signature was changed to UUID.
This breaks LSP compliance and will confuse call-sites still passing strings.

Align the types across all adapters (or accept both and cast internally) and update the interface if the project is truly migrating to UUIDs.

🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/networkx/adapter.py around lines 45 to
46, the method has_node uses node_id typed as UUID, but the abstract interface
expects a string, causing a type-hint mismatch and breaking Liskov Substitution
Principle compliance. To fix this, align the type hints by either changing the
adapter method signature to accept a string as in the interface or update the
interface to accept UUID if migrating to UUIDs. Alternatively, modify the
adapter to accept both types and internally cast strings to UUID before use,
ensuring consistent type usage across all adapters and the interface.
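
If the project does migrate to UUIDs, one low-friction option is to accept both forms at the boundary and normalise internally (a sketch only; the abstract interface would still need its annotation updated):

from uuid import UUID
from typing import Union

NodeId = Union[str, UUID]


async def has_node(self, node_id: NodeId) -> bool:
    # Accept either form from callers, but look nodes up by one canonical type.
    canonical_id = node_id if isinstance(node_id, UUID) else UUID(str(node_id))
    return self.graph.has_node(canonical_id)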
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)

43-47: 🛠️ Refactor suggestion

Avoid reconnecting on every call to get_client

await self.client.connect() is executed each time get_client() is called.
For long-lived adapters this results in unnecessary handshakes and connection churn, hurting latency and exhausting connection limits.

 async def get_client(self):
-    await self.client.connect()
-    return self.client
+    if not self.client.is_connected():   # pseudo–API – adjust to real attribute
+        await self.client.connect()
+    return self.client

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py around
lines 43 to 47, the get_client method calls await self.client.connect() every
time it is invoked, causing unnecessary reconnections. Modify the method to
check if the client is already connected before calling connect, or establish
the connection once during adapter initialization and simply return the
connected client in get_client without reconnecting.

116-122: ⚠️ Potential issue

Missing await on async API – risk of silent coroutine

collection.data.exists and the subsequent update/insert helpers are asynchronous in the async Weaviate client.
Calling them without await returns a coroutine, so the condition is always truthy, which breaks the upsert logic.

-if collection.data.exists(data_point.uuid):
-    return await collection.data.update(
+if await collection.data.exists(data_point.uuid):
+    return await collection.data.update(

Apply the same pattern to collection.data.insert in the else branch if that method is also async.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

                if await collection.data.exists(data_point.uuid):
                    return await collection.data.update(
                        uuid=data_point.uuid,
                        vector=data_point.vector,
                        properties=data_point.properties,
                        references=data_point.references,
                    )
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py around
lines 116 to 122, the calls to collection.data.exists and collection.data.update
are asynchronous but missing the await keyword, causing them to return
coroutines and break the upsert logic. Add await before collection.data.exists
and collection.data.update calls to properly await their completion. Also, check
if collection.data.insert in the else branch is async and add await there as
well to ensure correct asynchronous behavior.
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (6)

255-258: ⚠️ Potential issue

Relationship type accessed via indexing – use .type attribute

Neo4j driver returns neo4j.graph.Relationship; subscripting it (rel[1]) is undefined behaviour and will raise TypeError.
Retrieve the type with rel.type (or type(rel) in Memgraph).

-(result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]})
+(
+    result["n"]["id"],
+    result["m"]["id"],
+    {"relationship_name": result["r"].type},
+)

Apply the same fix in get_connections.

Also applies to: 387-393

🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 255 to 258, the code accesses the relationship type using indexing
(result["r"][1]), which is incorrect and causes a TypeError. Replace this with
accessing the relationship type via the .type attribute (result["r"].type).
Also, apply the same fix in the get_connections method and similarly update
lines 387 to 393 where this pattern occurs.

67-78: 🛠️ Refactor suggestion

Dynamic labels are not parameterisable

Cypher (incl. Memgraph) does not allow label names to be supplied via parameters (e.g. node:$node_label).
This query will fail at compile time.

Replace with string interpolation that safely whitelists labels:

label = type(node).__name__
if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
    raise ValueError(f"Invalid label: {label}")
query = f"""
MERGE (node {{id: $node_id}})
ON CREATE SET node:`{label}`, node += $properties, node.updated_at = timestamp()
ON MATCH  SET node:`{label}`, node += $properties, node.updated_at = timestamp()
RETURN ID(node) AS internal_id, node.id AS nodeId
"""
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 67 to 78, the Cypher query incorrectly uses a parameter for the node
label, which is not supported and causes a compile-time error. Fix this by
extracting the label as a string from the node's type name, validating it
against a whitelist regex to ensure it is a safe label, and then use Python
string interpolation to insert the label directly into the query string instead
of using a parameter. Keep other parameters like node_id and properties as query
parameters.

414-417: ⚠️ Potential issue

Interpolating edge_label and id directly into labels creates invalid Cypher & risk of injection

MATCH (node:`{id}`)<-[r:{edge_label}]-(successor)
  • id is undefined, and label injection is unsafe.

Refactor to parameterised properties instead of dynamic labels:

UNWIND $node_ids AS nid
MATCH (node {id: nid})<-[r]-(successor)
WHERE type(r) = $edge_label
DELETE r
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 414 to 417, the Cypher query incorrectly interpolates `id` and
`edge_label` directly into labels, causing invalid syntax and injection risks.
Replace dynamic label usage with parameterized properties by matching nodes with
a property filter (e.g., `node {id: nid}`) and filter relationships by type
using a WHERE clause with a parameter for `edge_label`. Adjust the query to use
UNWIND with parameterized node IDs and pass `edge_label` as a parameter to
ensure safe and valid Cypher syntax.

122-124: ⚠️ Potential issue

Malformed Cypher pattern – double curly braces

MATCH (node: {{id: $node_id}}) is invalid syntax.
Probably meant to match on a property:

-query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node"
+query = "MATCH (node {id: $node_id}) DETACH DELETE node"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        query = "MATCH (node {id: $node_id}) DETACH DELETE node"
        params = {"node_id": sanitized_id}
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 122 to 124, the Cypher query uses double curly braces in the pattern MATCH
(node: {{id: $node_id}}), which is invalid syntax. Replace the double curly
braces with a property match syntax by using MATCH (node {id: $node_id}) to
correctly match nodes with the given id property.

3-5: ⚠️ Potential issue

ERROR symbol not exported from logging_utils

cognee.shared.logging_utils only exposes get_logger; importing ERROR raises ImportError at import time.

-from cognee.shared.logging_utils import get_logger, ERROR
+import logging
+from cognee.shared.logging_utils import get_logger

and call get_logger("MemgraphAdapter", level=logging.ERROR) instead.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

import json
import logging
from cognee.shared.logging_utils import get_logger
import asyncio
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py lines 3 to
5, the import of ERROR from cognee.shared.logging_utils is invalid because ERROR
is not exported. Remove the import of ERROR and instead import the standard
logging module. Then, when calling get_logger, pass the log level as
logging.ERROR using the standard logging module.

400-405: 🛠️ Refactor suggestion

Undefined variables injected into f-string

Inside the f-string {id: nid} tries to evaluate id (the built-in function) and nid (undefined).
Use double braces or parameters instead:

-query = f"""
-UNWIND $node_ids AS nid
-MATCH (node {id: nid})-[r]->(predecessor)
-WHERE type(r) = $edge_label
-DELETE r;
-"""
+query = """
+UNWIND $node_ids AS nid
+MATCH (node {id: nid})-[r]->(predecessor)
+WHERE type(r) = $edge_label
+DELETE r
+"""
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 400 to 405, the f-string incorrectly injects undefined variables by using
{id: nid}, which tries to evaluate the built-in id function and an undefined
variable nid. To fix this, replace the f-string expression with parameterized
query syntax or use double braces to escape the curly braces so that the query
treats {id: nid} as a literal map pattern rather than Python expressions.

@soobrosa
Contributor

@borisarzentar neat caching, might still make sense to leave a bit of documentation on which extra is what -- I mean it could also be in pyproject.toml

FROM python:3.11-slim

# Define Poetry extras to install
ARG POETRY_EXTRAS="\
Collaborator

@dexters1 dexters1 May 14, 2025


Might be good to define the extras the Dockerfile uses at the beginning of the Dockerfile, like before; it will make it easier to update and maintain later on, especially since extras are now referenced in two different places in the Dockerfile.

Member Author


We don't pass all extras at once, like we did with Poetry. Here we have to pass them one by one, so I would have to iterate somehow through the extras and add each one separately. I don't want to spend time on that now; we can change it later.

@Vasilije1990 Vasilije1990 self-requested a review May 14, 2025 16:24
Vasilije1990
Vasilije1990 previously approved these changes May 14, 2025
hajdul88
hajdul88 previously approved these changes May 14, 2025
@@ -127,6 +126,7 @@ dev = [
     "mkdocs-minify-plugin>=0.8.0,<0.9",
     "mkdocstrings[python]>=0.26.2,<0.27",
 ]
+debug = ["debugpy==1.8.9"]
Collaborator


Maybe add this under dev dependencies?

Member Author


It was there, but sometimes I need just debugpy and nothing else. And in the case of the Docker container, we have a debug option there, so we now install only this extra instead of the whole dev group.


     if not data:
         data: list[Data] = await get_dataset_data(dataset_id=dataset_id)

     # async with update_status_lock: TODO: Add UI lock to prevent multiple backend requests
     if isinstance(dataset, Dataset):
-        task_status = await get_pipeline_status([dataset_id])
+        task_status = await get_pipeline_status([dataset_id], pipeline_name)
Collaborator

@dexters1 dexters1 May 14, 2025


I think this is a breaking change for how cognee-mcp checks pipeline statuses, since there both add and cognify are under one pipeline.

Member Author


They are not technically one pipeline; they are just called from one function. But two pipelines will run. What do you think will break?

Collaborator


I didn't initially see that cognee-mcp was updated as well; you can disregard this comment.

@@ -144,7 +144,9 @@ async def cognify_status():
     """Get status of cognify pipeline"""
     with redirect_stdout(sys.stderr):
         user = await get_default_user()
-        status = await get_pipeline_status([await get_unique_dataset_id("main_dataset", user)])
+        status = await get_pipeline_status(
+            [await get_unique_dataset_id("main_dataset", user)], "cognify_pipeline"
Collaborator


What about the add pipeline status? That also needs to be checked along with cognify for the cognee-mcp cognify tool.

Collaborator


When both are complete, the status should be complete, and if the add pipeline is still working, the status needs to be processing.

Collaborator


What would happen now if the user checked the status while the add pipeline is running?

Member Author


In the MCP server, add and cognify are merged into one, so I think checking the cognify_pipeline status should be enough. Add usually happens very quickly, so I'm not sure that can happen.

Collaborator


If users check right away, it will most likely happen, and that is a pretty realistic thing for them to do. It would be great to also treat the returned status as one pipeline, by checking the state of both and deriving the result from the combined states (or something like this).
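
A rough sketch of that combined-status idea, reusing the helpers already imported in cognee-mcp/src/server.py (the "add_pipeline" name and the status-string matching are assumptions, not the actual constants):

async def combined_cognify_status() -> str:
    user = await get_default_user()
    dataset_id = await get_unique_dataset_id("main_dataset", user)

    # "cognify_pipeline" comes from the diff above; "add_pipeline" is an assumed name.
    add_status = await get_pipeline_status([dataset_id], "add_pipeline")
    cognify_status = await get_pipeline_status([dataset_id], "cognify_pipeline")

    statuses = (str(add_status), str(cognify_status))
    if any("STARTED" in status or "PROCESSING" in status for status in statuses):
        return "processing"
    if all("COMPLETED" in status for status in statuses):
        return "complete"
    return "unknown"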

Contributor


This is an extension of scope, or the scope is unclear; let's create a ticket for this one and merge with the existing functionality.

@borisarzentar borisarzentar dismissed stale reviews from hajdul88 and Vasilije1990 via cd72204 May 14, 2025 20:08
@Vasilije1990 Vasilije1990 self-requested a review May 15, 2025 08:05
@Vasilije1990 Vasilije1990 merged commit 0f3522e into dev May 15, 2025
49 of 50 checks passed
@Vasilije1990 Vasilije1990 deleted the fix/cognee-docker-image branch May 15, 2025 08:05
@coderabbitai coderabbitai bot mentioned this pull request Sep 11, 2025