fix: cognee docker image #820
Conversation
Please make sure all the checkboxes are checked:
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the repository's CodeRabbit configuration. You can disable this status message in that configuration.

Walkthrough
This update introduces significant architectural and functional changes across the codebase. Major highlights include a multi-stage Docker build using a custom base image and a switch away from Poetry for dependency installation.

Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Pipeline
participant Task
participant Context
User->>Pipeline: run_tasks(tasks, data, user, context)
Pipeline->>Task: for each task, handle_task(task, args, context)
Task->>Task: if executable accepts context, pass context
Task->>Context: use context in execution (optional)
Task-->>Pipeline: return result
Pipeline-->>User: yield result
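A minimal sketch of the context-forwarding step shown in the diagram above (the names follow the diagram and the run_tasks_base notes later in this review; the exact signatures are assumptions, not the PR's final code):

import inspect

async def handle_task(running_task, args, context=None):
    # Forward the optional pipeline context only to executables that declare
    # a "context" parameter; every other task keeps its original call shape.
    signature = inspect.signature(running_task.executable)
    if "context" in signature.parameters:
        return await running_task.executable(*args, context=context)
    return await running_task.executable(*args)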
sequenceDiagram
participant API
participant MemgraphAdapter
participant MemgraphDB
API->>MemgraphAdapter: create_graph_engine(provider="memgraph", credentials)
MemgraphAdapter->>MemgraphDB: connect using Neo4j async driver
API->>MemgraphAdapter: add_node / add_edge / query / get_metrics
MemgraphAdapter->>MemgraphDB: execute Cypher queries asynchronously
MemgraphDB-->>MemgraphAdapter: return results
MemgraphAdapter-->>API: return processed data
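The second diagram boils down to running Cypher through the Neo4j async (Bolt) driver, which Memgraph also speaks. A minimal sketch of that pattern (the URI and credentials are placeholders, not values from the PR):

from neo4j import AsyncGraphDatabase

async def run_cypher(query: str, params: dict | None = None):
    driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("memgraph", "memgraph"))
    try:
        async with driver.session() as session:
            result = await session.run(query, params or {})
            return await result.data()  # list of dict-shaped records
    finally:
        await driver.close()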
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 14
🔭 Outside diff range comments (8)
cognee/exceptions/exceptions.py (1)
34-64
: 🛠️ Refactor suggestion
Update subclasses to use new parameters

The subclasses (ServiceError, InvalidValueError, InvalidAttributeError) should be updated to pass the new parameters to the parent class. Update the subclasses to include the new parameters:

 class ServiceError(CogneeApiError):
     """Failures in external services or APIs, like a database or a third-party service"""

     def __init__(
         self,
         message: str = "Service is unavailable.",
         name: str = "ServiceError",
         status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+        log=True,
+        log_level="ERROR",
     ):
-        super().__init__(message, name, status_code)
+        super().__init__(message, name, status_code, log, log_level)

Apply similar changes to InvalidValueError and InvalidAttributeError.

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2)
48-54
: 🛠️ Refactor suggestion
Avoid synchronous DB calls inside the event loop

client.has_collection() is a blocking, synchronous SDK call that is executed directly inside an async method. Running blocking I/O in the event loop will freeze other coroutines while the call is in progress.

- future = asyncio.Future()
- client = self.get_milvus_client()
- future.set_result(client.has_collection(collection_name=collection_name))
-
- return await future
+ loop = asyncio.get_running_loop()
+ client = self.get_milvus_client()
+ return await loop.run_in_executor(
+     None, lambda: client.has_collection(collection_name=collection_name)
+ )
99-115
: 🛠️ Refactor suggestion
Potential event-loop blocking in create_data_points

client.insert() is another synchronous SDK call executed directly. For large batches this can stall the entire asyncio application. Consider delegating the insertion to a thread pool:

- result = client.insert(collection_name=collection_name, data=insert_data)
+ loop = asyncio.get_running_loop()
+ result = await loop.run_in_executor(
+     None, lambda: client.insert(collection_name=collection_name, data=insert_data)
+ )

cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (2)
100-126
: ⚠️ Potential issue
upload_points is never awaited – points are not stored

AsyncQdrantClient.upload_points() returns a coroutine that must be awaited. Without the await, the upload is silently skipped and the surrounding try block will never raise the expected UnexpectedResponse.

- client.upload_points(collection_name=collection_name, points=points)
+ await client.upload_points(collection_name=collection_name, points=points)
163-205
: 🛠️ Refactor suggestion
Double client close and unreachable finally

await client.close() appears both:
- inside the try block (line 183)
- again in the finally clause (line 204)

The second call will attempt to close an already-closed connection and can raise RuntimeError: Event loop is closed under heavy load. Remove the close inside the try block and keep the finally guard.

- await client.close()
cognee/infrastructure/databases/graph/networkx/adapter.py (2)
108-113
: 🛠️ Refactor suggestion
Edge/node ID inconsistency will create orphan edges

add_edges() converts UUIDs to strings before storage, yet every read-side API (get_edges, has_node, etc.) now expects raw UUID objects. As a result:
- has_edge() will succeed (edge keys are strings) while has_node() uses UUIDs.
- get_edges(UUID) will likely return an empty list because the underlying graph stores the string version.

Store IDs in a single, canonical form (preferably UUID) everywhere, or perform explicit conversion on both the write and read paths.

Also applies to: 139-143
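A small sketch of the kind of normalisation helpers this is asking for (the function names are illustrative, not the adapter's actual API):

from uuid import UUID

def as_uuid(node_id) -> UUID:
    # Canonical form shared by the write and read paths.
    return node_id if isinstance(node_id, UUID) else UUID(str(node_id))

def as_storage_key(node_id) -> str:
    # If NetworkX keeps string keys, route every lookup through the same
    # canonical form so add_edges, has_node and get_edges agree.
    return str(as_uuid(node_id))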
303-304
: ⚠️ Potential issue
node_link_data extra argument is invalid

nx.readwrite.json_graph.node_link_data does not take an edges= parameter; unexpected kwargs raise TypeError on recent NetworkX versions.

-graph_data = nx.readwrite.json_graph.node_link_data(self.graph, edges="links")
+graph_data = nx.readwrite.json_graph.node_link_data(self.graph)

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)
90-97
: 🛠️ Refactor suggestion
Quadratic look-up when pairing embeddings with data points

Using data_points.index(data_point) inside a list comprehension yields O(n²) complexity and gives the wrong vector when duplicate instances appear.

-def convert_to_weaviate_data_points(data_point: DataPoint):
-    vector = data_vectors[data_points.index(data_point)]
+def convert_to_weaviate_data_points(pair):
+    data_point, vector = pair
 @@
-data_points = [convert_to_weaviate_data_points(data_point) for data_point in data_points]
+data_points = [
+    convert_to_weaviate_data_points(p)
+    for p in zip(data_points, data_vectors)
+]
♻️ Duplicate comments (1)
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (1)
82-88
: Same issue in bulk-insert query

n:node.label suffers from the same dynamic-label problem and will attempt to create the literal label node.label. Ensure labels are interpolated before sending the Cypher and properly validated.
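A rough sketch of what validated interpolation could look like for the bulk query, assuming the batch is grouped by label first (the helper name and the grouping step are assumptions):

import re

def build_bulk_merge_query(label: str) -> str:
    # Labels cannot be passed as Cypher parameters, so whitelist then interpolate.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"Invalid label: {label}")
    return f"""
    UNWIND $nodes AS node
    MERGE (n {{id: node.node_id}})
    ON CREATE SET n:`{label}`, n += node.properties, n.updated_at = timestamp()
    ON MATCH SET n:`{label}`, n += node.properties, n.updated_at = timestamp()
    """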
🧹 Nitpick comments (16)
cognee/modules/retrieval/exceptions/__init__.py (1)
7-7
: Add imported exceptions to __all__ or use them directly

The imported exceptions SearchTypeNotSupported and CypherSearchError are currently unused in this module. Either add them to an __all__ list to make them available to users of this module, or use a redundant alias import pattern. For example:

 from .exceptions import SearchTypeNotSupported, CypherSearchError
+
+__all__ = ["SearchTypeNotSupported", "CypherSearchError"]
🧰 Tools
🪛 Ruff (0.8.2)
7-7: .exceptions.SearchTypeNotSupported imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
7-7: .exceptions.CypherSearchError imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
cognee/exceptions/exceptions.py (2)
15-16
: Great addition of logging control parameters

Adding log and log_level parameters provides better flexibility for controlling error logging behavior. Consider using an enum for log_level instead of string literals to prevent potential typos and provide better IDE support.
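A possible shape for such an enum (the LogLevel name and members are illustrative, not part of the PR):

from enum import Enum

class LogLevel(str, Enum):
    ERROR = "ERROR"
    WARNING = "WARNING"
    INFO = "INFO"
    DEBUG = "DEBUG"

# CogneeApiError.__init__ could then accept log_level: LogLevel = LogLevel.ERROR,
# so a typo such as "WARN" fails loudly instead of being silently ignored.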
23-30
: Well-implemented conditional logging

The conditional logging implementation correctly handles different log levels, providing granular control over how exceptions are logged. A small optimization would be to use a dictionary mapping or a match/case statement (if using Python 3.10+) instead of multiple if-elif conditions:

- if log and (log_level == "ERROR"):
-     logger.error(f"{self.name}: {self.message} (Status code: {self.status_code})")
- elif log and (log_level == "WARNING"):
-     logger.warning(f"{self.name}: {self.message} (Status code: {self.status_code})")
- elif log and (log_level == "INFO"):
-     logger.info(f"{self.name}: {self.message} (Status code: {self.status_code})")
- elif log and (log_level == "DEBUG"):
-     logger.debug(f"{self.name}: {self.message} (Status code: {self.status_code})")
+ if log:
+     log_msg = f"{self.name}: {self.message} (Status code: {self.status_code})"
+     if log_level == "ERROR":
+         logger.error(log_msg)
+     elif log_level == "WARNING":
+         logger.warning(log_msg)
+     elif log_level == "INFO":
+         logger.info(log_msg)
+     elif log_level == "DEBUG":
+         logger.debug(log_msg)

cognee/modules/data/methods/__init__.py (1)
10-10
: Address unused import warning.

The get_unique_dataset_id function is imported but not used in this file. Since this appears to be intentionally exposing the function through the module's public API, consider one of these approaches:
- Add it to __all__ to explicitly mark it as part of the public API
- Use a redundant alias to indicate intentional re-export

This will also resolve the F401 lint warning.

+# Define public API
+__all__ = [
+    "create_dataset",
+    "get_dataset", "get_datasets", "get_datasets_by_name", "get_dataset_data", "get_data", "get_unique_dataset_id",
+    "delete_dataset", "delete_data"
+]

🧰 Tools
🪛 Ruff (0.8.2)
10-10: .get_unique_dataset_id.get_unique_dataset_id imported but unused; consider removing, adding to __all__, or using a redundant alias (F401)
notebooks/llama_index_cognee_integration.ipynb (1)
6-8
: Updated notebook string formatting for consistency.

The notebook has been updated to convert single-line string sources into single-element lists of strings, which standardizes the JSON structure. This is a non-functional change that improves consistency in the notebook format.
Also applies to: 62-64, 199-201, 219-221
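For illustration, the change amounts to the following (the cell content is hypothetical):

cell_before = {"cell_type": "markdown", "source": "# Cognee and LlamaIndex"}
cell_after = {"cell_type": "markdown", "source": ["# Cognee and LlamaIndex"]}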
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
332-334
: Explicit schema specification improves predictability

Replacing dynamic schema discovery with a fixed list of schemas (["public", "public_staging"]) makes the database deletion operation more predictable and controlled. However, be cautious about hardcoding schema names, as it might miss any new schemas added in the future.
Consider adding a comment explaining why these specific schemas are targeted and what implications this might have for future schema additions.
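For example, the rationale could live right next to the hardcoded list (the comment wording is only a suggestion):

# Only cognee-managed schemas are dropped here; any schema added in the future
# must be listed explicitly or it will survive the delete-database routine.
schema_list = ["public", "public_staging"]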
examples/data/car_and_tech_companies.txt (1)
1-37
: Sample text data suitable for demonstrations.

This new file provides structured, descriptive text examples that can be used for testing and demonstrating the system's capabilities. However, there's a minor grammatical error:

-Each of these car manufacturer contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence.
+Each of these car manufacturers contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence.

🧰 Tools
🪛 LanguageTool
[duplication] ~2-~2: Possible typo: you repeated a word.
Context: text_1 = """ 1. Audi Audi is known for its modern designs and adv...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~5-~5: Possible typo: you repeated a word.
Context: ...ns to high-performance sports cars. 2. BMW BMW, short for Bayerische Motoren Werke, is...(ENGLISH_WORD_REPEAT_RULE)
[style] ~6-~6: Consider using a more concise synonym.
Context: ... reflects that commitment. BMW produces a variety of cars that combine luxury with sporty pe...(A_VARIETY_OF)
[duplication] ~8-~8: Possible typo: you repeated a word.
Context: ...ine luxury with sporty performance. 3. Mercedes-Benz Mercedes-Benz is synonymous with luxury and quality. ...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~11-~11: Possible typo: you repeated a word.
Context: ... catering to a wide range of needs. 4. Porsche Porsche is a name that stands for high-performa...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~14-~14: Possible typo: you repeated a word.
Context: ...o value both performance and style. 5. Volkswagen Volkswagen, which means "people's car" in German, ...(ENGLISH_WORD_REPEAT_RULE)
[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...(THIS_NNS)
[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)
[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...design excellence. """ text_2 = """ 1. Apple Apple is renowned for its innovative consumer...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~27-~27: Possible typo: you repeated a word.
Context: ... in shaping the internet landscape. 3. Microsoft Microsoft Corporation has been a dominant force i...(ENGLISH_WORD_REPEAT_RULE)
[style] ~28-~28: Consider using a synonym to be more concise.
Context: ...n both business and personal computing. In recent years, Microsoft has expanded into cloud comp...(IN_RECENT_STYLE)
[uncategorized] ~31-~31: You might be missing the article “the” here.
Context: ...or innovation continues to reshape both retail and technology sectors. 5. Meta Meta, ...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[duplication] ~33-~33: Possible typo: you repeated a word.
Context: ...both retail and technology sectors. 5. Meta Meta, originally known as Facebook, revoluti...(ENGLISH_WORD_REPEAT_RULE)
cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)
66-73
: Consider using contextlib.suppress for cleaner error handling.

The try-except-pass pattern can be more concisely expressed using contextlib.suppress.

- try:
-     await memory_fragment.project_graph_from_db(
-         graph_engine,
-         node_properties_to_project=properties_to_project,
-         edge_properties_to_project=["relationship_name"],
-     )
- except EntityNotFoundError:
-     pass
+ from contextlib import suppress
+ with suppress(EntityNotFoundError):
+     await memory_fragment.project_graph_from_db(
+         graph_engine,
+         node_properties_to_project=properties_to_project,
+         edge_properties_to_project=["relationship_name"],
+     )

🧰 Tools
🪛 Ruff (0.8.2)
66-73: Use contextlib.suppress(EntityNotFoundError) instead of try-except-pass. Replace with contextlib.suppress(EntityNotFoundError) (SIM105)
cognee/modules/pipelines/operations/run_tasks_base.py (1)
31-34
: Simplify dictionary key check.

The check for "context" in function parameters can be simplified.

- has_context = any(
-     [key == "context" for key in inspect.signature(running_task.executable).parameters.keys()]
- )
+ has_context = "context" in inspect.signature(running_task.executable).parameters

🧰 Tools
🪛 Ruff (0.8.2)
32-32: Use key in dict instead of key in dict.keys(). Remove .keys() (SIM118)
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1)
165-168
: Inconsistent error type

InvalidValueError inherits from CogneeApiError, which by default logs at ERROR level. The invalid-argument condition here is not exceptional and should normally return HTTP 422 without spamming error logs. Consider using log=False when raising:

raise InvalidValueError(
    message="One of query_text or query_vector must be provided!", log=False
)

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
250-252
: Guard against empty search result earlier

A quick exit before normalisation avoids unnecessary CPU cycles:

- if len(vector_list) == 0:
-     return []
+ if not vector_list:
+     return []

cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (2)
76-82
: Minor: avoid double round-trip in get_collection

await connection.open_table already throws if the table is missing; the preceding has_collection check incurs an extra network call.
Consider catching the library-specific “not found” error instead of pre-checking to reduce latency.
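A sketch of that shape, assuming open_table raises when the table is missing (the exception handling below is an assumption and should be narrowed to the library's real "not found" error):

async def get_collection(self, collection_name: str):
    connection = await self.get_connection()
    try:
        return await connection.open_table(collection_name)
    except Exception as error:  # replace with the library-specific "table not found" error
        raise ValueError(f"Collection '{collection_name}' not found") from error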
205-211
: Batch deletes can be executed in a single predicate

Looping one DELETE per ID issues N commits:

for data_point_id in data_point_ids:
    await collection.delete(f"id = '{data_point_id}'")

Instead, delete once:

ids = "', '".join(data_point_ids)
await collection.delete(f"id IN ('{ids}')")

Reduces I/O and lock contention.
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)
48-50
: Incorrect return type annotation for embeddings

embed_text returns a list of vectors (list[list[float]]), but the signature advertises List[float]. This misleads type-checkers and maintainers.

-async def embed_data(self, data: List[str]) -> List[float]:
+async def embed_data(self, data: List[str]) -> List[List[float]]:

cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)
243-249
: Keep default limits consistent with search

search() now defaults to limit=15, while batch_search() keeps limit=5. Consider aligning the defaults to avoid surprising behaviour for callers that switch between the two helpers.

-async def batch_search(
-    self,
-    collection_name: str,
-    query_texts: List[str],
-    limit: int = 5,
+async def batch_search(
+    self,
+    collection_name: str,
+    query_texts: List[str],
+    limit: int = 15,

cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (1)
430-444
: Mutable default argument detected

Using {} as a default is shared across invocations.

-def serialize_properties(self, properties=dict()):
+def serialize_properties(self, properties: Optional[dict] = None):
     serialized_properties = {}
+
+    if properties is None:
+        properties = {}

[B006]
🧰 Tools
🪛 Ruff (0.8.2)
430-430: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
🛑 Comments failed to post (14)
cognee/modules/observability/get_observe.py (1)
5-11: 🛠️ Refactor suggestion
New centralized observability function lacks handling for all observer types
This new function centralizes the logic for obtaining the observe decorator, which is a good practice. However, it only handles the LANGFUSE observer type and doesn't provide implementations for other types like LLMLITE or LANGSMITH, nor does it include a default fallback. Add handling for all observer types defined in the Observer enum and include a default fallback:

 def get_observe():
     monitoring = get_base_config().monitoring_tool

     if monitoring == Observer.LANGFUSE:
         from langfuse.decorators import observe

         return observe
+    elif monitoring == Observer.LLMLITE:
+        # Import and return the appropriate observe decorator
+        pass
+    elif monitoring == Observer.LANGSMITH:
+        # Import and return the appropriate observe decorator
+        pass
+    else:
+        # Return a no-op decorator as fallback
+        def noop_observe(*args, **kwargs):
+            def decorator(func):
+                return func
+
+            return decorator if callable(args[0]) else decorator
+
+        return noop_observe

This ensures the function works for all configured observer types and provides a safe fallback.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents (early access)
In cognee/modules/observability/get_observe.py around lines 5 to 11, the get_observe function only handles the LANGFUSE observer type and lacks implementations for other observer types like LLMLITE and LANGSMITH, as well as a default fallback. Update the function to include conditional branches for each observer type defined in the Observer enum, importing and returning the appropriate observe decorator for each. Also, add a default fallback return value or decorator to ensure the function safely handles unexpected or unset observer types.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
227-236:
⚠️ Potential issue
Row unpacking bug – attributes accessed on wrong object

session.execute(query) returns Result rows shaped like (<PGVectorDataPoint>, similarity). Inside the loop, vector is the row, not the model instance, therefore vector.id and vector.payload are undefined.

- for vector in closest_items.all():
-     vector_list.append(
-         {
-             "id": parse_id(str(vector.id)),
-             "payload": vector.payload,
-             "_distance": vector.similarity,
-         }
-     )
+ for row in closest_items.all():
+     data_point, similarity = row
+     vector_list.append(
+         {
+             "id": parse_id(str(data_point.id)),
+             "payload": data_point.payload,
+             "_distance": similarity,
+         }
+     )

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.query = select( PGVectorDataPoint, PGVectorDataPoint.c.vector.cosine_distance(query_vector).label("similarity"), ).order_by("similarity") if limit > 0: query = query.limit(limit) # Find closest vectors to query_vector closest_items = await session.execute(query) for row in closest_items.all(): data_point, similarity = row vector_list.append( { "id": parse_id(str(data_point.id)), "payload": data_point.payload, "_distance": similarity, } )
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py around lines 227 to 236, the result of session.execute(query) returns rows as tuples of (PGVectorDataPoint, similarity), but the code incorrectly treats each row as a PGVectorDataPoint instance. To fix this, unpack each row into its components (e.g., vector, similarity) before accessing attributes like id or payload on the vector object.
cognee-mcp/src/server.py (3)
46-48:
⚠️ Potential issue
Chain the original exception when re-raising

Static analysis (B904) correctly points out that the root exception is discarded:

except Exception as e:
    ...
    raise ValueError(f"Failed to cognify: {str(e)}")

Using raise … from e preserves the traceback and simplifies debugging.

- except Exception as e:
-     logger.error("Cognify process failed.")
-     raise ValueError(f"Failed to cognify: {str(e)}")
+ except Exception as e:
+     logger.error("Cognify process failed.")
+     raise ValueError(f"Failed to cognify: {e}") from e
🪛 Ruff (0.8.2)
48-48: Within an
except
clause, raise exceptions withraise ... from err
orraise ... from None
to distinguish them from errors in exception handling(B904)
🤖 Prompt for AI Agents (early access)
In cognee-mcp/src/server.py around lines 46 to 48, the exception is re-raised without chaining the original exception, which discards the traceback. Modify the raise statement to use "raise ValueError(...) from e" to preserve the original exception context and traceback for better debugging.
111-120:
⚠️ Potential issueValidate
search_type
to preventKeyError
crashes
SearchType[search_type.upper()]
will throw aKeyError
for any unexpected value, crashing the tool.Add an explicit whitelist or catch the error:
- search_results = await cognee.search( - query_type=SearchType[search_type.upper()], query_text=search_query - ) + try: + query_type = SearchType[search_type.upper()] + except KeyError: + raise ValueError( + f"Unsupported search_type '{search_type}'. " + f"Allowed values: {[t.name.lower() for t in SearchType]}" + ) from None + + search_results = await cognee.search( + query_type=query_type, query_text=search_query + )Also applies to: 126-128
🤖 Prompt for AI Agents (early access)
In cognee-mcp/src/server.py around lines 111 to 120, the code uses SearchType[search_type.upper()] without validating search_type, which can cause a KeyError if the value is unexpected. To fix this, add explicit validation by checking if search_type.upper() is in the allowed SearchType keys before using it, or wrap the access in a try-except block to catch KeyError and handle it gracefully. Apply the same validation or error handling to lines 126-128 where SearchType is also accessed.
50-56: 🛠️ Refactor suggestion
Background tasks run “fire-and-forget” – add error handling to avoid silent failures
asyncio.create_task() detaches the pipeline tasks, but unhandled exceptions inside them are never surfaced. Attach a done-callback, or wrap the whole task in asyncio.create_task(_task, name=…) and inspect task.exception() to log failures. Example:

task = asyncio.create_task(cognify_task(...), name="cognify_task")
task.add_done_callback(
    lambda t: logger.error("Cognify crashed: %s", t.exception()) if t.exception() else None
)

Also applies to: 88-94
🤖 Prompt for AI Agents (early access)
In cognee-mcp/src/server.py around lines 50 to 56, the asyncio.create_task call runs the cognify_task as a fire-and-forget background task without error handling, causing exceptions to be silently ignored. To fix this, assign the created task to a variable, give it a name, and add a done-callback that checks if the task raised an exception; if so, log the error using the logger. Apply the same pattern to the similar code block at lines 88 to 94.
cognee/infrastructure/databases/graph/networkx/adapter.py (1)
45-46:
⚠️ Potential issue
Type-hint drift from the abstract interface

GraphDBInterface.has_node() still expects node_id: str, but the adapter signature was changed to UUID. This breaks LSP compliance and will confuse call-sites still passing strings. Align the types across all adapters (or accept both and cast internally) and update the interface if the project is truly migrating to UUIDs.
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/networkx/adapter.py around lines 45 to 46, the method has_node uses node_id typed as UUID, but the abstract interface expects a string, causing a type-hint mismatch and breaking Liskov Substitution Principle compliance. To fix this, align the type hints by either changing the adapter method signature to accept a string as in the interface or update the interface to accept UUID if migrating to UUIDs. Alternatively, modify the adapter to accept both types and internally cast strings to UUID before use, ensuring consistent type usage across all adapters and the interface.
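Until the interface itself is settled, a stopgap sketch that accepts both forms and normalises internally could look like this (the signature and body are illustrative only):

from uuid import UUID
from typing import Union

async def has_node(self, node_id: Union[str, UUID]) -> bool:
    # Accept the old str form and the new UUID form during the migration.
    normalized = node_id if isinstance(node_id, UUID) else UUID(str(node_id))
    return self.graph.has_node(normalized)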
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)
43-47: 🛠️ Refactor suggestion
Avoid reconnecting on every call to get_client

await self.client.connect() is executed each time get_client() is called. For long-lived adapters this results in unnecessary hand-shakes and connection churn, hurting latency and exhausting connection limits.

 async def get_client(self):
-    await self.client.connect()
-    return self.client
+    if not self.client.is_connected():  # pseudo-API – adjust to real attribute
+        await self.client.connect()
+    return self.client

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py around lines 43 to 47, the get_client method calls await self.client.connect() every time it is invoked, causing unnecessary reconnections. Modify the method to check if the client is already connected before calling connect, or establish the connection once during adapter initialization and simply return the connected client in get_client without reconnecting.
116-122:
⚠️ Potential issue
Missing await on async API – risk of silent coroutine

collection.data.exists and the subsequent update/insert helpers are asynchronous in the async Weaviate client. Calling them without await returns a coroutine, so the condition is always truthy, which breaks the upsert logic.

-if collection.data.exists(data_point.uuid):
-    return await collection.data.update(
+if await collection.data.exists(data_point.uuid):
+    return await collection.data.update(

Apply the same pattern to collection.data.insert in the else branch if that method is also async.

📝 Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.if await collection.data.exists(data_point.uuid): return await collection.data.update( uuid=data_point.uuid, vector=data_point.vector, properties=data_point.properties, references=data_point.references, )
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py around lines 116 to 122, the calls to collection.data.exists and collection.data.update are asynchronous but missing the await keyword, causing them to return coroutines and break the upsert logic. Add await before collection.data.exists and collection.data.update calls to properly await their completion. Also, check if collection.data.insert in the else branch is async and add await there as well to ensure correct asynchronous behavior.
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (6)
255-258:
⚠️ Potential issue
Relationship type accessed via indexing – use the .type attribute

The Neo4j driver returns neo4j.graph.Relationship; subscripting it (rel[1]) is undefined behaviour and will raise TypeError. Retrieve the type with rel.type (or type(rel) in Memgraph).

-(result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]})
+(
+    result["n"]["id"],
+    result["m"]["id"],
+    {"relationship_name": result["r"].type},
+)

Apply the same fix in get_connections.
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around lines 255 to 258, the code accesses the relationship type using indexing (result["r"][1]), which is incorrect and causes a TypeError. Replace this with accessing the relationship type via the .type attribute (result["r"].type). Also, apply the same fix in the get_connections method and similarly update lines 387 to 393 where this pattern occurs.
67-78: 🛠️ Refactor suggestion
Dynamic labels are not parameterisable
Cypher (incl. Memgraph) does not allow label names to be supplied via parameters (e.g.
node:$node_label
).
This query will fail at compile time.Replace with string interpolation that safely whitelists labels:
label = type(node).__name__ if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label): raise ValueError(f"Invalid label: {label}") query = f""" MERGE (node {{id: $node_id}}) ON CREATE SET node:`{label}`, node += $properties, node.updated_at = timestamp() ON MATCH SET node:`{label}`, node += $properties, node.updated_at = timestamp() RETURN ID(node) AS internal_id, node.id AS nodeId """🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around lines 67 to 78, the Cypher query incorrectly uses a parameter for the node label, which is not supported and causes a compile-time error. Fix this by extracting the label as a string from the node's type name, validating it against a whitelist regex to ensure it is a safe label, and then use Python string interpolation to insert the label directly into the query string instead of using a parameter. Keep other parameters like node_id and properties as query parameters.
414-417:
⚠️ Potential issueInterpolating
edge_label
andid
directly into labels creates invalid Cypher & risk of injectionMATCH (node:`{id}`)<-[r:{edge_label}]-(successor)
id
is undefined, and label injection is unsafe.Refactor to parameterised properties instead of dynamic labels:
UNWIND $node_ids AS nid MATCH (node {id: nid})<-[r]-(successor) WHERE type(r) = $edge_label DELETE r🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around lines 414 to 417, the Cypher query incorrectly interpolates `id` and `edge_label` directly into labels, causing invalid syntax and injection risks. Replace dynamic label usage with parameterized properties by matching nodes with a property filter (e.g., `node {id: nid}`) and filter relationships by type using a WHERE clause with a parameter for `edge_label`. Adjust the query to use UNWIND with parameterized node IDs and pass `edge_label` as a parameter to ensure safe and valid Cypher syntax.
122-124:
⚠️ Potential issueMalformed Cypher pattern – double curly braces
MATCH (node: {{id: $node_id}})
is invalid syntax.
Probably meant to match on a property:-query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node" +query = "MATCH (node {id: $node_id}) DETACH DELETE node"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.query = "MATCH (node {id: $node_id}) DETACH DELETE node" params = {"node_id": sanitized_id}
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around lines 122 to 124, the Cypher query uses double curly braces in the pattern MATCH (node: {{id: $node_id}}), which is invalid syntax. Replace the double curly braces with a property match syntax by using MATCH (node {id: $node_id}) to correctly match nodes with the given id property.
3-5:
⚠️ Potential issue
ERROR
symbol not exported fromlogging_utils
cognee.shared.logging_utils
only exposesget_logger
; importingERROR
raisesImportError
at import time.-from cognee.shared.logging_utils import get_logger, ERROR +import logging +from cognee.shared.logging_utils import get_loggerand call
get_logger("MemgraphAdapter", level=logging.ERROR)
instead.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.import json import logging from cognee.shared.logging_utils import get_logger import asyncio
🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py lines 3 to 5, the import of ERROR from cognee.shared.logging_utils is invalid because ERROR is not exported. Remove the import of ERROR and instead import the standard logging module. Then, when calling get_logger, pass the log level as logging.ERROR using the standard logging module.
400-405: 🛠️ Refactor suggestion
Undefined variables injected into f-string
Inside the f-string
{id: nid}
tries to evaluateid
(the built-in function) andnid
(undefined).
Use double braces or parameters instead:-query = f""" -UNWIND $node_ids AS nid -MATCH (node {id: nid})-[r]->(predecessor) -WHERE type(r) = $edge_label -DELETE r; -""" +query = """ +UNWIND $node_ids AS nid +MATCH (node {id: nid})-[r]->(predecessor) +WHERE type(r) = $edge_label +DELETE r +"""🤖 Prompt for AI Agents (early access)
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around lines 400 to 405, the f-string incorrectly injects undefined variables by using {id: nid}, which tries to evaluate the built-in id function and an undefined variable nid. To fix this, replace the f-string expression with parameterized query syntax or use double braces to escape the curly braces so that the query treats {id: nid} as a literal map pattern rather than Python expressions.
@borisarzentar neat caching, might still make sense to leave a bit of documentation on which extra is what -- I mean it could also be in
FROM python:3.11-slim

# Define Poetry extras to install
ARG POETRY_EXTRAS="\
Might be good to define the extras used by the Dockerfile at the beginning, like before; it will make it easier to update and maintain later on, especially since the extras are now referenced in two different places in the Dockerfile.
We don't pass all extras at once like we did with Poetry. Here we have to pass them one by one, so I would have to iterate through the extras somehow and add each separately. I don't want to spend time on that now; we can change it later.
@@ -127,6 +126,7 @@ dev = [
     "mkdocs-minify-plugin>=0.8.0,<0.9",
     "mkdocstrings[python]>=0.26.2,<0.27",
 ]
+debug = ["debugpy==1.8.9"]
Maybe add this under dev dependencies?
It was there, but sometimes I need just debugpy and nothing else. And in the case of the Docker container, we have a debug option there, so we now install only this extra instead of the whole dev group.
 if not data:
     data: list[Data] = await get_dataset_data(dataset_id=dataset_id)

 # async with update_status_lock: TODO: Add UI lock to prevent multiple backend requests
 if isinstance(dataset, Dataset):
-    task_status = await get_pipeline_status([dataset_id])
+    task_status = await get_pipeline_status([dataset_id], pipeline_name)
I think this is a breaking change for cognee-mcp's checking of pipeline statuses, as there both add and cognify are under one pipeline.
They are not technically one pipeline, they are just called from one function. But two pipelines will run. What do you think will break?
I didn't initially see that cognee-mcp was updated as well; you can disregard this comment.
@@ -144,7 +144,9 @@ async def cognify_status():
     """Get status of cognify pipeline"""
     with redirect_stdout(sys.stderr):
         user = await get_default_user()
-        status = await get_pipeline_status([await get_unique_dataset_id("main_dataset", user)])
+        status = await get_pipeline_status(
+            [await get_unique_dataset_id("main_dataset", user)], "cognify_pipeline"
+        )
What about the add pipeline status? That also needs to be checked along with cognify for the cognee-mcp cognify tool.
When both are complete the status should be complete, and if the add pipeline is still working the status needs to be processing.
What would happen now if the user checked the status while the add pipeline is running?
In mcp, add and cognify are merged into one, so I think that checking the cognify_pipeline status should be enough. Add usually happens very quickly, so I'm not sure that can happen.
If the user checks right away it will most likely happen, which is pretty realistic for them to do. It would be great to treat the returned status as one pipeline then as well, by checking the state of both and handling the return based on the combined states (or something like this).
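A rough sketch of what a combined status could look like (the pipeline names and status strings below are assumptions, not the current API):

async def combined_cognify_status(dataset_id):
    # Treat add + cognify as one logical pipeline for the MCP status tool.
    add_status = await get_pipeline_status([dataset_id], "add_pipeline")
    cognify_status = await get_pipeline_status([dataset_id], "cognify_pipeline")

    statuses = [add_status, cognify_status]
    if any(s == "DATASET_PROCESSING_STARTED" for s in statuses):
        return "DATASET_PROCESSING_STARTED"
    if all(s == "DATASET_PROCESSING_COMPLETED" for s in statuses):
        return "DATASET_PROCESSING_COMPLETED"
    return cognify_status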
Extension of scope / unclear scope; let's create a ticket for this one and merge with the existing functionality.
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.