
fix: avoid extra node creation on informercache delay #1311


Merged — 4 commits merged into kaito-project:main on Jul 25, 2025

Conversation

zhuangqh (Collaborator)

Reason for Change:

In scenarios where the informer cache experiences significant delay, the controller may mistakenly create redundant nodes due to stale state.

This change leverages the expectations mechanism: the controller now reconciles only when expectations are fully satisfied, ensuring the observed state is sufficiently up to date.
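For illustration, a minimal sketch of the resulting gate in Reconcile (the expectations method names follow the upstream Kubernetes ControllerExpectations API that this PR vendors; the surrounding identifiers are simplified, not the exact PR code):

```go
// Sketch only: gate scaling work behind satisfied expectations so a stale
// informer cache cannot under-count NodeClaims and trigger extra creates.
func (c *WorkspaceReconciler) reconcileNodes(ctx context.Context, wObj *kaitov1beta1.Workspace) error {
	key := client.ObjectKeyFromObject(wObj).String()

	// If creations we initiated have not yet been observed by the informer,
	// skip this pass; the NodeClaim create event will re-queue the workspace.
	if !c.expectations.SatisfiedExpectations(c.klogger, key) {
		return nil
	}

	// Safe point: the cache reflects every create we issued, so counting
	// existing NodeClaims and creating only the difference is race-free.
	// ... count nodes and create the missing ones ...
	return nil
}
```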

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Fixes #1300

Notes for Reviewers:

reference: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go
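As context for reviewers, a short sketch of the expectations lifecycle this pattern relies on (method names match the upstream ControllerExpectations; assuming exp is the vendored expectations object and logger a klog.Logger — the key and counts are illustrative):

```go
key := "default/my-workspace" // controller key: namespace/name of the owning workspace

// Before issuing creates, record how many objects we are about to create.
exp.ExpectCreations(logger, key, 3)

// The event handler decrements the counter once per observed creation.
exp.CreationObserved(logger, key)

// Reconcile trusts cache counts only once all expected creations have been
// observed (or the expectation has expired, after which it counts as satisfied).
if exp.SatisfiedExpectations(logger, key) {
	// proceed with scaling decisions
}
```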


Signed-off-by: zhuangqh <zhuangqhc@gmail.com>

Title

Avoid extra node creation on informercache delay


Description

  • Added expectations mechanism to handle informer cache delays

  • Ensured controller only reconciles when expectations are fully satisfied

  • Created new pkg/utils/controller.go for managing expectations

  • Updated workspace_controller.go to use expectations and handle node creation events


Changes walkthrough 📝

Relevant files:

Enhancement

pkg/utils/controller.go — Added ControllerExpectations implementation
  • Added new file with ControllerExpectations implementation
  • Implemented methods for setting, observing, and managing expectations
  • +229/-0

pkg/workspace/controllers/workspace_event_handler.go — Added nodeClaimEventHandler (see the sketch after this walkthrough)
  • Added new file with nodeClaimEventHandler implementation
  • Implemented methods for handling NodeClaim events
  • +86/-0

Bug fix

pkg/workspace/controllers/workspace_controller.go — Integrated expectations into WorkspaceReconciler
  • Integrated ControllerExpectations into WorkspaceReconciler
  • Modified Reconcile method to check and wait for expectations
  • Updated createNewNodes to handle TerminalError
  • Removed unused getNewNodes function
  • Updated SetupWithManager to use custom nodeClaimEventHandler
  • +64/-77

Refactoring

pkg/workspace/controllers/workspace_controller_test.go — Removed outdated test
  • Removed outdated TestCreateAndValidateNodeClaimNode test
  • +0/-123
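To make the event-handler side of the walkthrough concrete, a hedged sketch of how a NodeClaim create event could satisfy an expectation (the handler shape follows controller-runtime's event handler interface; ownerWorkspace is a hypothetical helper, and the exact PR code may differ):

```go
// Sketch: on every observed NodeClaim creation, mark one expected create as
// observed and re-queue the owning workspace so Reconcile re-evaluates.
func (h *nodeClaimEventHandler) Create(ctx context.Context, evt event.CreateEvent,
	q workqueue.TypedRateLimitingInterface[reconcile.Request]) {
	nodeClaim, ok := evt.Object.(*karpenterv1.NodeClaim)
	if !ok {
		return
	}

	// Resolve the owning workspace; ownerWorkspace is a hypothetical helper
	// (the real code derives this from labels/owner references).
	req, found := ownerWorkspace(nodeClaim)
	if !found {
		return
	}

	// One expected creation has now reached the informer cache.
	h.expectations.CreationObserved(h.logger, req.String())

	// Re-queue so the gated Reconcile can run against the fresher cache.
	q.Add(reconcile.Request{NamespacedName: req})
}
```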

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The createNewNodes function returns a TerminalError, which might cause the reconciliation loop to stop prematurely if there are other tasks to be done after creating nodes.

```go
}
if updateErr := c.updateStatusConditionIfNotMatch(ctx, wObj, kaitov1beta1.WorkspaceConditionTypeSucceeded, metav1.ConditionFalse,
```
Code Duplication

The logic for handling NodeClaims in createAllNodeClaims and nodeClaimEventHandler.Create seems similar. Consider refactoring to avoid duplication (one possible extraction is sketched after the snippet below).

```go
for range count {
	var newNodeClaim *karpenterv1.NodeClaim

	// Set expectations before creating the NodeClaim in case the event handler
	// observes the NodeClaim creation event before the controller does.
	c.expectations.ExpectCreations(c.klogger, client.ObjectKeyFromObject(wObj).String(), 1)
	newNodeCreated := false
	err := retry.OnError(retry.DefaultRetry, func(err error) bool {
		return apierrors.IsAlreadyExists(err)
	}, func() error {
		newNodeClaim = nodeclaim.GenerateNodeClaimManifest(nodeOSDiskSize, wObj)
		err0 := nodeclaim.CreateNodeClaim(ctx, newNodeClaim, c.Client)
		if err0 == nil {
			newNodeCreated = true
		}
		return err0
	})
	if !newNodeCreated {
		// Decrement the expected number of creates because the informer won't observe this nodeclaim
		c.expectations.CreationObserved(c.klogger, client.ObjectKeyFromObject(wObj).String())
	}

	if err != nil {
		klog.ErrorS(err, "failed to create nodeClaim", "nodeClaim", newNodeClaim.Name)
		return nil, fmt.Errorf("failed to create nodeClaim %s: %w", newNodeClaim.Name, err)
	}
	klog.InfoS("NodeClaim created successfully", "nodeClaim", newNodeClaim.Name, "workspace", klog.KObj(wObj))

	nodeClaims = append(nodeClaims, newNodeClaim)
}
```
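One possible shape for that extraction, reusing the identifiers from the snippet above (a hedged sketch only; createOneNodeClaim is a hypothetical helper name, not code from the PR):

```go
// createOneNodeClaim bundles the expect/create/undo sequence so the batch
// creation loop and any event-driven path can share one implementation.
func (c *WorkspaceReconciler) createOneNodeClaim(ctx context.Context, wObj *kaitov1beta1.Workspace) (*karpenterv1.NodeClaim, error) {
	key := client.ObjectKeyFromObject(wObj).String()

	// Record the pending create first, in case the informer delivers the
	// creation event before CreateNodeClaim returns.
	c.expectations.ExpectCreations(c.klogger, key, 1)

	var newNodeClaim *karpenterv1.NodeClaim
	created := false
	err := retry.OnError(retry.DefaultRetry, apierrors.IsAlreadyExists, func() error {
		newNodeClaim = nodeclaim.GenerateNodeClaimManifest(nodeOSDiskSize, wObj)
		if createErr := nodeclaim.CreateNodeClaim(ctx, newNodeClaim, c.Client); createErr != nil {
			return createErr
		}
		created = true
		return nil
	})
	if !created {
		// The informer will never observe this claim, so undo the expectation.
		c.expectations.CreationObserved(c.klogger, key)
	}
	return newNodeClaim, err
}
```

With a helper like this, the loop body reduces to one call plus the append, and any single-claim path could reuse the same expectation bookkeeping.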
Removed Tests

The test TestCreateAndValidateNodeClaimNode has been removed. Ensure that the functionality covered by this test is still being tested elsewhere.

```go
t.Errorf("%s: selected Nodes %+v are different from the expected %+v", k, selectedNodesArray, tc.expected)
```


PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue

Fix incorrect loop syntax (Impact: Medium)

The for range count loop is incorrect. It should iterate over a range of numbers, not directly over count.

pkg/workspace/controllers/workspace_controller.go [483]

```diff
-for range count {
+for i := 0; i < count; i++ {
```

Suggestion importance [1-10]: 8

Why: The for range count loop is incorrect and will cause a compilation error. Changing it to for i := 0; i < count; i++ fixes the syntax and ensures proper iteration.

Correct expectation decrement logic (Impact: Medium)

Decrementing expectations when newNodeCreated is false is incorrect. This should only happen if the node creation was successful.

pkg/workspace/controllers/workspace_controller.go [501-503]

```diff
-if !newNodeCreated {
-    // Decrement the expected number of creates because the informer won't observe this nodeclaim
+if newNodeCreated {
     c.expectations.CreationObserved(c.klogger, client.ObjectKeyFromObject(wObj).String())
 }
```

Suggestion importance [1-10]: 8

Why: Decrementing expectations when newNodeCreated is false is incorrect. The decrement should only occur if the node creation was successful.

General

Use RequeueAfter instead of TerminalError (Impact: Medium)

Returning a TerminalError here might prevent further reconciliation. Consider returning a reconcile.Result with RequeueAfter instead.

pkg/workspace/controllers/workspace_controller.go [370]

```diff
-return reconcile.TerminalError(errors.New("waiting for new nodes to be created"))
+requeueTime := time.Second * 5
+return reconcile.Result{RequeueAfter: requeueTime}, nil
```

Suggestion importance [1-10]: 7

Why: Returning a TerminalError stops further reconciliation, which might not be the desired behavior. Using RequeueAfter allows the controller to retry after a specified duration.

Update condition to reflect success (Impact: Medium)

The condition type and message seem inconsistent. If nodeclaims are created successfully, the condition should reflect success, not failure.

pkg/workspace/controllers/workspace_controller.go [529-533]

```diff
-if updateErr := c.updateStatusConditionIfNotMatch(ctx, wObj, kaitov1beta1.ConditionTypeResourceStatus, metav1.ConditionFalse,
-    "workspaceResourceCreated", "nodeclaims created"); updateErr != nil {
+if updateErr := c.updateStatusConditionIfNotMatch(ctx, wObj, kaitov1beta1.ConditionTypeResourceStatus, metav1.ConditionTrue,
+    "workspaceResourceCreated", "nodeclaims created successfully"); updateErr != nil {
     klog.ErrorS(updateErr, "failed to update workspace status", "workspace", klog.KObj(wObj))
     return updateErr
 }
```

Suggestion importance [1-10]: 7

Why: The condition type and message should reflect the success of nodeclaim creation. Changing the condition to ConditionTrue and updating the message to indicate success improves clarity.

@chewong (Collaborator) left a comment

LGTM

Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
@zhuangqh (Collaborator, Author)

Adopted suggestion No. 4 (Update condition to reflect success).

@rambohe-ch (Collaborator)

/LGTM

@Fei-Guo requested a review from Copilot on July 25, 2025 05:11

@Copilot (Contributor) left a comment

Pull Request Overview

This PR implements an expectations mechanism to prevent duplicate node creation when the informer cache experiences delays. The controller now uses expectations to track pending NodeClaim creations and only proceeds with reconciliation when expectations are satisfied, ensuring the observed state is up-to-date before making scaling decisions.

Key changes include:

• Implementation of a ControllerExpectations system based on Kubernetes patterns
• Modified workspace controller to use expectations before creating new nodes
• Updated event handling to track NodeClaim creation events and satisfy expectations

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

• pkg/workspace/controllers/workspace_event_handler.go — New event handler that tracks NodeClaim creation events to satisfy controller expectations
• pkg/workspace/controllers/workspace_controller_test.go — Removes obsolete test for node creation validation logic
• pkg/workspace/controllers/workspace_controller.go — Integrates expectations mechanism and modifies node creation flow to prevent duplicates
• pkg/utils/controller.go — Implements ControllerExpectations utility copied from Kubernetes core

@zhuangqh (Collaborator, Author)

The e2e test passed 3 times.

@zhuangqh merged commit 76e099e into kaito-project:main on Jul 25, 2025 — 12 checks passed
@zhuangqh deleted the fix-informercache-delay branch on July 25, 2025 05:50
Projects: Status — Done

Development

Successfully merging this pull request may close these issues:

• bug: create too many nodes when informercache latency is high

4 participants