proposal: Node Resource Balance Rescheduling (#2332) #2341
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2341 +/- ##
=======================================
Coverage 65.94% 65.94%
=======================================
Files 466 466
Lines 54879 54879
=======================================
Hits 36190 36190
Misses 16074 16074
Partials 2615 2615
It is consistent with the scheduler's **NodeResourcesBalancedAllocation** strategy and can be extended to incorporate additional resource types in the future.

A node is considered excessively fragmented when its fragmentation rate exceeds a threshold, which may adversely affect future pod scheduling on that node.
To simplify user configuration, the plugin computes the mean (μ) and standard deviation (σ) of fragmentation rates across the cluster and dynamically sets the threshold to μ + σ.
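As a rough illustration of the μ + σ rule described above, here is a minimal Go sketch. The function name `fragmentationThreshold` is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption; the proposal text does not say which one the plugin uses.

```go
package main

import (
	"fmt"
	"math"
)

// fragmentationThreshold returns mean + standard deviation of the given
// per-node fragmentation rates. ASSUMPTION: population standard deviation
// (divide by N); the proposal does not specify this. Returns 0 for no nodes.
func fragmentationThreshold(rates []float64) float64 {
	if len(rates) == 0 {
		return 0
	}
	var sum float64
	for _, r := range rates {
		sum += r
	}
	mean := sum / float64(len(rates))

	var sqDiff float64
	for _, r := range rates {
		d := r - mean
		sqDiff += d * d
	}
	std := math.Sqrt(sqDiff / float64(len(rates)))
	return mean + std
}

func main() {
	// Illustrative fragmentation rates for four nodes (made-up values).
	rates := []float64{0.05, 0.10, 0.15, 0.40}
	threshold := fragmentationThreshold(rates)
	fmt.Printf("threshold = %.4f\n", threshold)
	for i, r := range rates {
		if r > threshold {
			fmt.Printf("node %d is above the threshold (rate %.2f)\n", i, r)
		}
	}
}
```

Because the threshold is derived from the cluster's own distribution, only nodes that stand out from the rest are flagged, whatever the absolute level of fragmentation is.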
Is there a theoretical basis for using μ + σ as the default threshold, or is it just an empirical value?
- KoordinatorQoSClass
- PodDeletionCost
- EvictionCost
- NodeFragmentationRate (the node fragmentation rate is calculated under the hypothetical scenario of pod eviction)
This criterion needs more detail. For example, the Pod whose eviction results in the largest decrease in the node's fragmentation rate is evicted first.

#### Pod Selection
The plugin receives an external filter and uses it to classify the Pods on a node into two categories: removable Pods and non-removable Pods.
A secondary filter is then applied to the removable Pods, eliminating those whose eviction would increase the node's fragmentation rate.
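The secondary filter could look roughly like the Go sketch below. Everything here is hypothetical: the `Node` and `Pod` types are simplified stand-ins, and `fragmentationRate` assumes the rate is the population standard deviation of the CPU and memory allocation ratios (for two values, half their absolute difference), loosely following NodeResourcesBalancedAllocation. The proposal text quoted here does not give the exact formula.

```go
package main

import (
	"fmt"
	"math"
)

// Node and Pod are hypothetical, simplified types for illustration only.
type Node struct {
	CPUCapacity, MemCapacity   float64
	CPUAllocated, MemAllocated float64
}

type Pod struct {
	Name     string
	CPU, Mem float64
}

// fragmentationRate is an ASSUMED definition: the population standard
// deviation of the CPU and memory allocation ratios, which for two values
// is half their absolute difference.
func fragmentationRate(n Node) float64 {
	cpu := n.CPUAllocated / n.CPUCapacity
	mem := n.MemAllocated / n.MemCapacity
	return math.Abs(cpu-mem) / 2
}

// filterByFragmentation keeps only the removable Pods whose eviction would
// not increase the node's fragmentation rate.
func filterByFragmentation(n Node, removable []Pod) []Pod {
	before := fragmentationRate(n)
	var kept []Pod
	for _, p := range removable {
		after := n // copy the node, then simulate evicting p
		after.CPUAllocated -= p.CPU
		after.MemAllocated -= p.Mem
		if fragmentationRate(after) <= before {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	node := Node{CPUCapacity: 100, MemCapacity: 100, CPUAllocated: 90, MemAllocated: 50}
	pods := []Pod{
		{Name: "cpu-heavy", CPU: 30, Mem: 5},
		{Name: "mem-heavy", CPU: 5, Mem: 30},
	}
	for _, p := range filterByFragmentation(node, pods) {
		fmt.Println("eviction candidate:", p.Name)
	}
}
```

On a node that is CPU-heavy, evicting a memory-heavy Pod widens the gap between the two ratios, so that Pod is dropped from the candidate list under this assumed definition.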
In addition, we need to check whether there is a suitable target node for the evicted Pod. First, the target node must have enough available CPU and memory to accommodate the evicted Pod; the LowNodeLoad plugin in koord-descheduler can serve as a reference. Second, after the Pod is placed on the target node, that node's resource allocation should not become fragmented.
I feel this is a little complicated, or maybe I'm not thinking about it clearly. We can discuss further.
Let's discuss all the details clearly in this proposal. Then we can implement only the most important features in the first version of the code.
For example, suppose node A has a CPU allocation rate of 90% and a memory allocation rate of 50%, while node B has a CPU allocation rate of 50% and a memory allocation rate of 90%.
If a pod requests 15% CPU and 15% memory of the node, the pod may fail to be scheduled, even though the total resources on node A and node B are sufficient.
Such situations should be avoided.
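The node A / node B example can be checked with a few lines of Go. This is only a sketch: the `Alloc` type and `fits` helper are hypothetical, and it assumes both nodes have the same capacity so that percentages can be compared directly.

```go
package main

import "fmt"

// Alloc holds a node's allocation rates as a percentage of its capacity.
// Hypothetical type; assumes all nodes have equal capacity.
type Alloc struct {
	Name     string
	CPU, Mem int
}

// fits reports whether a request, given in percent of node capacity,
// still fits on the node for both CPU and memory.
func fits(n Alloc, cpuReq, memReq int) bool {
	return n.CPU+cpuReq <= 100 && n.Mem+memReq <= 100
}

func main() {
	nodes := []Alloc{
		{Name: "A", CPU: 90, Mem: 50},
		{Name: "B", CPU: 50, Mem: 90},
	}
	cpuReq, memReq := 15, 15

	freeCPU, freeMem := 0, 0
	for _, n := range nodes {
		freeCPU += 100 - n.CPU
		freeMem += 100 - n.Mem
		fmt.Printf("node %s: fits=%v\n", n.Name, fits(n, cpuReq, memReq))
	}
	// The aggregate free capacity exceeds the request, yet no single node fits it.
	fmt.Printf("aggregate free: cpu=%d%% mem=%d%%\n", freeCPU, freeMem)
}
```

Node A runs out of CPU and node B runs out of memory, so the pod fits on neither node even though the free capacity summed across both nodes is well above the request.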
Adding a figure here might make this clearer. I'll fix the one I posted in the issue earlier and try to add it here later.
@@ -0,0 +1,87 @@
---
title: Node Resource Balance Rescheduling
It would be better to use "descheduling" here instead of "rescheduling".
This issue has been automatically marked as stale because it has not had recent activity.
Ⅰ. Describe what this PR does
Propose Node Resource Balance Rescheduling
Ⅱ. Does this pull request fix one issue?
fixes #2332
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
V. Checklist