What's New

@JesseStutler

What's Changed

Automated cherry pick of #4422: Move kube-scheduler related metrics initialization to server.go to avoid panic by @JesseStutler in #4461
Automated cherry pick of #4473: fix node count reconcile by @Monokaix in #4488
[cherry-pick for 1.12]Fix incorrect definition of ReleaseNameEnvKey by @ouyangshengjia in #4490
[cherry-pick for 1.12]Fix the issue where SelectBestNode returns nil when plugin scores are negative by @guoqinwill in #4472
Automated cherry pick of #4487: Add missing capacity metrics in hierarchical queues by @JesseStutler in #4494
[Cherry-pick] Add bump version script; Make version release more automated by @JesseStutler in #4521
[Cherry-pick] fix: update podGroup when statefulSet update by @Poor12 in #4522
Automated: Bump version to v1.12.2 by @JesseStutler in #4518

Full Changelog: v1.12.1...v1.12.2

@Monokaix

What's Changed

Fix queue update conflicts when upgrading to new version by @Monokaix in #4336
Bump image to v1.12.1 by @Monokaix in #4337

Full Changelog: v1.12.0...v1.12.1

What's New

Welcome to the v1.12.0 release of Volcano! 🚀 🎉 📣
This release brings a number of significant enhancements that community users have long awaited.

Network Topology Aware Scheduling: Alpha Release

Volcano's network topology-aware scheduling, initially introduced as a preview in v1.11, has now reached its Alpha release in v1.12. This feature aims to optimize the deployment of AI tasks in large-scale training and inference scenarios, such as model parallel training and Leader-Worker inference. It achieves this by scheduling tasks within the same network topology performance domain, which reduces cross-switch communication and significantly enhances task efficiency. Volcano leverages the HyperNode CRD to abstract and represent heterogeneous hardware network topologies, supporting a hierarchical structure for simplified management.

Key features integrated in v1.12 include:

HyperNode Auto-Discovery: Volcano now offers automatic discovery of cluster network topologies. Users can configure the discovery type, and the system will automatically create and maintain hierarchical HyperNodes that reflect the actual cluster network topology. Currently, this supports InfiniBand (IB) networks by acquiring topology information via the UFM (Unified Fabric Manager) interface and automatically updating HyperNodes. Future plans include support for more network protocols like RoCE.
Prioritized HyperNode Selection:

This release introduces a scoring strategy based on both node-level and HyperNode-level evaluations, which are accumulated to determine the final HyperNode score.
- Node-level: It is recommended to configure the BinPack plugin to prioritize filling HyperNodes, thereby reducing resource fragmentation.
- HyperNode-level: Lower-level HyperNodes are preferred for better performance due to fewer cross-switch communications. For HyperNodes at the same level, those containing more tasks receive higher scores to reduce HyperNode-level resource fragmentation.
Support for Label Selector Node Matching:

HyperNode leaf nodes are associated with physical nodes in the cluster, supporting three matching strategies:
- Exact Match: Direct matching of node names.
- Regex Match: Matching node names using regular expressions.
- Label Match: Matching nodes via standard Label Selectors.
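
The three matching strategies can be illustrated with a small sketch. The `MemberSelector` type and its field names below are hypothetical and do not reproduce the HyperNode CRD schema. The sketch also assumes that a regex must match the whole node name and that label matching is plain key/value equality, whereas standard Label Selectors support richer expressions.

```go
package main

import (
	"fmt"
	"regexp"
)

// MemberSelector is a hypothetical, simplified stand-in for the selector on a
// HyperNode leaf member. Exactly one field is expected to be set.
type MemberSelector struct {
	ExactName   string            // exact match on the node name
	NamePattern string            // regular expression over the node name
	MatchLabels map[string]string // labels that must all be present on the node
}

// Matches reports whether a node with the given name and labels is selected.
func (s MemberSelector) Matches(name string, labels map[string]string) (bool, error) {
	switch {
	case s.ExactName != "":
		return name == s.ExactName, nil
	case s.NamePattern != "":
		// Assumption: the pattern must match the whole node name.
		re, err := regexp.Compile("^(?:" + s.NamePattern + ")$")
		if err != nil {
			return false, fmt.Errorf("invalid pattern %q: %w", s.NamePattern, err)
		}
		return re.MatchString(name), nil
	case len(s.MatchLabels) > 0:
		for k, want := range s.MatchLabels {
			if got, ok := labels[k]; !ok || got != want {
				return false, nil
			}
		}
		return true, nil
	}
	return false, nil // an empty selector selects nothing
}

func main() {
	// The label key below is made up for illustration.
	labels := map[string]string{"example.com/rack": "r1"}
	selectors := []MemberSelector{
		{ExactName: "node-1"},
		{NamePattern: "node-[0-9]+"},
		{MatchLabels: map[string]string{"example.com/rack": "r1"}},
	}
	for _, s := range selectors {
		ok, err := s.Matches("node-12", labels)
		fmt.Println(ok, err)
	}
}
```

Label matching is the most flexible of the three, since nodes that join the cluster with the right labels are picked up without changing the selector.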

Related Documentation:

Related PRs: (#3874, #3894, #3969, #3971, #4068, #4213, #3897, #3887, @ecosysbin, @weapons97, @Xu-Wentao, @penggu, @JesseStutler, @Monokaix)

Dynamic MIG Slicing for GPU Virtualization

Volcano's GPU virtualization feature now supports requesting partial GPU resources by memory and compute capacity. This, combined with Device Plugin integration, achieves hardware isolation and improves GPU utilization.

Traditional GPU virtualization restricts GPU usage by intercepting CUDA APIs (software solutions based on HAMi-core). The NVIDIA Ampere architecture introduced MIG (Multi-Instance GPU) technology, which allows a single physical GPU to be partitioned into multiple independent instances. However, general MIG solutions often fix instance sizes in advance, leading to resource waste and insufficient flexibility.

Volcano v1.12 provides dynamic MIG slicing and scheduling capabilities. It can select appropriate MIG instance sizes in real-time based on the user's requested GPU usage and employs a Best-Fit algorithm to minimize resource waste. It also supports GPU scoring strategies like BinPack and Spread to reduce resource fragmentation and enhance GPU utilization. Users can request resources using the unified volcano.sh/vgpu-number, volcano.sh/vgpu-cores, and volcano.sh/vgpu-memory APIs without needing to concern themselves with the underlying implementation.
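
As a rough illustration of the Best-Fit idea, the sketch below picks the smallest instance size that still satisfies a request. It is not Volcano's implementation: the `Profile` type, the `bestFit` function, the tie-breaking rule, and the units are assumptions, and the profile table is only an example of how sizes might be listed for one GPU model.

```go
package main

import "fmt"

// Profile describes one MIG instance size. The table below is illustrative;
// the profiles actually available depend on the GPU model.
type Profile struct {
	Name      string
	MemoryGiB int
	Slices    int // compute slices
}

var exampleProfiles = []Profile{
	{"1g.5gb", 5, 1},
	{"2g.10gb", 10, 2},
	{"3g.20gb", 20, 3},
	{"4g.20gb", 20, 4},
	{"7g.40gb", 40, 7},
}

// bestFit returns the profile that satisfies the request with the least
// memory, breaking ties by the fewest compute slices. The boolean is false
// when no profile is large enough.
func bestFit(profiles []Profile, memGiB, slices int) (Profile, bool) {
	var best Profile
	found := false
	for _, p := range profiles {
		if p.MemoryGiB < memGiB || p.Slices < slices {
			continue // too small for the request
		}
		if !found || p.MemoryGiB < best.MemoryGiB ||
			(p.MemoryGiB == best.MemoryGiB && p.Slices < best.Slices) {
			best, found = p, true
		}
	}
	return best, found
}

func main() {
	if p, ok := bestFit(exampleProfiles, 8, 1); ok {
		fmt.Println("8 GiB, 1 slice ->", p.Name)
	}
	if _, ok := bestFit(exampleProfiles, 64, 1); !ok {
		fmt.Println("64 GiB -> no profile fits")
	}
}
```

Choosing the tightest fit leaves the larger instance sizes free for later, bigger requests, which is how Best-Fit reduces waste.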

Related Documentation:

Related PRs: (#4290, #3953, @sailorvii, @archlitchi)

Dynamic Resource Allocation (DRA) Support

Kubernetes DRA (Dynamic Resource Allocation) is a built-in Kubernetes feature designed to provide a more flexible and powerful way to manage heterogeneous hardware resources in a cluster, such as GPUs, FPGAs, and high-performance network cards. It addresses the limitations of traditional Device Plugins in certain advanced scenarios, enabling device vendors and platform administrators to better declare, allocate, and share these hardware resources with Pods and containers.

Volcano v1.12 adds support for DRA. This feature allows the cluster to dynamically allocate and manage external resources, enhancing Volcano's integration with the Kubernetes ecosystem and its resource management flexibility.

Related Documentation:
Unified Scheduling with DRA

Related PR: (#3799, @JesseStutler)

Volcano Global Supports Queue Capacity Management

Queues are a fundamental concept in Volcano. To enable tenant quota management in multi-cluster and multi-tenant environments, Volcano v1.12 introduces enhanced global queue capacity management. Users can now centrally limit tenant resource usage across multiple clusters. The configuration remains consistent with single-cluster setups: tenant quotas are defined by setting the capability field within the queue configuration.

Related PR: volcano-sh/volcano-global#16 (@tanberBro)

Security Enhancements

The Volcano community consistently focuses on security. In v1.12, beyond fine-grained control over sensitive permissions like ClusterRole, we've addressed and fixed the following potential security risks:

HTTP Server Timeout Settings: Metric and Healthz endpoints for all Volcano components have been configured with server-side ReadHeader, Read, and Write timeouts. This prevents prolonged resource occupation.
- PR: #4208
Warning Logs for Skipping SSL Certificate Verification: When client requests set insecureSkipVerify to true, a warning log is now added. We strongly advise enabling SSL certificate verification in production environments.
- PR: #4211
Volcano Scheduler pprof Endpoint Disabled by Default: To prevent the disclosure of sensitive program information, the Profiling data port (used for troubleshooting) is now disabled by default.
- PR: #4173
Removal of Unnecessary File Permissions: Unnecessary execution permissions have been removed from Go source files to maintain minimal file permissions.
- PR: #4171
Security Context and Non-Root Execution for Containers: All Volcano components now run with non-root privileges. We've added seccompProfile, SELinuxOptions, and set allowPrivilegeEscalation to false to prevent container privilege escalation. Additionally, only necessary Linux Capabilities are retained, comprehensively limiting container permissions.
- PR: #4207
HTTP Request Response Body Size Limit: For HTTP requests sent by the Extender Plugin and Elastic Search Service, their response body size is now limited. This prevents excessive resource consumption that could lead to OOM (Out Of Memory) issues.
- Disclosure: GHSA-hg79-fw4p-25p8

Performance Improvements in Large-Scale Scenarios

Volcano continuously optimizes performance. Without affecting functionality, the new version removes some unnecessary Webhooks and disables others by default, improving performance in large-scale batch creation scenarios:

PodGroup Mutating Webhook Disabled by Default: When creating a PodGroup without specifying a queue, the system can now read from the Namespace to populate it. Since this scenario is uncommon, this Webhook is disabled by default. Users can enable it as needed.
Queue Status Validation Moved from Pod to PodGroup: When a queue is closed, task submission is disallowed. The original validation logic ran during Pod creation. Because Volcano's basic scheduling unit is the PodGroup, moving the validation to PodGroup creation is more logical. Because there are fewer PodGroups than Pods, this reduces Webhook calls, improving perfo...

@kevin-wangzefeng

Important:
This release addresses multiple critical security vulnerabilities. We strongly advise all users to upgrade immediately to protect your systems and data.

Security Fixes

[Cherry-pick 1.11] Add http response body size limit (#4252 @kevin-wangzefeng)
[Cherry-pick 1.11] Add security context configuration (#4245 @JesseStutler)
Remove the execute permission for some files, chmod to 644 (#4171 @JesseStutler)
add a switch to control whether enable pprof in scheduler (#4173 @JesseStutler)
Add warning msg when TLS verification disabled (#4211 @Monokaix)
Add http server timeout (#4208 @Monokaix)

Other Improvements

Bump image to v1.11.2 (#4232 @JesseStutler)
Fix: remove controller-manager metrics that should not be introduced (#4202 @dongjiang1989)
Filter useless logs in binpack (#4240 @XbaoWu)

Important Notes Before Upgrading

Change: Volcano Scheduler pprof Endpoint Disabled by Default
For security enhancement, the pprof endpoint for the Volcano Scheduler is now disabled by default in this release. If you require this endpoint for debugging or monitoring, you will need to explicitly enable it post-upgrade. This can be achieved by:

- If you use Helm, specifying custom.scheduler_pprof_enable=true during Helm installation or upgrade.
- Otherwise, manually setting the command-line argument --enable-pprof=true when starting the Volcano Scheduler.

Please be aware of the security implications before enabling this endpoint in production environments.

@JesseStutler

Important:
This release addresses multiple critical security vulnerabilities. We strongly advise all users to upgrade immediately to protect your systems and data.

Security Fixes

[Cherry-pick network-topology] Add http response body size limit (#4255 @JesseStutler)
[Cherry-pick network-topology] Add security context configuration (#4250 @JesseStutler)
Remove the execute permission for some files, chmod to 644 (#4171 @JesseStutler)
add a switch to control whether enable pprof in scheduler (#4173 @JesseStutler)
Add warning msg when TLS verification disabled (#4211 @Monokaix)
Add http server timeout (#4208 @Monokaix)

Other Improvements

Bump image to v1.11.0-network-topology-preview.3 (#4237 @JesseStutler)
Add NetworkTopology plugin score doc (#4213 @ecosysbin)
HyperNode supports select Nodes By labels (#4068 @ecosysbin)
Update ubuntu base image (#4197 @Monokaix)

Important Notes Before Upgrading

Change: Volcano Scheduler pprof Endpoint Disabled by Default
For security enhancement, the pprof endpoint for the Volcano Scheduler is now disabled by default in this release. If you require this endpoint for debugging or monitoring, you will need to explicitly enable it post-upgrade. This can be achieved by:

- If you use Helm, specifying custom.scheduler_pprof_enable=true during Helm installation or upgrade.
- Otherwise, manually setting the command-line argument --enable-pprof=true when starting the Volcano Scheduler.

Please be aware of the security implications before enabling this endpoint in production environments.

@kevin-wangzefeng

Important:
This release addresses multiple critical security vulnerabilities. We strongly advise all users to upgrade immediately to protect your systems and data.

Security Fixes

[Cherry-pick 1.10] Add http response body size limit (#4253 @kevin-wangzefeng)
[Cherry-pick 1.10] Add security context configuration (#4246 @JesseStutler)
Remove the execute permission for some files, chmod to 644 (#4171 @JesseStutler)
add a switch to control whether enable pprof in scheduler (#4173 @JesseStutler)
Add warning msg when TLS verification disabled (#4211 @Monokaix)
Add http server timeout (#4208 @Monokaix)

Other Improvements

Update ubuntu base image (#4194 @Monokaix)
Bump image to v1.10.2 (#4231 @JesseStutler)

Important Notes Before Upgrading

Change: Volcano Scheduler pprof Endpoint Disabled by Default
For security enhancement, the pprof endpoint for the Volcano Scheduler is now disabled by default in this release. If you require this endpoint for debugging or monitoring, you will need to explicitly enable it post-upgrade. This can be achieved by:

- If you use Helm, specifying custom.scheduler_pprof_enable=true during Helm installation or upgrade.
- Otherwise, manually setting the command-line argument --enable-pprof=true when starting the Volcano Scheduler.

Please be aware of the security implications before enabling this endpoint in production environments.

@JesseStutler

Important:
This release addresses multiple critical security vulnerabilities. We strongly advise all users to upgrade immediately to protect your systems and data.

Security Fixes

[Cherry-pick 1.9] Add http response body size limit (#4254 @JesseStutler)
[Cherry-pick 1.9] Add security context configuration (#4249 @Monokaix)
Remove the execute permission for some files, chmod to 644 (#4171 @JesseStutler)
add a switch to control whether enable pprof in scheduler (#4173 @JesseStutler)
Add warning msg when TLS verification disabled (#4211 @Monokaix)
Add http server timeout (#4208 @Monokaix)

Other Improvements

Bump image to v1.9.1 (#4230 @JesseStutler)
fix panic when get job's elastic resource (#4103 @lowang-bh)
change to action cache v4 (#4075 @Monokaix)
fix flaky test (#4121 @Monokaix)
Supports rollback when allocate callback function fails (#3780 @wangyang0616)
Supports rollback when allocate callback function fails (#3776 @wangyang0616)
fix pg controller create redundancy podGroup when schedulerName isn't matched (#3675 @liuyuanchun11)
Update Kubernetes compatibility (#3570 @Monokaix)
Fix podgroup not created (#3572 @liuyuanchun11)
update pod status when bind error (#3550 @bibibox)
Update NominatedNodeName for pipelined task (#3501 @bibibox)

Important Notes Before Upgrading

Change: Volcano Scheduler pprof Endpoint Disabled by Default
For security enhancement, the pprof endpoint for the Volcano Scheduler is now disabled by default in this release. If you require this endpoint for debugging or monitoring, you will need to explicitly enable it post-upgrade. This can be achieved by:

- If you use Helm, specifying custom.scheduler_pprof_enable=true during Helm installation or upgrade.
- Otherwise, manually setting the command-line argument --enable-pprof=true when starting the Volcano Scheduler.

Please be aware of the security implications before enabling this endpoint in production environments.

@Monokaix

What's Changed

[cherry-pick]change to action cache v4 by @Monokaix in #4074
[Cherry-pick network-topology] Replace queue status update by using ApplyStatus method & Bump image to v1.11.0-network-topology-preview.2 by @JesseStutler in #4153

Full Changelog: v1.11.0-network-topology-preview.0...v1.11.0-network-topology-preview.2

@Monokaix

What's Changed

[cherry-pick]change to action cache v4 by @Monokaix in #4075
[cherry-pick]fix creating a hierarchical sub-queue will be rejected by @zhutong196 in #4080
[cherry-pick] Fix jobflow status confusion problem by @dongjiang1989 in #4094
[cherry-pick] fix: the problem that PVC will be continuously created indefinitely by @ytcisme in #4144
[Cherry-pick v1.11] Replace queue status update by using ApplyStatus method & Bump image to v1.11.1 by @JesseStutler in #4155
[Cherry-pick v1.11] fix: remove lessPartly condition in reclaimable fn from capacity and proportion plugins by @JesseStutler in #4178

Full Changelog: v1.11.0...v1.11.1

@liuyuanchun11

What's Changed

[cherry-pick for release-1.10]fix job controller reports duplicate warnings by @liuyuanchun11 in #3755
[cherry-pick for release 1.10] Fix predicate return unexpected result by @bibibox in #3859
[cherry-pick for release-1.10]Supports rollback when allocate callback function fails by @wangyang0616 in #3864
change to action cache v4 by @Monokaix in #4096
[cherry-pick] Fix jobflow status confusion problem by @dongjiang1989 in #4095
[cherry-pick] fix panic when get job's elastic resource by @lowang-bh in #4103
Update release-1.10 api version to v1.10.1 and Bump image to v1.10.1 by @JesseStutler in #4154

Full Changelog: v1.10.0...v1.10.1

Releases: volcano-sh/volcano

v1.12.2
v1.12.1
v1.12.0
v1.11.2
v1.11.0-network-topology-preview.3
v1.10.2
v1.9.1
v1.11.0-network-topology-preview.2
v1.11.1
v1.10.1