# Add Partition Support in Job Orchestrator #4800
> **Caution**: Review failed. The pull request is closed.

## Walkthrough

The pull request introduces a comprehensive enhancement to the job scheduling system by adding support for partitioned executions. The changes span multiple files in the orchestrator scheduler package, focusing on adding a partition index to executions.
## Sequence Diagram

```mermaid
sequenceDiagram
participant Job
participant Scheduler
participant Execution
Job->>Scheduler: Request execution with partitions
Scheduler->>Execution: Create executions with partition indices
Scheduler->>Scheduler: Track executions per partition
Scheduler->>Job: Report job progress and completion status
```
Actionable comments posted: 2
🧹 Nitpick comments (6)
pkg/orchestrator/scheduler/types.go (1)
`237-238`: **Consider prioritizing executions with highest rank**

The TODO comment suggests that executions are approved based on their age (oldest first). To optimize scheduling decisions, consider approving executions based on their rank or priority, ensuring that the most suitable executions are selected.
Do you want me to provide a revised implementation that prioritizes executions based on their rank or other criteria?
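The rank-based prioritization suggested above might look roughly like the sketch below. The `execution` struct, its `Rank` field, and `approvalOrder` are illustrative stand-ins, not the scheduler's actual types:

```go
package main

import (
	"fmt"
	"sort"
)

// Minimal stand-in for the scheduler's execution type; the real types in
// pkg/orchestrator/scheduler differ from this.
type execution struct {
	ID         string
	Rank       int   // hypothetical node-match rank
	CreateTime int64 // age-based ordering used today
}

// approvalOrder sorts candidates by rank (highest first), keeping the
// current oldest-first behavior as a tie-breaker.
func approvalOrder(execs []execution) []execution {
	out := append([]execution(nil), execs...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Rank != out[j].Rank {
			return out[i].Rank > out[j].Rank
		}
		return out[i].CreateTime < out[j].CreateTime
	})
	return out
}

func main() {
	execs := []execution{
		{ID: "e1", Rank: 1, CreateTime: 100},
		{ID: "e2", Rank: 3, CreateTime: 200},
		{ID: "e3", Rank: 3, CreateTime: 150},
	}
	for _, e := range approvalOrder(execs) {
		fmt.Println(e.ID) // e3, e2, e1
	}
}
```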
pkg/orchestrator/scheduler/service_job_test.go (2)
`52-54`: **Verify partition index uniqueness in test case**

The test case has two executions with the same partition index (0), which is correct for testing duplicate partition handling. However, consider adding a comment explaining why this is intentional to prevent future confusion.
Add a comment above line 52:
```diff
+// Test handling of multiple executions for the same partition
 WithPartitionedExecution("node0", models.ExecutionStateAskForBid, 0),
 WithPartitionedExecution("node1", models.ExecutionStateBidAccepted, 0), // Same partition as first one
```
`85-90`: **Consider adding test coverage for partition index boundary cases**

The test suite would benefit from additional test cases covering edge cases for partition indices.
Consider adding test cases for:
- Invalid partition indices (negative or >= job.Count)
- Missing partition indices in a sequence
- Partition index preservation during rescheduling
pkg/orchestrator/scheduler/ops_job_test.go (1)
`48-51`: **Document non-partitioned nature of ops jobs**

The test correctly uses basic execution structure without partition indices for ops jobs. Consider adding a comment explaining why ops jobs don't use partitioning to prevent future confusion.
Add a comment above the NewExecutions declaration:
```diff
+// Ops jobs don't use partitioning as they run on all available nodes
 NewExecutions: []*models.Execution{
```
pkg/orchestrator/scheduler/batch_service_test.go (1)
`216-238`: **Enhance documentation for partition preservation test**

The test case effectively verifies partition index preservation during retries, but could benefit from more detailed documentation.
Add more detailed comments to explain the test scenario:
```diff
 func (s *BatchServiceJobSchedulerTestSuite) TestProcess_ShouldPreservePartitionOnRetry() {
-	// Test that failed partition is retried and maintains same index
+	// This test verifies that when an execution fails:
+	// 1. The scheduler attempts to retry the failed execution
+	// 2. The retry maintains the same partition index as the failed execution
+	// 3. Other running partitions are unaffected
+	// This ensures partition consistency across retries
 	scenario := NewScenario(
```

pkg/orchestrator/scheduler/types_test.go (1)
`213-231`: **Add edge case tests for partition grouping**

While the basic grouping functionality is well-tested, consider adding tests for:
- Empty execution set
- Single partition with multiple executions
- Executions with invalid partition indices
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)

- pkg/models/execution.go (1 hunks)
- pkg/orchestrator/scheduler/batch_job_test.go (10 hunks)
- pkg/orchestrator/scheduler/batch_service_job.go (8 hunks)
- pkg/orchestrator/scheduler/batch_service_test.go (7 hunks)
- pkg/orchestrator/scheduler/daemon_job_test.go (2 hunks)
- pkg/orchestrator/scheduler/ops_job_test.go (2 hunks)
- pkg/orchestrator/scheduler/service_job_test.go (7 hunks)
- pkg/orchestrator/scheduler/types.go (2 hunks)
- pkg/orchestrator/scheduler/types_test.go (3 hunks)
- pkg/orchestrator/scheduler/utils_test.go (4 hunks)
🔇 Additional comments (20)
pkg/orchestrator/scheduler/types.go (6)
`94-103`: **LGTM!**

The `groupByState()` method correctly groups executions by their `ComputeState.StateType`, which is essential for efficient execution management based on state.
`111-113`: **LGTM!**

The `filterCompleted()` method accurately filters executions to include only those in the `ExecutionStateCompleted` state.
`145-156`: **Well-structured grouping by partition index**

The `groupByPartition()` method effectively groups executions by their `PartitionIndex`, enabling independent operations on each partition and ensuring correct scheduling behavior.
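A minimal sketch of this grouping idea, using a simplified stand-in for the real `Execution` model in pkg/models:

```go
package main

import "fmt"

// Simplified stand-in for the scheduler's execution type.
type execution struct {
	ID             string
	PartitionIndex int
}

// groupByPartition buckets executions by PartitionIndex so that each
// partition can be scheduled and retried independently. This is a sketch
// of the behavior described in the review, not the actual implementation.
func groupByPartition(execs []execution) map[int][]execution {
	groups := make(map[int][]execution)
	for _, e := range execs {
		groups[e.PartitionIndex] = append(groups[e.PartitionIndex], e)
	}
	return groups
}

func main() {
	execs := []execution{
		{ID: "e1", PartitionIndex: 0},
		{ID: "e2", PartitionIndex: 1},
		{ID: "e3", PartitionIndex: 0},
	}
	groups := groupByPartition(execs)
	fmt.Println(len(groups), len(groups[0]), len(groups[1])) // 2 2 1
}
```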
`159-170`: **Correct identification of completed partitions**

The `completedPartitions()` method efficiently identifies partitions that have at least one execution in the `ExecutionStateCompleted` state, which is crucial for tracking batch job progress.
`172-184`: **Effective tracking of used partitions**

The `usedPartitions()` method accurately maps partitions with active (non-discarded) executions, supporting the retry mechanism by identifying partitions that are unavailable for new executions.
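Taken together, this partition bookkeeping (completed, used, remaining, and the resulting job-completion check) can be sketched roughly as below. All names, states, and signatures here are simplified stand-ins for illustration, not the project's actual code:

```go
package main

import "fmt"

type state int

const (
	stateRunning state = iota // active (non-discarded)
	stateCompleted
	stateFailed // discarded
)

type execution struct {
	PartitionIndex int
	State          state
}

// completedPartitions: partitions with at least one completed execution.
func completedPartitions(execs []execution) map[int]bool {
	done := make(map[int]bool)
	for _, e := range execs {
		if e.State == stateCompleted {
			done[e.PartitionIndex] = true
		}
	}
	return done
}

// usedPartitions: partitions holding an active (non-discarded) execution.
func usedPartitions(execs []execution) map[int]bool {
	used := make(map[int]bool)
	for _, e := range execs {
		if e.State == stateRunning {
			used[e.PartitionIndex] = true
		}
	}
	return used
}

// remainingPartitions: indices in [0, count) with no active or completed
// execution; these need new executions (first schedule or retry).
func remainingPartitions(execs []execution, count int) []int {
	done := completedPartitions(execs)
	used := usedPartitions(execs)
	var remaining []int
	for i := 0; i < count; i++ {
		if !done[i] && !used[i] {
			remaining = append(remaining, i)
		}
	}
	return remaining
}

// isJobComplete for a batch job: every partition has completed.
func isJobComplete(execs []execution, count int) bool {
	return len(completedPartitions(execs)) == count
}

func main() {
	execs := []execution{
		{PartitionIndex: 0, State: stateCompleted},
		{PartitionIndex: 1, State: stateRunning},
		{PartitionIndex: 2, State: stateFailed},
	}
	fmt.Println(remainingPartitions(execs, 3)) // [2]
	fmt.Println(isJobComplete(execs, 3))       // false
}
```

The design point the review highlights is that a failed (discarded) execution frees its partition, so the partition reappears in the remaining set and can be retried without disturbing the others.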
`186-201`: **Accurate computation of remaining partitions**

The `remainingPartitions()` method correctly identifies partitions that require new executions by listing partition indices with no active or completed executions, facilitating effective scheduling and retries.

pkg/orchestrator/scheduler/batch_service_job.go (4)
`29-51`: **Excellent documentation of partitioning logic**

The added comments thoroughly explain the partitioning logic, execution lifecycle per partition, and examples. This enhances code readability and maintainability by providing clear guidance on how partitioned jobs are scheduled and managed.
`191-206`: **Robust approval and rejection logic per partition**

The `approveRejectExecs()` method correctly processes executions on a per-partition basis, ensuring that only one active execution exists per partition at any time. This aligns with the partitioning logic and helps maintain consistency in execution management.
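The one-active-execution-per-partition invariant might be sketched like this. The selection criterion (oldest first) and all names are illustrative assumptions, not the actual `approveRejectExecs()` implementation:

```go
package main

import (
	"fmt"
	"sort"
)

type execution struct {
	ID             string
	PartitionIndex int
	CreateTime     int64
}

// approveRejectPerPartition groups executions by partition and, within
// each group, approves a single execution (here the oldest) while
// rejecting the rest, so that only one execution stays active per
// partition.
func approveRejectPerPartition(execs []execution) (approved, rejected []execution) {
	byPartition := make(map[int][]execution)
	for _, e := range execs {
		byPartition[e.PartitionIndex] = append(byPartition[e.PartitionIndex], e)
	}
	for _, group := range byPartition {
		sort.Slice(group, func(i, j int) bool { return group[i].CreateTime < group[j].CreateTime })
		approved = append(approved, group[0])
		rejected = append(rejected, group[1:]...)
	}
	return approved, rejected
}

func main() {
	execs := []execution{
		{ID: "e1", PartitionIndex: 0, CreateTime: 1},
		{ID: "e2", PartitionIndex: 0, CreateTime: 2}, // duplicate bid for partition 0
		{ID: "e3", PartitionIndex: 1, CreateTime: 3},
	}
	approved, rejected := approveRejectPerPartition(execs)
	fmt.Println(len(approved), len(rejected)) // 2 1
}
```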
Line range hint `208-260`: **Efficient creation of missing executions**

The `createMissingExecs()` method effectively handles the creation of new executions for partitions needing them, accounting for initial scheduling and retries after failures. It also manages scenarios with insufficient matching nodes by scheduling delayed evaluations.
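A rough sketch of that create-missing-executions flow, with the shortfall that would trigger a delayed evaluation represented as a returned count. The function shape and names are assumptions for illustration only:

```go
package main

import "fmt"

type execution struct {
	NodeID         string
	PartitionIndex int
}

// createMissingExecs assigns one new execution per remaining partition to
// the ranked candidate nodes. If there are fewer matching nodes than
// remaining partitions, the leftover partitions are reported as deferred,
// standing in for the scheduler's delayed re-evaluation.
func createMissingExecs(remaining []int, rankedNodes []string) (created []execution, deferred int) {
	for i, p := range remaining {
		if i >= len(rankedNodes) {
			return created, len(remaining) - i
		}
		created = append(created, execution{NodeID: rankedNodes[i], PartitionIndex: p})
	}
	return created, 0
}

func main() {
	// Partitions 1, 3, 4 need executions but only two nodes match.
	created, deferred := createMissingExecs([]int{1, 3, 4}, []string{"node0", "node1"})
	fmt.Println(len(created), deferred) // 2 1
	for _, e := range created {
		fmt.Println(e.NodeID, e.PartitionIndex)
	}
}
```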
`285-298`: **Accurate job completion check for batch jobs**

The `isJobComplete()` method correctly determines job completion by verifying that all partitions have completed successfully. This ensures that batch jobs are only marked as completed when all required executions have finished.

pkg/orchestrator/scheduler/daemon_job_test.go (1)
`41-47`: **Correct transition to detailed execution objects in tests**

Updating the tests to use `NewExecutions` with detailed `Execution` objects enhances test specificity and aligns with the new execution structure. This ensures that tests accurately reflect the code changes and support the new partitioning logic.

pkg/orchestrator/scheduler/batch_service_test.go (1)
`134-136`: **LGTM: Well-structured partition assignment**

The test correctly verifies sequential partition index assignment (0, 1) for new executions in a queued state.
pkg/orchestrator/scheduler/utils_test.go (4)
`44-52`: **LGTM! Well-structured test helper function.**

The `WithPartitionedExecution` function is a clean addition that properly supports the new partitioning feature. It maintains consistency with the existing `WithExecution` pattern while adding partition support.
`143-149`: **LGTM! Clean struct refactoring.**

The consolidation of execution-related fields into a single `NewExecutions` slice in both `PlanMatcher` and `PlanMatcherParams` improves code maintainability and readability.

Also applies to: 153-158
`189-216`: **Verify partition index validation logic.**

The partition index validation in the `Matches` method looks good, but let's ensure we're testing edge cases. Consider adding test cases for:
- Invalid partition indices (negative values)
- Out-of-range partition indices
- Duplicate partition indices
`278-279`: **LGTM! Updated string representation.**

The `String` method has been properly updated to reflect the new structure.

pkg/orchestrator/scheduler/types_test.go (4)
`120-183`: **LGTM! Comprehensive approval status test cases.**

The test cases cover various scenarios including completed, running, pending executions, and mixed states. Good use of subtests for better organization.
`233-248`: **LGTM! Clear partition completion tests.**

The test effectively verifies the completion status across different partitions.
`250-266`: **Verify partition usage state transitions.**

The test covers the basic states well, but we should verify state transitions don't cause issues.
Add test cases for:
- Transitions between states (e.g., Running -> Completed)
- Multiple state changes within the same partition
`268-287`: **LGTM! Thorough remaining partitions test.**

Good coverage of both exact count and overflow scenarios. The test effectively verifies the calculation of remaining partitions.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
This PR introduces the foundation for partitioned execution support in bacalhau's orchestrator, allowing the scheduler to track and maintain partition assignments for parallel executions. This is the first step towards full partitioned execution support, focusing only on the orchestrator's ability to track partitions.
Changes
- Added a `PartitionIndex` field to the `Execution` model to track which partition (0 to job.Count-1) an execution belongs to.

Feature Details

When a job specifies Count > 1, the scheduler will now:
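As a rough illustration of the tracking this PR introduces (not the actual bacalhau code: the trimmed-down `Execution` struct and the `newPartitionedExecutions` helper are hypothetical), the orchestrator can now hand out one partition index per execution:

```go
package main

import "fmt"

// Hypothetical, trimmed-down view of the Execution model; the real struct
// in pkg/models/execution.go has many more fields.
type Execution struct {
	ID             string
	PartitionIndex int // which partition (0 to job.Count-1) this execution serves
}

// newPartitionedExecutions creates one execution per partition for a job
// with the given Count, mirroring the initial scheduling pass.
func newPartitionedExecutions(jobID string, count int) []Execution {
	execs := make([]Execution, count)
	for i := range execs {
		execs[i] = Execution{ID: fmt.Sprintf("%s-exec-%d", jobID, i), PartitionIndex: i}
	}
	return execs
}

func main() {
	for _, e := range newPartitionedExecutions("job-1", 3) {
		fmt.Println(e.ID, e.PartitionIndex)
	}
}
```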
Future Work
Next steps will include:
Testing
Added comprehensive test coverage for partition tracking:
The PR sets up the foundation for partition-aware execution, which will be built upon in future PRs that modify the compute side.
Summary by CodeRabbit
Release Notes
New Features
Improvements
Documentation
`PartitionIndex` field in the execution model.