Skip to content

Conversation

wdbaruni
Copy link
Member

@wdbaruni wdbaruni commented Jan 7, 2025

This PR introduces the foundation for partitioned execution support in bacalhau's orchestrator, allowing the scheduler to track and maintain partition assignments for parallel executions. This is the first step towards full partitioned execution support, focusing only on the orchestrator's ability to track partitions.

Changes

  • Added PartitionIndex field to Execution model to track which partition (0 to job.Count-1) an execution belongs to
  • Modified scheduler to:
    • Assign unique partition indices during scheduling
    • Maintain partition assignments across retries/rescheduling
    • Handle approval/rejection independently per partition
  • No changes to compute execution or engine - this purely adds orchestrator-side tracking

Feature Details

When a job specifies Count > 1, the scheduler will now:

  1. Create N partitions (0 to N-1) based on job.Count
  2. Track which execution belongs to which partition
  3. Ensure exactly one active execution per partition
  4. Preserve partition assignments during rescheduling/retries

Future Work

Next steps will include:

  1. Exposing partition information to compute nodes via environment variables
  2. Modifying engine to populate partition environment

Testing

Added comprehensive test coverage for partition tracking:

  1. Verified partition assignment and maintenance
  2. Tested per-partition scheduling decisions
  3. Validated failure recovery with partition preservation
  4. Confirmed batch vs service job partition handling

The PR sets up the foundation for partition-aware execution, which will be built upon in future PRs that modify the compute side.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for job execution partitioning across the scheduler.
    • Introduced partition index tracking for job executions.
  • Improvements

    • Enhanced job scheduling logic to handle parallel and partitioned job executions.
    • Improved execution management with more granular tracking of job states.
    • Updated test suites to validate new partitioning capabilities.
  • Documentation

    • Updated Swagger documentation to include the new PartitionIndex field in the execution model.

Copy link

linear bot commented Jan 7, 2025

Copy link
Contributor

coderabbitai bot commented Jan 7, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

The pull request introduces a comprehensive enhancement to the job scheduling system by adding support for partitioned executions. The changes span multiple files in the orchestrator scheduler package, focusing on adding a PartitionIndex field to the Execution struct and updating various test suites and utility methods to accommodate this new feature. The modifications enable more granular control over job execution, allowing jobs to be divided into multiple partitions with precise tracking of their states and progress.

Changes

File Change Summary
pkg/models/execution.go Added PartitionIndex field to Execution struct
pkg/orchestrator/scheduler/batch_job_test.go Replaced WithExecution with WithPartitionedExecution, updated test scenarios to reflect partitioning
pkg/orchestrator/scheduler/batch_service_job.go Updated scheduling logic to support partitioned executions, added methods for partition management
pkg/orchestrator/scheduler/batch_service_test.go Updated test cases to use NewExecutions with partition indices, added new test for partition retry
pkg/orchestrator/scheduler/daemon_job_test.go Replaced NewExecutionsNodes with NewExecutions, focusing on execution details
pkg/orchestrator/scheduler/ops_job_test.go Updated matcher parameters to use NewExecutions, enhancing execution representation
pkg/orchestrator/scheduler/service_job_test.go Replaced WithExecution with WithPartitionedExecution, updated test scenarios for partitioning
pkg/orchestrator/scheduler/types.go Introduced new methods for partition-based execution tracking and approval
pkg/orchestrator/scheduler/types_test.go Removed old tests, added new tests focusing on partition management
pkg/orchestrator/scheduler/utils_test.go Added WithPartitionedExecution function, updated matcher types to reflect new execution structure
pkg/swagger/docs.go Added PartitionIndex field to Swagger documentation for models.Execution
pkg/swagger/swagger.json Added PartitionIndex property to Swagger JSON schema for models.Execution
webui/lib/api/schema/swagger.json Added PartitionIndex property to Swagger JSON definition for models.Execution

Sequence Diagram

sequenceDiagram
    participant Job
    participant Scheduler
    participant Execution
    
    Job->>Scheduler: Request execution with partitions
    Scheduler->>Execution: Create executions with partition indices
    Scheduler->>Scheduler: Track executions per partition
    Scheduler->>Job: Report job progress and completion status
Loading

Possibly related PRs

Poem

🐰 Hop, hop, through partitions we go,
Scheduling jobs with a rhythmic flow
Each index a step, precise and neat
Executions dancing to a partition's beat
CodeRabbit's magic makes scheduling sweet! 🚀


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05c7b3c and ec2b183.

⛔ Files ignored due to path filters (1)
  • webui/lib/api/generated/types.gen.ts is excluded by !**/generated/**
📒 Files selected for processing (5)
  • pkg/models/execution.go (3 hunks)
  • pkg/orchestrator/scheduler/batch_job_test.go (10 hunks)
  • pkg/swagger/docs.go (1 hunks)
  • pkg/swagger/swagger.json (1 hunks)
  • webui/lib/api/schema/swagger.json (1 hunks)

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (6)
pkg/orchestrator/scheduler/types.go (1)

237-238: Consider prioritizing executions with highest rank

The TODO comment suggests that executions are approved based on their age (oldest first). To optimize scheduling decisions, consider approving executions based on their rank or priority, ensuring that the most suitable executions are selected.

Do you want me to provide a revised implementation that prioritizes executions based on their rank or other criteria?

pkg/orchestrator/scheduler/service_job_test.go (2)

52-54: Verify partition index uniqueness in test case

The test case has two executions with the same partition index (0), which is correct for testing duplicate partition handling. However, consider adding a comment explaining why this is intentional to prevent future confusion.

Add a comment above line 52:

+// Test handling of multiple executions for the same partition
 WithPartitionedExecution("node0", models.ExecutionStateAskForBid, 0),
 WithPartitionedExecution("node1", models.ExecutionStateBidAccepted, 0), // Same partition as first one

85-90: Consider adding test coverage for partition index boundary cases

The test suite would benefit from additional test cases covering edge cases for partition indices.

Consider adding test cases for:

  1. Invalid partition indices (negative or >= job.Count)
  2. Missing partition indices in a sequence
  3. Partition index preservation during rescheduling
pkg/orchestrator/scheduler/ops_job_test.go (1)

48-51: Document non-partitioned nature of ops jobs

The test correctly uses basic execution structure without partition indices for ops jobs. Consider adding a comment explaining why ops jobs don't use partitioning to prevent future confusion.

Add a comment above the NewExecutions declaration:

+// Ops jobs don't use partitioning as they run on all available nodes
 NewExecutions: []*models.Execution{
pkg/orchestrator/scheduler/batch_service_test.go (1)

216-238: Enhance documentation for partition preservation test

The test case effectively verifies partition index preservation during retries, but could benefit from more detailed documentation.

Add more detailed comments to explain the test scenario:

 func (s *BatchServiceJobSchedulerTestSuite) TestProcess_ShouldPreservePartitionOnRetry() {
-    // Test that failed partition is retried and maintains same index
+    // This test verifies that when an execution fails:
+    // 1. The scheduler attempts to retry the failed execution
+    // 2. The retry maintains the same partition index as the failed execution
+    // 3. Other running partitions are unaffected
+    // This ensures partition consistency across retries
     scenario := NewScenario(
pkg/orchestrator/scheduler/types_test.go (1)

213-231: Add edge case tests for partition grouping.

While the basic grouping functionality is well-tested, consider adding tests for:

  • Empty execution set
  • Single partition with multiple executions
  • Executions with invalid partition indices
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6384d4a and 05c7b3c.

📒 Files selected for processing (10)
  • pkg/models/execution.go (1 hunks)
  • pkg/orchestrator/scheduler/batch_job_test.go (10 hunks)
  • pkg/orchestrator/scheduler/batch_service_job.go (8 hunks)
  • pkg/orchestrator/scheduler/batch_service_test.go (7 hunks)
  • pkg/orchestrator/scheduler/daemon_job_test.go (2 hunks)
  • pkg/orchestrator/scheduler/ops_job_test.go (2 hunks)
  • pkg/orchestrator/scheduler/service_job_test.go (7 hunks)
  • pkg/orchestrator/scheduler/types.go (2 hunks)
  • pkg/orchestrator/scheduler/types_test.go (3 hunks)
  • pkg/orchestrator/scheduler/utils_test.go (4 hunks)
🔇 Additional comments (20)
pkg/orchestrator/scheduler/types.go (6)

94-103: LGTM!

The groupByState() method correctly groups executions by their ComputeState.StateType, which is essential for efficient execution management based on state.


111-113: LGTM!

The filterCompleted() method accurately filters executions to include only those in the ExecutionStateCompleted state.


145-156: Well-structured grouping by partition index

The groupByPartition() method effectively groups executions by their PartitionIndex, enabling independent operations on each partition and ensuring correct scheduling behavior.


159-170: Correct identification of completed partitions

The completedPartitions() method efficiently identifies partitions that have at least one execution in the ExecutionStateCompleted state, which is crucial for tracking batch job progress.


172-184: Effective tracking of used partitions

The usedPartitions() method accurately maps partitions with active (non-discarded) executions, supporting the retry mechanism by identifying partitions that are unavailable for new executions.


186-201: Accurate computation of remaining partitions

The remainingPartitions() method correctly identifies partitions that require new executions by listing partition indices with no active or completed executions, facilitating effective scheduling and retries.

pkg/orchestrator/scheduler/batch_service_job.go (4)

29-51: Excellent documentation of partitioning logic

The added comments thoroughly explain the partitioning logic, execution lifecycle per partition, and examples. This enhances code readability and maintainability by providing clear guidance on how partitioned jobs are scheduled and managed.


191-206: Robust approval and rejection logic per partition

The approveRejectExecs() method correctly processes executions on a per-partition basis, ensuring that only one active execution exists per partition at any time. This aligns with the partitioning logic and helps maintain consistency in execution management.


Line range hint 208-260: Efficient creation of missing executions

The createMissingExecs() method effectively handles the creation of new executions for partitions needing them, accounting for initial scheduling and retries after failures. It also manages scenarios with insufficient matching nodes by scheduling delayed evaluations.


285-298: Accurate job completion check for batch jobs

The isJobComplete() method correctly determines job completion by verifying that all partitions have completed successfully. This ensures that batch jobs are only marked as completed when all required executions have finished.

pkg/orchestrator/scheduler/daemon_job_test.go (1)

41-47: Correct transition to detailed execution objects in tests

Updating the tests to use NewExecutions with detailed Execution objects enhances test specificity and aligns with the new execution structure. This ensures that tests accurately reflect the code changes and support the new partitioning logic.

pkg/orchestrator/scheduler/batch_service_test.go (1)

134-136: LGTM: Well-structured partition assignment

The test correctly verifies sequential partition index assignment (0, 1) for new executions in a queued state.

pkg/orchestrator/scheduler/utils_test.go (4)

44-52: LGTM! Well-structured test helper function.

The WithPartitionedExecution function is a clean addition that properly supports the new partitioning feature. It maintains consistency with the existing WithExecution pattern while adding partition support.


143-149: LGTM! Clean struct refactoring.

The consolidation of execution-related fields into a single NewExecutions slice in both PlanMatcher and PlanMatcherParams improves code maintainability and readability.

Also applies to: 153-158


189-216: Verify partition index validation logic.

The partition index validation in the Matches method looks good, but let's ensure we're testing edge cases.

Consider adding test cases for:

  • Invalid partition indices (negative values)
  • Out-of-range partition indices
  • Duplicate partition indices

278-279: LGTM! Updated string representation.

The String method has been properly updated to reflect the new structure.

pkg/orchestrator/scheduler/types_test.go (4)

120-183: LGTM! Comprehensive approval status test cases.

The test cases cover various scenarios including completed, running, pending executions, and mixed states. Good use of subtests for better organization.


233-248: LGTM! Clear partition completion tests.

The test effectively verifies the completion status across different partitions.


250-266: Verify partition usage state transitions.

The test covers the basic states well, but we should verify state transitions don't cause issues.

Add test cases for:

  • Transitions between states (e.g., Running -> Completed)
  • Multiple state changes within the same partition

268-287: LGTM! Thorough remaining partitions test.

Good coverage of both exact count and overflow scenarios. The test effectively verifies the calculation of remaining partitions.

wdbaruni and others added 4 commits January 8, 2025 10:31
@wdbaruni wdbaruni merged commit 559c036 into main Jan 8, 2025
13 checks passed
@wdbaruni wdbaruni deleted the eng-518-schedule-at-partition-level branch January 8, 2025 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

1 participant