Skip to content

Conversation

AhmedSoliman
Copy link
Contributor

@AhmedSoliman AhmedSoliman commented Jul 30, 2025

This PR removes the dedicated ingress runtime and allows ingress to share the runtime with default. This is a step towards reducing the number of runtimes, threads, and metric dimensions. In my testing, the impact of this change is negligible and in all cases we'd be improving poll latencies in default runtime by avoiding sync IO operations from log-server and metadata server soon.


Stack created with Sapling. Best reviewed with ReviewStack.

Copy link

github-actions bot commented Jul 30, 2025

Test Results

  7 files  ±0    7 suites  ±0   5m 1s ⏱️ -2s
 54 tests ±0   53 ✅ ±0  1 💤 ±0  0 ❌ ±0 
223 runs  ±0  220 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 0bfc520. ± Comparison against base commit a96055d.

♻️ This comment has been updated with latest results.

Cleans up worker tasks and adds a new feature for task center to get a TaskGuard for unmanaged tasks that should be cancelled on drop.
This also waits for worker shutdown explicitly before shutting down the rest of the system to allow partitions to drain cleanly with potential improvement in a follow-up PR that would let rocksdb flush to happen once PartitionStoreManager is dropped.

Bonus: This includes a minor fix to let PPM cancel processors while they are still starting up, if they are still opening partition store or mid-initialization delay.
Durability tracker shouldn't hold a strong reference for PartitionStoreManager. It's a minor step towards the worker taking full control over PSM.
Should fix the flaky three-node trim test
- _Actually_ cancel in-flight snapshots on worker shutdown.
- Fix error logging that didn't show the underlying error in log messages and make it stylistically consistent with the other errors.
This PR removes the dedicated ingress runtime and allows ingress to share the runtime with `default`. This is a step towards reducing the number of runtimes, threads, and metric dimensions. In my testing, the impact of this change is negligible and in all cases we'd be improving poll latencies in default runtime by avoiding sync IO operations from log-server and metadata server soon.
Copy link
Contributor

@pcholakov pcholakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to wait for Till's review, but FWIW, if this has marginal benefit then it looks good to go! Thanks @AhmedSoliman

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 for merging :-)

@AhmedSoliman AhmedSoliman merged commit 0bfc520 into main Jul 31, 2025
53 checks passed
@AhmedSoliman AhmedSoliman deleted the pr3607 branch July 31, 2025 21:43
@github-actions github-actions bot locked and limited conversation to collaborators Jul 31, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants