-
Notifications
You must be signed in to change notification settings - Fork 571
feat: add dynamo components for sglang #1721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
examples/sglang/utils/sgl_utils.py (1)
44-54
: LGTM! Clean utility function implementation.The function follows the same pattern as the existing
parse_sglang_args
function and correctly handles argument parsing with bootstrap port injection. Consider adding a docstring to document the purpose and parameters for consistency with coding standards.examples/sglang/components/worker_inc.py (3)
131-132
: Use cryptographically secure random for bootstrap room generation.The current implementation uses
random.randint()
which is not cryptographically secure. For distributed systems communication, consider using a more secure random generator.Apply this diff to use secure random:
+import secrets + def _generate_bootstrap_room(self): - return random.randint(0, 2**63 - 1) + return secrets.randbits(63)
77-92
: Document the temporary nature of placeholder metrics more clearly.The placeholder metrics implementation should be more clearly documented as temporary to prevent confusion.
Consider updating the warning message to be more explicit:
logging.warning( - "Publishing placeholder metrics in SGLangWorker; these are NOT real engine metrics yet and will be replaced once upstream support lands." + "TEMPORARY: Publishing random placeholder metrics. These will be replaced with real engine metrics once sgl-project/sglang#6721 is merged. Do not rely on these values for monitoring or scaling decisions." )
226-226
: Fix formatting: remove whitespace before colon.Apply this diff to fix the formatting issue:
- new_tokens = data["output_ids"][num_output_tokens_so_far[index] :] + new_tokens = data["output_ids"][num_output_tokens_so_far[index]:]
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/sglang/components/decode_worker_inc.py
(1 hunks)examples/sglang/components/worker_inc.py
(1 hunks)examples/sglang/utils/sgl_utils.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: t-ob
PR: ai-dynamo/dynamo#1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang `async_encode` method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.
examples/sglang/components/decode_worker_inc.py (1)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.916Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
examples/sglang/components/worker_inc.py (2)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.916Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.
🪛 Pylint (3.3.7)
examples/sglang/components/decode_worker_inc.py
[refactor] 34-34: Too few public methods (1/2)
(R0903)
[error] 79-79: No value for argument 'runtime' in function call
(E1120)
examples/sglang/components/worker_inc.py
[error] 298-298: No value for argument 'runtime' in function call
(E1120)
🪛 Flake8 (7.2.0)
examples/sglang/components/worker_inc.py
[error] 226-226: whitespace before ':'
(E203)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (2)
examples/sglang/components/decode_worker_inc.py (1)
57-79
: Well-structured worker implementation.The worker function and initialization logic are correctly implemented. The static analysis hint about missing
runtime
argument is a false positive - the@dynamo_worker
decorator automatically injects theruntime
parameter as confirmed by the retrieved learnings.examples/sglang/components/worker_inc.py (1)
249-298
: Well-architected worker implementation with proper component separation.The worker initialization and main execution logic are well-structured. The static analysis hint about missing
runtime
argument is a false positive - the@dynamo_worker
decorator automatically injects theruntime
parameter based on retrieved learnings. The separation between prefill and decode workers provides good modularity for disaggregated inference.
Yea but if you want this then use dynamo deploy. We are separating concerns here. If you want full deployment help, then use the right tool. Here we are showing examples of how to run using dynamo serve, where yes, the user has to necessarily be more involved. At the dynamo serve level, the contract is the user must manage the processes. At this level we can't have resource management and stuff. What if the user wants to put dynamo serve into a slurm script and now they have to think about srun resource management and the implicit built in resource management. That is too confusing. Its a purposeful regression. We are taking away features and letting users have more control in the way that they are accustomed too. Any user who has used SSH and GPUs can manage these examples. If they want full service deployment where they fill out a config and one click deploy then they can move to dynamo deploy where under the hood it will basically just do this work for them. The regression you are seeing is because dynamo deploy isn't built up to have the features yet. |
I see. Does dynamo deploy only works for k8s or it also works for bare metal? |
At first I was against it working for bare metal. My opinion is we should be focused on K8s and that is where our effort should go. For bare metal we can let people manage it with scripts. However, I was thinking about this last night and I think I would be ok with doing it for bare metal under these conditions:
At that point why not just set up a minkube cluster on your bare metal though... In the end that prioritization of work will be set by Maksim and team. |
@tedzhouhk - this now allows us to get CI properly wired up without having to work across dynamo serve/local orchestration barriers. A python script is much easier to add in (and tear down) |
Description
In order to improve UX and benchmarking efforts, these introduce a set of clean components that leverage the dynamo rust bindings. There is no more reliance on
circusd
,sdk
or any local orchestration.TODO
export PYTHONPATH=$PYTHONPATH:/workspace/examples/sglang/utils
run in container even though I added?Changes
dynamo-serve
used to simply spin each worker in a process. This is now done in a dead simple bash script for all examples. No UX loss (in fact its much much much simpler).depends
or@service
or@endpoint
Examples
Single node examples can be run using either
or
NOTE: the
./launch_agg.sh
just runs the lines above 😁Multinode is where this PR really shines. You can now just export NATS/ETCD and then run
python3
on each nodeTests