-
Notifications
You must be signed in to change notification settings - Fork 580
build: Add support for sglang runtime image #1770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ate NIXL in runtime image
WalkthroughThe Dockerfile for the Changes
Sequence Diagram(s)sequenceDiagram
participant Builder as Build Stage
participant Runtime as Runtime Stage
Builder->>Builder: Build Python wheels (nixl, ai-dynamo, ai-dynamo-runtime)
Builder->>Builder: Install sglang in editable mode
Builder->>Builder: Copy NATS, etcd, UCX, NIXL binaries/libs
Builder->>Runtime: Copy built wheels and binaries/libs
Runtime->>Runtime: Install system packages (build-essential, libnuma-dev, python3-dev)
Runtime->>Runtime: Install Python wheels from wheelhouse
Runtime->>Runtime: Install additional Python packages (sglang, einops, sgl-kernel, sentencepiece)
Runtime->>Runtime: Set environment variables (plugin dirs, library paths, PYTHONPATH)
Runtime->>Runtime: Set entrypoint to nvidia_entrypoint.sh
Possibly related PRs
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (6)
container/Dockerfile.sglang (6)
130-134
: Build wheel inwheel_builder
once, then COPY – don't re-compile inbase
The wheel is already needed later in the
wheel_builder
stage.
Re-building it here lengthens the build, duplicates Rust/Python compile time and keeps heavy build deps inbase
. Move the two commands below into thewheel_builder
stage and simplyCOPY
the wheel intobase
(andruntime
).-# Install NIXL Python module -RUN cd /opt/nixl && uv build . --out-dir /workspace/wheels/nixl -# Install the wheel -# TODO: Move NIXL wheel install to the wheel_builder stage -RUN uv pip install /workspace/wheels/nixl/*.whl +# NIXL wheel will be produced in wheel_builder and copied in
401-405
: Heavy build tool-chain left inside runtime image
build-essential
andpython3-dev
add ~150 MB; they are only required to compile wheels.
Because wheels are pre-built inwheel_builder
, drop them here to slim the 11 GB image, or install them with--no-install-recommends
and remove immediately afteruv pip install …
to keep only the shared libs actually needed at runtime.
415-420
: Over-broad package pins inflate image and may overwrite transitive deps
uv pip install einops
andsentencepiece
pull the latest versions each build.
Freeze to known-good versions (ideally viarequirements.runtime.txt
) to avoid silent breakage and improve layer caching.
430-434
: Copying entire repo into runtime bloats layer & leaks build secrets
COPY . /workspace
brings git history, CI config, tests, etc. into the runtime image.
Restrict the copy to what the service truly needs (e.g.,examples/sglang
,deploy/sdk/src
) or remove.git
, docs, and CI artefacts afterwards.
410-413
: Symlinking every venv binary can shadow system tools
ln -sf $VIRTUAL_ENV/bin/* /usr/local/bin/
may replacepython
,pip
, or libc-provided binaries, making debugging harder. Link only the CLI entry-points you need (dynamo-run
,llmctl
, …).
384-389
: Cache NATS/etcd layers or use distro packagesDownloading tarballs every build burns cache. Consider:
- Installing
nats-server
&etcd
viaapt
(already available in Ubuntu 24.04), or- Adding
--mount=type=cache,target=/var/cache/apt
to preserve the downloaded.deb
/.tar.gz
.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
container/Dockerfile.sglang
(3 hunks)
🧰 Additional context used
🧠 Learnings (1)
container/Dockerfile.sglang (1)
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1770/merge) by nv-tusharma.
container/Dockerfile.sglang
[error] 412-412: Pre-commit hook 'trailing-whitespace' failed. Trailing whitespace was fixed in this file. Run 'pre-commit run --all-files' locally to reproduce.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
container/Dockerfile.sglang (1)
1-436
: Trailing-whitespace pre-commit hook is failingThe CI log shows
pre-commit hook 'trailing-whitespace' failed
.
Runpre-commit run --all-files
locally ordocker run --rm -v $PWD:/code pre-commit/pre-commit:3.7.0
to strip the whitespace before pushing.
ln -sf $VIRTUAL_ENV/bin/* /usr/local/bin/ && \ | ||
rm -r wheelhouse | ||
COPY --from=base /workspace/wheels/nixl/*.whl wheelhouse/ | ||
RUN uv pip install ai-dynamo --find-links wheelhouse && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do
uv pip install ai-dynamo nixl --find-links wheelhouse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update in upcoming PRs.
RUN uv pip install ai-dynamo --find-links wheelhouse && \ | ||
uv pip install ai-dynamo-runtime --find-links wheelhouse && \ | ||
uv pip install nixl --find-links wheelhouse && \ | ||
ln -sf $VIRTUAL_ENV/bin/* /usr/local/bin/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think updating PATH variable is generally better than symlink?
# Copy examples | ||
COPY ./examples examples/ | ||
# Copy examples and set up Python path | ||
COPY . /workspace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we copy specific folders, I think examples, tests, and benchmarks copy should do?
# Setup the python environment | ||
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ | ||
RUN apt-get update && \ | ||
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends python3-dev && \ | ||
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential python3-dev libnuma-dev && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is libnuma a requirement for sglang?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is a required dependency for NIXL integration with the sglang backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment please in dockerfile
rm -rf /var/lib/apt/lists/* && \ | ||
uv venv $VIRTUAL_ENV --python 3.12 && \ | ||
echo "source $VIRTUAL_ENV/bin/activate" >> ~/.bashrc | ||
|
||
# Install SGLang and related packages (sgl-kernel, einops, sentencepiece) since they are not included in the runtime wheel | ||
# https://github.com/sgl-project/sglang/blob/v0.4.9.post1/python/pyproject.toml#L18-51 | ||
RUN uv pip install "sglang[runtime_common]>=0.4.9.post1" && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this version different than what is installed in dev container?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version installed in the devel container is built from source commit: sgl-project/sglang#7330
The latest sglang release (0.4.9.post1) contains this fix. The reason why we can't just copy over the build from devel to runtime is there are build failures when trying to build a pip distribution wheel. There isn't a direct way to create a wheel distribution of sglang from the source build: https://docs.sglang.ai/start/install.html#method-2-from-source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use same commit/tag in both places? ideally with an ARG.
Overview:
This PR introduces a slim sglang runtime image which is currently around (~12 GB) in size. Will double confirm before merging.
Details:
agg and disagg examples are passing
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)