-
Notifications
You must be signed in to change notification settings - Fork 586
feat: Update container with better EFA/RDMA support #1333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add a build arg to script to use EFA ref for UCX. Otherwise, use the previoud ref. Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
WalkthroughThe changes introduce improved flexibility and support for building with RDMA and EFA capabilities. The Dockerfile adds installation of RDMA-related dependencies, parameterizes the UCX version with a build argument, and enforces a release build type for NIXL. The build script adds a Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant build.sh
participant Dockerfile
User->>build.sh: Run with or without --make-efa
build.sh->>build.sh: Parse options
alt --make-efa specified
build.sh->>build.sh: Set NIXL_UCX_REF to EFA commit
else
build.sh->>build.sh: Use default UCX ref
end
build.sh->>Dockerfile: Pass NIXL_UCX_REF as build-arg
Dockerfile->>Dockerfile: Install RDMA deps
Dockerfile->>Dockerfile: Checkout UCX at NIXL_UCX_REF
Dockerfile->>Dockerfile: Build UCX and NIXL (release mode)
Poem
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms (2)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
container/Dockerfile.vllm (1)
89-93
: Optimize UCX repository clone depth and checkout
Switching to dynamic$NIXL_UCX_REF
is helpful, but thegit clone
fetches full history. Consider a shallow clone:RUN cd /usr/local/src && \ - git clone https://github.com/openucx/ucx.git && \ + git clone --branch ${NIXL_UCX_REF} --depth 1 https://github.com/openucx/ucx.git && \ cd ucx && \ git checkout $NIXL_UCX_REF && \ ./autogen.sh && ./configure \This reduces image size and clone time.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
container/Dockerfile.vllm
(2 hunks)container/build.sh
(4 hunks)
🔇 Additional comments (6)
container/build.sh (3)
115-115
: Introduce EFA-specific UCX commit reference
The newNIXL_UCX_EFA_REF
variable cleanly captures the EFA-capable UCX commit hash.
252-254
: Handle--make-efa
option to override UCX ref
The new CLI flag correctly assignsNIXL_UCX_REF
to the EFA commit. Consider whether it should be mutually exclusive with other ref overrides (e.g.,--release-build
) to avoid unintended combinations.I recommend verifying that invoking
--make-efa
does not inadvertently skip or override other intended build arguments.
353-354
: Update help message for new EFA option
The help text now documents--make-efa
, which aligns with the added functionality.container/Dockerfile.vllm (3)
82-82
: Parameterize UCX checkout reference
IntroducingARG NIXL_UCX_REF=v1.19.x
makes the UCX version pluggable. Ensure the default tagv1.19.x
matches a valid branch or tag in the UCX repo, or consider using a more precise default (e.g.,v1.19.5
).Please validate that
v1.19.x
resolves to an existing ref inhttps://github.com/openucx/ucx.git
.
127-127
: Enforce release build type for ARM64 Meson
The--buildtype=release
flag ensures optimized NIXL builds on ARM64. This aligns with best practices.
134-134
: Enforce release build type for x86 Meson
Similarly, the release build for non-ARM architectures standardizes performance builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the discussion I think this change will not alter existing EFA installation (not really enabling support), approving in case we want to cherry-pick this for quickly work around the regression.
That being said, should follow up on this discussion for possible alternative for EFA support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comment about supporting EFA by default in future - if we are keeping it opt-in, please document it somewhere easy to find. If we can make it supported by default soon without tradeoffs, let's do that.
Update UCX SHA1 from NIXL. Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
README.md (1)
74-75
: Refine section heading and grammar for clarity.The plural “cloud environments” section only covers AWS; rename to “Note about AWS environments:”. Also add a comma after “AWS” for correct phrasing and format the flag inline as code:
- Notes about cloud environments: - If deploying Dynamo in AWS then make sure to build container with EFA support using the --make-efa flag. + ## Note about AWS environments: + - If deploying Dynamo in AWS, make sure to build the container with EFA support using the `--make-efa` flag.🧰 Tools
🪛 LanguageTool
[typographical] ~75-~75: Consider adding a comma.
Context: ...vironments: - If deploying Dynamo in AWS then make sure to build container with EFA s...(IF_THEN_COMMA)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
README.md
(1 hunks)container/build.sh
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- container/build.sh
🧰 Additional context used
🪛 LanguageTool
README.md
[typographical] ~75-~75: Consider adding a comma.
Context: ...vironments: - If deploying Dynamo in AWS then make sure to build container with EFA s...
(IF_THEN_COMMA)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
Signed-off-by: Adit Ranadive <aranadive@nvidia.com>
Overview:
This PR gets dynamo NIXL EFA/RDMA support in container inline with what NIXL base container will have.
Details:
Modified Dockerfile.vllm to install correct RDMA headers to detect IB devices correctly when reinstalling UCX. Also, the container build script is modified to build for EFA using a specific known good ref (similar to what we will have in NIXL). Updated the NIXL build to use release type.
Where should the reviewer start?
container/Dockerfile.vllm
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores
Documentation