chore: Make llama.cpp a default engine, speedup CI #1177
Conversation
Co-authored-by: Anant Sharma <anants@nvidia.com> Signed-off-by: Graham King <graham@gkgk.org>
Actionable comments posted: 0
♻️ Duplicate comments (1)
container/Dockerfile.tensorrt_llm (1)
303-304: Implement C bindings build as previously suggested
This exactly matches the earlier recommendation to run `cd lib/bindings/c && cargo build --release --locked` in the dev image to generate the C include files and the shared library.
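As a sketch, the suggested step could be added to the dev image like this (the `dev` stage name and the artifact location are assumptions; only the cargo command comes from the recommendation above):

```dockerfile
# Hypothetical fragment for the dev image: build the C bindings so later
# stages can copy the generated header files and the shared library.
# `dev` is an assumed stage name; adjust to the Dockerfile's real stages.
FROM dev AS c_bindings
WORKDIR /workspace
RUN cd lib/bindings/c && cargo build --release --locked
# The shared library and generated C headers land under the cargo target directory.
```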
🧹 Nitpick comments (2)
container/Dockerfile.tensorrt_llm (2)
207-207: Consider image size implications of switching Rust profile
You’ve changed the rustup-init profile from `minimal` to `default`, which pulls in extra components (docs, source, rustfmt, etc.). While this unifies the Dockerfiles, it may significantly increase image size. Please verify that this change doesn’t bloat the layer beyond acceptable limits, or evaluate whether you can selectively install only the necessary components.
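If image size does become a problem, one hedged alternative is to keep the `minimal` profile and add only the specific components the build needs. The toolchain choice and component list below are assumptions, not taken from this PR:

```dockerfile
# Sketch: stay on the minimal rustup profile and install components selectively.
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | \
    sh -s -- -y --profile minimal --default-toolchain stable && \
    . "$HOME/.cargo/env" && \
    rustup component add rustfmt clippy
```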
299-299: Refine workspace copy to limit to necessary artifacts
Copying the entire `/workspace` directory from `wheel_builder` may include source files and intermediate artifacts you don’t need at runtime. Consider restricting this to only the built artifacts (e.g., `/workspace/dist`, `/workspace/target`) to reduce image size and surface area.
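A minimal sketch of the narrower copy (the `wheel_builder` stage name and the two paths come from the comment; the destination layout is an assumption):

```dockerfile
# Copy only the built artifacts instead of the whole /workspace tree.
COPY --from=wheel_builder /workspace/dist /workspace/dist
COPY --from=wheel_builder /workspace/target /workspace/target
```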
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- container/Dockerfile.sglang (2 hunks)
- container/Dockerfile.tensorrt_llm (3 hunks)
- container/Dockerfile.vllm (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- container/Dockerfile.sglang
- container/Dockerfile.vllm
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Build Documentation
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (2)
container/Dockerfile.tensorrt_llm (2)
276-280: Consolidated workspace build improves maintainability
Switching to `cargo build --workspace` with the `dynamo-llm/block-manager` feature simplifies the build steps and ensures all crates are built in one go. This aligns well with your goal of unifying Rust build logic across images.
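For illustration, the consolidated step might look like the following (the `--release --locked` flags are assumptions; the `--workspace` flag and the feature name come from the comment above):

```dockerfile
# Build every crate in the workspace in one invocation, with the
# block-manager feature of the dynamo-llm crate enabled.
RUN cargo build --workspace --release --locked \
    --features dynamo-llm/block-manager
```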
298-298: No substantive code on this line.
Previously `mistral.rs` was the default engine for both safetensors and GGUF models. Now it is only the default for safetensors; `llama.cpp` becomes the default for GGUF. Why?
- Since #1177, `llama.cpp` is built in by default, so we can switch.
- `llama.cpp` is very good at running GGUF (but it can't run other model formats), so we should switch.

Dynamo's multi-engine support gives us a secret super-power: we can use the best engine for each specific format or model. We can still run GGUF with mistral.rs by passing `out=mistralrs`.
`llama.cpp` does not need any external libraries.
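To illustrate the engine-selection behaviour described above (the CLI invocation and model path here are hypothetical; only the `out=mistralrs` form comes from the description):

```shell
# GGUF models now default to llama.cpp:
dynamo run /path/to/model.gguf

# Override the default and run the same GGUF with mistral.rs:
dynamo run out=mistralrs /path/to/model.gguf
```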