
Conversation

lgibelli (Contributor)

This PR adds support for using an external vLLM server with the pipeline, enabling significant performance improvements by keeping the model loaded between runs.

Changes

  • vllm_server_manager.py: New standalone script to manage a persistent vLLM server with health monitoring and automatic restart capabilities
  • pipeline.py modifications:
    • Added --vllm-url flag to connect to external vLLM servers
    • Added model verification to ensure the correct model is loaded
    • Added robust retry logic with exponential backoff (see the sketch after this list)
    • Skip GPU check and model download when using external server
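
As a rough sketch of how the verification-plus-retry path might look (the function name, retry counts, and timeouts here are illustrative placeholders, not the actual pipeline.py code; it relies on the /v1/models endpoint that vLLM's OpenAI-compatible server exposes):

    import time

    import requests


    def verify_external_server(vllm_url: str, expected_model: str,
                               max_retries: int = 6, base_delay: float = 1.0) -> None:
        """Confirm the external vLLM server is up and serving the expected
        model, retrying with exponential backoff while it comes online."""
        for attempt in range(max_retries):
            try:
                # vLLM's OpenAI-compatible server lists its loaded models here.
                resp = requests.get(f"{vllm_url}/v1/models", timeout=10)
                resp.raise_for_status()
                served = [m["id"] for m in resp.json().get("data", [])]
                if expected_model in served:
                    return  # Correct model is loaded; safe to start processing.
                raise RuntimeError(f"{vllm_url} is serving {served}, expected {expected_model}")
            except requests.RequestException:
                # Not reachable (or not ready) yet: back off exponentially.
                time.sleep(base_delay * 2 ** attempt)
        raise RuntimeError(f"Could not reach vLLM server at {vllm_url} after {max_retries} attempts")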

Benefits

  • Save 30-60+ seconds per pipeline run by avoiding model reloading
  • Run multiple pipeline instances against the same server
  • Better resource utilization with persistent GPU memory allocation
  • Automatic server restart on crashes (configurable)
  • Clean separation of server infrastructure from processing logic

Usage

Start the server manager

python -m olmocr.vllm_server_manager

Run pipeline with external server

python -m olmocr.pipeline workspace --vllm-url http://localhost:30024 --pdfs *.pdf
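
The server manager started in the first step is what provides the health monitoring and automatic restarts. As a minimal sketch of the kind of supervision loop such a script could run (the model name, port, poll intervals, and startup timeout are illustrative, not taken from vllm_server_manager.py; it assumes vLLM's vllm serve CLI and its /health endpoint):

    import subprocess
    import time

    import requests

    MODEL = "allenai/olmOCR-7B-0225-preview"  # placeholder model name
    PORT = 30024
    HEALTH_URL = f"http://localhost:{PORT}/health"


    def healthy() -> bool:
        try:
            return requests.get(HEALTH_URL, timeout=5).status_code == 200
        except requests.RequestException:
            return False


    while True:
        # Launch vLLM's OpenAI-compatible server as a child process.
        proc = subprocess.Popen(["vllm", "serve", MODEL, "--port", str(PORT)])
        # Give the server time to come up, then monitor it.
        deadline = time.time() + 300
        while time.time() < deadline and not healthy():
            time.sleep(5)
        while proc.poll() is None and healthy():
            time.sleep(10)
        # Server exited or stopped answering health checks: restart it.
        proc.terminate()
        proc.wait()
        time.sleep(5)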

Backward Compatibility

Fully backward compatible - the pipeline works exactly as before if --vllm-url is not provided.

Testing

  • Tested with single and multiple PDFs
  • Verified server restart functionality
  • Confirmed model verification works correctly
  • Backward compatibility verified

@jakep-allenai (Collaborator)

Sounds like a good start, and something we'd like to support. I'd suggest removing the vLLM server manager; the user can just call vllm serve etc. as they usually would. Then try to refactor things so that as little as possible changes in the main code between the local case and the external-server case. I do like the idea of checking that the right model is loaded, but you can do that in both cases (e.g., keep the await-server-ready step in both cases).

@lgibelli (Contributor, Author)

Thanks for the feedback, will update the PR ASAP.

@jakep-allenai (Collaborator)

Closing as we went with @haydn-jones's solution.
