
Conversation


@zhyncs zhyncs commented Aug 17, 2024

Motivation

Modification

Checklist

  • Before submitting a PR for review, make sure it has at least passed verification in your local development environment.
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@zhyncs zhyncs marked this pull request as draft August 17, 2024 13:07
@zhyncs zhyncs added the wip label Aug 17, 2024
@zhyncs zhyncs removed the wip label Aug 17, 2024
@zhyncs zhyncs marked this pull request as ready for review August 17, 2024 14:27
@zhyncs (Member Author) commented Aug 17, 2024

Tested with a GCP T4:

(base) root@hostname:/home/me/sglang# python3 -m sglang.launch_server --model Qwen/Qwen1.5-1.8B-Chat --disable-flashinfer-sampling --mem-frac 0.7
server_args=ServerArgs(model_path='Qwen/Qwen1.5-1.8B-Chat', tokenizer_path='Qwen/Qwen1.5-1.8B-Chat', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='Qwen/Qwen1.5-1.8B-Chat', chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.7, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=593901843, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=True, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_mixed_chunk=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=0] Load weight begin. avail mem=14.47 GB
Compute capability below sm80 use float16 due to lack of bfloat16 support.
INFO 08-17 14:35:09 weight_utils.py:225] Using model weights format ['*.safetensors']
INFO 08-17 14:35:09 weight_utils.py:269] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.79s/it]

[gpu=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=10.93 GB
[gpu=0] Memory pool end. avail mem=4.02 GB
[gpu=0] Capture cuda graph begin. This can take up to several minutes.
[gpu=0] max_total_num_tokens=35991, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
INFO:     Started server process [226684]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:57388 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu=0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:57398 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
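For reference, the POST /generate request seen in the log above can be reproduced with a short client sketch. The payload field names follow sglang's native generation API at the time of this PR and the host/port match the defaults in the log; treat both as assumptions if your version differs, and the helper names here are illustrative only:

```python
import json
import urllib.request

def build_generate_request(prompt, max_new_tokens=64, temperature=0.7):
    # Build a payload for the /generate endpoint exercised in the log.
    # Field names ("text", "sampling_params") follow sglang's native
    # generation API; treat them as assumptions for other versions.
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

def send_generate(payload, url="http://127.0.0.1:30000/generate"):
    # POST the JSON payload to the locally running server.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example, assuming the server launched above is still running:
# print(send_generate(build_generate_request("Hello from a T4")))
```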

@zhyncs zhyncs merged commit 9208591 into sgl-project:main Aug 17, 2024
5 checks passed
@zhyncs zhyncs deleted the sm75 branch August 17, 2024 14:45
@zhyncs (Member Author) commented Aug 17, 2024

This check is not added in check_server_args because it would trigger a "Cannot re-initialize CUDA in forked subprocess" error.
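The fork constraint can be illustrated with a small sketch. The helper names and the sm80 threshold for the sampling kernels are assumptions for illustration, not sglang internals; the general pattern is that any CUDA-touching probe must run in a process started with the "spawn" method, since a fork()ed child inheriting an initialized CUDA context raises this error:

```python
import multiprocessing as mp

def supports_flashinfer_sampling(capability):
    # Hypothetical helper: assume the flashinfer sampling kernels need
    # sm80+, so a T4 at sm75 (7, 5) takes the fallback path.
    major, _minor = capability
    return major >= 8

def probe_gpu(queue):
    # In a real server this child would call
    # torch.cuda.get_device_capability(), which initializes CUDA.
    # That is only safe under the "spawn" start method: a fork()ed
    # child that inherits an already-initialized CUDA context fails
    # with "Cannot re-initialize CUDA in forked subprocess".
    queue.put(supports_flashinfer_sampling((7, 5)))  # pretend we probed a T4

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter, so CUDA can initialize cleanly.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=probe_gpu, args=(q,))
    p.start()
    p.join()
    print("flashinfer sampling supported:", q.get())
```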

@zhyncs zhyncs mentioned this pull request Aug 17, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025