Add SRT json decode example #2

hnyls2002 · 2024-01-09T08:07:33Z

Only suitable for SRT backend, using dtype's regex and a comma.

hnyls2002 · 2024-01-09T08:09:46Z

Result of SRT json generation:

Generate a JSON object to describe the basic information of a city.
{
  "name": "New York",
  "population": 8500000,
  "area": 3026000,
  "latitude": 40.712786,
  "country": "United States",
  "timezone": "Eastern Standard Time"
}

* test: test cases of combining multiple attention kernel calls to implement a sequence parallel kernel. Verified with 2 sp workers * fix: simplify flashinfer kernel initialization (begin_forward() and end_forward()) * test: add logic for sp worker 1 which is basically the same but with different orders of kernel calls * chore: format tweak * feat: a general seq parallel attention kernel that achieves workload balance * fix: minor tweak loop iteration within ring attention * feat [radix_attention]: seq_parallel kernel with sync communication. TODO: turn communication into async fashion and overlap it with computation * test: update test cases for seq parallel attn kernel. Need to disable kv cache management before testing because we haven't implemented kv cache management for seq parallel yet * chore [radix_attention]: format tweak * feat: async communication within ring attention * fix [parallel_utils]: add missed files * fix [infer_batch]: set default values for newly added sp-related metadata * fix [bench_latency]: minor fixes to input args * feat [parallel_utils]: get actual tp rank and size when both TP and SP are enabled * feat [linear]: add QKVParallelLinear * feat [llama2]: update llama model to use our QKVParallelLinear * feat [model_runner]: initialize model parallel with sequence parallel * fix [infer_batch]: 1. a minor issue when calling get_prefill_indices; 2. flashinfer intialization args * fix [bench_latency]: load model with sp_rank * feat [radix_attention]: automatically dispatch to seq-parallel attn kernel when sp_size > 1 * debug: stash current debug changes * fix [radix_attention]: reshape q tensor before running the kernel * bug fix for sp layout types * fix: adjust tensor layout. TODO: fix many dirty hacks and hardcoded values * fix [wip]: disable p2p communication within ring attention for now. TODO: fix the bug that causes communication hang. * chore [bench_latency]: disable decode for now since we haven't supported it * upstream with correct prefill sp layout * fix early exit on decode SP * chore: tweak format * update layout * bug fix * fix [linear, radix_attention]: fix q head indexes per SP worker to align with GQA setting. * fix [infer_batch]: set up flashinfer kernels for the batch size > 1 case * chore: tweak format * fix [radix_attention]: revert commented-out kv cache store operations in normal attention * fix: adjust k, v tensor shape to align with both TP and SP setting * chore [llama2]: minor adjustment * fix: update bench_latency to evenly distribute each sequence across all SP workers to avoid the layout issue * test: update test cases to align with current kernel in args * fix [model_runner]: initialize TokenToKVPool with correct num_heads and enable KV cache store in SP attention * chore [radix_attention]: clean up comments * fix [model_runner]: correct num_heads in memory profiling as well to avoid OOM * fix [infer_batch]: adopt SP KV cache allocation * feat [linear]: correctly partition q proj along the num_heads dimension with GQA * chore [llama2]: clean up stable variables * feat [infer_batch]: adjust positions to SP layout when preparing input_metadata * feat [infer_batch]: use dedicate paged attn kernel for cross-SP-shard attn * feat [parallel_state]: creat sequence parallel comm groups * test [sp_comm_group]: simple test case with sp_size = 2 * doc [parallel_state]: doc string for our SP group organization * fix [infer_batch]: add padding zeros to positions tensor and out_cache_loc to fix positional encoding and KV cache store * feat [radix_attn, infer_batch]: create masks for padded sequences and now attn works for unevenly-distributed sequenses too * chore [bench_latency]: revert original prompts * fix [parallel_state]: rename "actual" to "kv" * refactor [radix_attention]: unified two cases with differnt comm-comp tradeoffs * chore: rename "actual_tp_[size|rank]" to "kv_tp_[size|rank]" * fix [infer_batch]: ensure prefix_lens is not None in init_flashinfer_args * fix [infer_batch]: only pad positions and out_cache_loc for prefill * chore [linear]: clean up and revise comments * chore [parallel_state]: revise comments * chore [linear]: revise comments and class names * chore [radix_attention]: add defensive checks --------- Co-authored-by: ZYHowell <yhzhuang@cmu.edu>

Enable bench_serving benchmark for SGLang + Add `fork` and `batch` to Example Script

update from geon

Remove duplicate for fp8 groupgemm and remove CN docs

* [SW-223847]: import awq_dequantize if cuda avaialble * fix * fix * fix --------- Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>

* Fix ut mla-test-1-gpu-amd (sgl-project#4813) Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> * Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner (sgl-project#4638) * [k8s] Clarified the usage of shared memory. (sgl-project#4341) * gemma3: impl `get_attention_sliding_window_size` for attn init (sgl-project#4823) * add partial_json_parser and einops (sgl-project#4827) * fix the release doc dependency issue (sgl-project#4828) * Update doc for DeepSeek-V3-0324 (sgl-project#4825) * deps: lazy import optional dependencies `gguf` and `torchvision` (sgl-project#4826) * Update MMMU Benchmark instructions (sgl-project#4694) * Fix the nightly eval by lowering the threshold of `neuralmagic/gemma-2-2b-it-FP8` (sgl-project#4830) * Basic Cleanup (sgl-project#4833) * Support (1 <= dp < tp) in the dp attention in DeepEP (sgl-project#4770) Co-authored-by: Cheng Wan <cwan39@gatech.edu> * [Fix] Add compressed_tensors as deps (sgl-project#4819) * Fix error due to CustomAllreduce setup failure (sgl-project#4815) Signed-off-by: Kebe <mail@kebe7jun.com> * use default for torch.ops (sgl-project#4835) * [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder (sgl-project#3969) * [Misc] Fix issues reported by torchfix (sgl-project#4837) * Include context length in /v1/models response. (sgl-project#4809) * [Fix] `self.worker` assignment in `TpModelWorker` and refactor references (sgl-project#4788) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> * Fix the lora adapter when lora path is none (sgl-project#4799) Co-authored-by: Beichen Ma <mabeichen12@gmail.com> * fix: fix typo of comments in w8a8_fp8.py (sgl-project#4843) * Remove retry in nightly tests (sgl-project#4846) * Fix CI of test_patch_torch (sgl-project#4844) * IPv6 support (sgl-project#3949) Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> * ci: add condition for daily docker build (sgl-project#4487) * [Fix] fix output_top_logprobs is not exist (sgl-project#4597) * fix: when use SGLANG_PORT this env,port is str (sgl-project#4528) Signed-off-by: rongfu.leng <lenronfu@gmail.com> * Support Page Size > 1 for FA3 (sgl-project#4832) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> * Fix Engine error when enabling DP attention (sgl-project#4648) * fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest (sgl-project#4681) * Support controlling nsys start and end range programmatically (sgl-project#4688) * Remove empty tool function name (sgl-project#4704) Signed-off-by: Kebe <mail@kebe7jun.com> * Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. (sgl-project#4712) * get the python version from env (sgl-project#4729) * Fix torch.cuda.MemPool() internal assertion failure (sgl-project#4687) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * Super tiny remove unused code (sgl-project#4750) * Support with_stack and record_shapes in profiler (sgl-project#4740) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * test: reduce `mem_fraction_static` for gemma3 vision test (sgl-project#4840) * Fix CI tests (sgl-project#4853) * Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed (sgl-project#4855) * Revert "get the python version from env (sgl-project#4729)" (sgl-project#4863) * [Feature] add multi-rank support for Lora (sgl-project#4492) Co-authored-by: rudy152 <czh1137892874@gmail.com> * Clean up `import vllm` in quantization/__init__.py (sgl-project#4834) * Fix wrong variable name when stopping memory profile (sgl-project#4772) * [Feat] support deepgemm for cmake (sgl-project#4864) * Make torch compile configurable for biased_grouped_topk (sgl-project#4749) * update sgl-kernel test ci (sgl-project#4866) * fix sampling issue (sgl-project#4871) * bump sgl-kernel 0.0.5.post4 (sgl-project#4768) * fix sgl-kernel cu118 build (sgl-project#4872) * [Feature] Support FA3 backend for MLA (sgl-project#4831) * upgrade sgl-kernel 0.0.5.post4 (sgl-project#4873) * update torch compile doc (sgl-project#4874) * bump v0.4.4.post3 (sgl-project#4878) * Fix BadRequestError wrong arguments and remove openai dependency (sgl-project#4882) * Improve stack trace of retry errors (sgl-project#4845) * Tiny fix doc error (sgl-project#4795) * [Docs] Update DeepGEMM at README.md (sgl-project#4886) * Update CODEOWNERS (sgl-project#4889) * Delete test_deep_gemm.py (sgl-project#4891) * Add deepseek style fused moe group gate selection kernel (sgl-project#4530) * quick fix: add default for new kernel (sgl-project#4898) * remove setup for sgl-kernel (sgl-project#4899) * [Misc] Clean m.def and add Development Tips (sgl-project#4890) * fix allreduce test (sgl-project#4909) * Support page size > 1 + eagle (sgl-project#4908) * Fix retract for page size > 1 (sgl-project#4914) * [Feature] use pytest for sgl-kernel (sgl-project#4896) * fix bmm fp8 (sgl-project#4926) * Fix the timeout for unit-test-2-gpu in pr-test.yml (sgl-project#4927) * Fix 2-gpu CI test and suppress some warnings (sgl-project#4930) * [feat] add fa3 in sgl-kernel (sgl-project#4902) Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> * Fix sglang frontend's incorrect dependency on torch (sgl-project#4931) * [Fix] avoid stream sync and torch compile in prefill for fa3 backend (sgl-project#4932) * cleanup sgl-kernel (sgl-project#4933) * [Fix] Improve Lora tests and reduce CI runtime (sgl-project#4925) * Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (sgl-project#4883) Co-authored-by: ch-wan <cwan39@gatech.edu> * [Fix] Add torch compile for torch.clamp back (sgl-project#4936) * Fix oom error for large page size (sgl-project#4913) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * [feat] interface for platforms abstraction (sgl-project#4928) * [Fix] revert clean m.def for cudagraph (sgl-project#4944) * refactor: multimodal data (sgl-project#4754) * bump sgl-kernel v0.0.6 (sgl-project#4950) * [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (sgl-project#4953) * use fa3 in sgl-kernel (sgl-project#4954) * Revert PR 4764 & 4813 related to R1 RoPE (sgl-project#4959) * [Feature] Support DeepEP Low Latency (sgl-project#4767) Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: ch-wan <cwan39@gatech.edu> * update bench_serving (sgl-project#4958) * Prevent memory leak of retract_decode when page_size > 1 (sgl-project#4977) * [VLM RLHF] Take Image input for verl vlm rollout (sgl-project#4915) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Co-authored-by: GeLee <leege233@gmail.com> * Large page size aligned hierarchical caching (sgl-project#4581) * bug fix for hicache host eviction (sgl-project#4989) * sgl scaled_fp8_quant support output padding (sgl-project#4861) * Add Eagle Speculative Decoding to FA3 Backend (sgl-project#4951) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: zcnrex <zcnrex@gmail.com> * Update tokenizer_manager.py (sgl-project#5008) * [sgl-kernel] per token group quant support COLUMN MAJOR (sgl-project#4817) * update cutlass tag (sgl-project#5011) * Feature/revise docs ci (sgl-project#5009) * fix: fix illegal cuda memory access at fused_moe_kernel (sgl-project#4727) Co-authored-by: yuethe <yuethe@tencent.com> * [Build] Support build sgl-kernel with ccache (sgl-project#5020) * fix deepgemm as well (sgl-project#5030) * try to fix ci oserror (sgl-project#5024) * Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5005) * Small refactor DeepEPMode to clean up code a bit (sgl-project#4992) * [Fix] fix fa3 build at cu118 (sgl-project#5036) * Revert "Replace enable_flashinfer_mla argument with attention_backend" (sgl-project#5048) * bump sgl-kernel v0.0.7 (sgl-project#5046) * update eagle-3 docs (sgl-project#4796) Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> * Add LlavaLlamaForCausaLM in MultiModal Processors (sgl-project#5039) Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> * Update the retry count (sgl-project#5051) * upgrade sgl-kernel v0.0.7 (sgl-project#5049) * [2/3] fix dsv3 awq issue (sgl-project#4625) Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> * Feature/revise docs ci (sgl-project#5056) * Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (sgl-project#5057) * [fix] remove `cuda_device_count_stateless` (sgl-project#5060) * Small refactor DeepEPDispatcher into subclasses (sgl-project#4994) * Support async DeepEP by splitting into two stages (sgl-project#4995) * Cleanup unused resources after DeepEP operation (sgl-project#4996) * Add DeepSeek V3/R1 shared experts fusion (sgl-project#4918) * [deepep] fix: shared experts are not initialized when shared experts fusion is enabled (sgl-project#5072) * fix dummy-load deepseekv2 (sgl-project#4535) * support sgl-kernel on blackwell (sgl-project#5074) * FA3 Spec Decoding to support top k = 1 and add cuda graph support (sgl-project#5050) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Chunan Zeng <zcnrex@gmail.com> * [Revision] Replace enable_flashinfer_mla argument with attention_backend (sgl-project#5052) * upgrade transformers 4.51.0 (sgl-project#5088) * sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5079) * bump sgl-kernel 0.0.8 (sgl-project#5089) * python transfer custom allreduce from trt kernel to vllm kernel (sgl-project#5080) * bump v0.4.4.post4 (sgl-project#5091) * Fix: Reduce the number of document ci attempts to avoid long ci running (sgl-project#5097) Co-authored-by: shuaills <shishuaiuoe@gmail.com> * Add Llama4 support (sgl-project#5092) Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: ispobock <ispobaoke@163.com> * Fix refactor error - fp8.py (sgl-project#5106) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * bump v0.4.5 (sgl-project#5117) * Workaround for async copy issue in HPU eager mode (sgl-project#1) Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> * [SW-223847]: Fix sgl_kernel module not available (sgl-project#2) Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai> * [Base] Enable torch compile (sgl-project#4) * [SW-226331] disable dynamic shape in torch compile mode Signed-off-by: Mohit Sinha <msinha@habana.ai> --------- Signed-off-by: Kebe <mail@kebe7jun.com> Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> Signed-off-by: rongfu.leng <lenronfu@gmail.com> Signed-off-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Signed-off-by: Mohit Sinha <msinha@habana.ai> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> Co-authored-by: AinL <gmlwns5176@gmail.com> Co-authored-by: Jiří Suchomel <jiri.suchomel@statsperform.com> Co-authored-by: Juwan Yoo <ryan@tmfi.us> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Ravi Theja <ravi03071991@gmail.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Daniel Holanda <holand.daniel@gmail.com> Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com> Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Jon Durbin <jon@jondurbin.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Qiaolin Yu <qy254@cornell.edu> Co-authored-by: Beichen Ma <mabeichen12@gmail.com> Co-authored-by: Jiaqi <57028284+ZhuJiaqi9905@users.noreply.github.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Vincent <vincentzhongy+githubvincent4@gmail.com> Co-authored-by: warjiang <1096409085@qq.com> Co-authored-by: lambert0312 <lambert80.ios@gmail.com> Co-authored-by: rongfu.leng <lenronfu@gmail.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: BroadbentJim <BroadbentJim@users.noreply.github.com> Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai> Co-authored-by: DavidChan <chengwei0519@163.com> Co-authored-by: chaobo jia <91889375+jcbjcbjc@users.noreply.github.com> Co-authored-by: rudy152 <czh1137892874@gmail.com> Co-authored-by: Fr4nk1in <sh.fu@outlook.com> Co-authored-by: yinfan98 <1106310035@qq.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> Co-authored-by: SEPLOS <seplos@aliyun.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: GeLee <leege233@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: zcnrex <zcnrex@gmail.com> Co-authored-by: Kaiyu Yang <yangky@umich.edu> Co-authored-by: renxin <90580890+renxinx@users.noreply.github.com> Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com> Co-authored-by: yuethe <yuethe@tencent.com> Co-authored-by: simveit <69345428+simveit@users.noreply.github.com> Co-authored-by: Yifan Zhang <zhangyif21@mails.tsinghua.edu.cn> Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: Tommy Yang <tommyyang0524@gmail.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Rahul Vijayaraghavan <rahul.vijayaraghavan@intel.com> Co-authored-by: Rahul Vijayaraghavan <rvijayaraghavan@habana.ai> Co-authored-by: Jay Thakur <jthakur@habana.ai> Co-authored-by: Anshuman Tripathy <atripathy@habana.ai>

[Triton CodeGen] Fix an issue when generating Triton programs from mugraphs

Merge sgl-project/sglang into main branch#20250825

add testing for cfg

hnyls2002 added 2 commits January 9, 2024 08:05

add srt json example

bea526d

add latitude to show float generation

907a1c4

merrymercy added 2 commits January 9, 2024 20:33

fix

b2ad435

update

6528a88

merrymercy merged commit 331848d into main Jan 9, 2024

merrymercy deleted the srt-json branch January 9, 2024 20:35

Rookie-Kai mentioned this pull request Aug 14, 2024

[Bug] Always Watch Dog TimeOut #1093

Closed

4 tasks

wonderisland mentioned this pull request Sep 19, 2024

[Bug] illegal memory access encountered #1467

Closed

5 tasks

learninmou mentioned this pull request Sep 25, 2024

[Bug] sglang run for few hours, it will stop returning valid response #1270

Closed

5 tasks

stbaione referenced this pull request in nod-ai/sglang Nov 13, 2024

Merge pull request #2 from stbaione/shortfin-benchmark-updates

4c1fc1f

Enable bench_serving benchmark for SGLang + Add `fork` and `batch` to Example Script

jischein mentioned this pull request Jan 7, 2025

[Bug] Long output and issues when running benchmark_serving.py on DeepSeek-V3 #2746

Closed

5 tasks

kbumsik referenced this pull request in DeepAuto-AI/sglang Jan 23, 2025

Merge pull request #2 from gmlwns2000/hip12-offload-add-hip

68a3150

update from geon

CSEEduanyu mentioned this pull request Jan 26, 2025

[Bug] NCCL Crash with SIGSEGV Frequently when deploying deepseek v3 #2803

Closed

5 tasks

zhaotyer mentioned this pull request Feb 14, 2025

[Bug] DeepSeek-R1-BF16 can't output with /v1/chat/completions on 4 node*8*A100 #3572

Closed

5 tasks

lambert0312 mentioned this pull request Feb 18, 2025

Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 #3582

Merged

ToughK mentioned this pull request Feb 18, 2025

[Bug] sglang crashed when use enable_dp_attention running DeepSeekV3 on 2x8xH100 #3658

Closed

5 tasks

mahaocong90 mentioned this pull request Feb 26, 2025

[Bug] H20 8 gpu x 2 with --enable-dp-attention occurred CUDA error: an illegal memory access #3892

Closed

5 tasks

verigle mentioned this pull request Feb 27, 2025

[Bug] Model Stuck at Prefill and then throw "Watchdog Timeout" Error After Idle Period (Deepseek-r1:671b on two H100*8) #3836

Closed

5 tasks

zcnrex pushed a commit to zcnrex/sglang that referenced this pull request Mar 5, 2025

Merge pull request sgl-project#2 from hebiao064/fix_fp8_groupgemm

0a2dc38

Remove duplicate for fp8 groupgemm and remove CN docs

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

Add SRT json decode example (sgl-project#2)

d76e81c

17Reset mentioned this pull request Mar 18, 2025

[Bug] The --dp-size option will cause an error #4544

Closed

5 tasks

riou-chen mentioned this pull request Apr 17, 2025

[Bug] run eagle3 failed #5448

Closed

NorthmanPKU pushed a commit to NorthmanPKU/sglang that referenced this pull request May 16, 2025

Merge pull request sgl-project#2 from mirage-project/fix

d3a0afc

[Triton CodeGen] Fix an issue when generating Triton programs from mugraphs

dongyibo mentioned this pull request May 19, 2025

[Bug] eagle2【CUDA error: an illegal memory access was encountered】 #6309

Open

5 tasks

yuleiqin mentioned this pull request May 26, 2025

[Bug] main pd version Exception: Failed to encode tensor map: 700 #6590

Closed

5 tasks

EricKing626 mentioned this pull request Jun 10, 2025

[Bug] Deepseek EP + DP Fail and Accuracy Crush #7041

Closed

5 tasks

Xerxes-cn mentioned this pull request Jul 14, 2025

[Bug] Error when running qwen3 moe model on 5090 multi-gpu #8011

Open

5 tasks

lingyaoluu mentioned this pull request Jul 25, 2025

[Bug] MTP CUDA error: an illegal memory access was encountered #8336

Open

5 tasks

zlH518 mentioned this pull request Jul 27, 2025

[Feature] (Willing to PR) Avoid KV cache occupying GPU memory when not used #2542

Closed

2 tasks

killer180 pushed a commit to killer180/sglang that referenced this pull request Aug 10, 2025

Merge pull request sgl-project#2 from sgl-project/main

2825b2e

Merge sgl-project/sglang into main branch#20250825

waynehong666 mentioned this pull request Aug 12, 2025

[Bug] sglang.launch_server error when loading embedding models with --dp 4 #9116

Closed

5 tasks

ericschreiber mentioned this pull request Aug 13, 2025

[Bug] CUDA error: uncorrectable ECC error encountered when using HiCache with xPyD disaggregation. #9151

Open

5 tasks

glenliu21 pushed a commit to glenliu21/sglang that referenced this pull request Aug 15, 2025

Merge pull request sgl-project#2 from glenliu-monoid/preview

bc9a194

add testing for cfg

Copilot AI mentioned this pull request Aug 21, 2025

Add comprehensive memory leak tracking infrastructure for VLM/LLM OOM issues #9422

Closed

gaolaobao mentioned this pull request Aug 25, 2025

[Bug] RTX 5060: RMSNorm failed, same as the #7249 issue, when running qwen2.5-0.5b-instruct model. #9600

Open

5 tasks

raindaywhu mentioned this pull request Aug 22, 2025

[Feature] Shared Expert Standalone for large-scale EP on Ascend #9438

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add SRT json decode example #2

Add SRT json decode example #2

Uh oh!

hnyls2002 commented Jan 9, 2024

Uh oh!

hnyls2002 commented Jan 9, 2024

Uh oh!

Uh oh!

Add SRT json decode example #2

Add SRT json decode example #2

Uh oh!

Conversation

hnyls2002 commented Jan 9, 2024

Uh oh!

hnyls2002 commented Jan 9, 2024

Uh oh!

Uh oh!