Can transition from KSampler to VAE Decode be optimized? #1147

@okolenmi

Description

I have a weak 4 GB GPU, but it looks like this UI almost has enough memory to generate large images (A1111's UI can't even start processing something like this). After KSampler finishes 100% of a 1920*1080 image, I get these error messages:
[The latest (today's) test version of this UI]

--normalvram:

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
!!! Exception during processing !!!
...
CUDA out of memory. Tried to allocate 1.98 GiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Of the allocated memory 2.79 GiB is allocated by PyTorch, and 556.89 MiB is reserved by PyTorch but unallocated.
...
CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.11 GiB is allocated by PyTorch, and 236.83 MiB is reserved by PyTorch but unallocated. 

--lowvram (queue with 3 elements)

100%|██████████████████████████████████████████████████████████████████████████████████| 22/22 [07:05<00:00, 19.35s/it]
!!! Exception during processing !!!
Traceback (most recent call last):
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\nodes.py", line 241, in decode
    return (vae.decode(samples["samples"]), )
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\sd.py", line 626, in decode
    pixel_samples[x:x+batch_number] = torch.clamp((self.first_stage_model.decode(samples) + 1.0) / 2.0, min=0.0, max=1.0).cpu().float()
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\ldm\models\autoencoder.py", line 94, in decode
    dec = self.decoder(z)
          ^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\ldm\modules\diffusionmodules\model.py", line 734, in forward
    h = nonlinearity(h)
        ^^^^^^^^^^^^^^^
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\ldm\modules\diffusionmodules\model.py", line 40, in nonlinearity
    return x*torch.sigmoid(x)
             ^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Prompt executed in 448.38 seconds
Exception in thread Thread-1 (prompt_worker):
Traceback (most recent call last):
  File "threading.py", line 1038, in _bootstrap_inner
  File "threading.py", line 975, in run
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\main.py", line 88, in prompt_worker
    comfy.model_management.soft_empty_cache()
  File "D:\ComfyUI_windows_portable_nightly_pytorch\ComfyUI\comfy\model_management.py", line 554, in soft_empty_cache
    torch.cuda.empty_cache()
  File "D:\ComfyUI_windows_portable_nightly_pytorch\python_embeded\Lib\site-packages\torch\cuda\memory.py", line 164, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

About --normalvram mode: I don't know exactly how the image is processed in the VAE decoder, but if it were possible to free some memory after KSampler finishes, it would let everyone generate larger images than usual.
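The idea of freeing VRAM between the sampler and the decoder can be sketched as below. This is a hypothetical helper, not ComfyUI's actual node code; the only calls assumed to exist are `gc.collect()` and `torch.cuda.empty_cache()` (the latter appears in the traceback above via `soft_empty_cache`):

```python
import gc

def free_vram():
    """Release cached allocator blocks between pipeline stages.

    gc.collect() drops Python-side references to intermediate tensors;
    torch.cuda.empty_cache() then returns the freed allocator blocks to
    the driver, so the next stage (VAE decode) has a better chance of
    allocating its large contiguous buffers.
    """
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed here; nothing GPU-side to release
    return True
```

Note that `empty_cache()` only returns blocks the caching allocator has already freed; tensors still referenced by the sampler (e.g. the latent itself) stay resident, so this helps most when the sampler's working tensors have genuinely gone out of scope.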

The reasons for the failure seem to be different in the normal and low-memory modes, so I can't say anything more about --lowvram mode.
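A rough back-of-the-envelope estimate shows why a one-pass 1920*1080 decode overwhelms a 4 GB card. The channel count (128) and fp32 activations are assumptions typical of the SD VAE decoder, not values taken from this log:

```python
# Rough VRAM estimate for decoding a 1920x1080 SD image in one pass.
# Assumptions (hypothetical, typical for the SD VAE): the latent is
# 1/8 resolution with 4 channels, activations are fp32, and the decoder's
# widest late layers hold ~128-channel feature maps at full resolution.
W, H = 1920, 1080
latent_bytes = (W // 8) * (H // 8) * 4 * 4    # 4 channels, 4 bytes/elem
peak_act_bytes = W * H * 128 * 4              # one full-res 128-ch fp32 map

print(f"latent: {latent_bytes / 2**20:.2f} MiB")
print(f"one full-res 128-ch activation: {peak_act_bytes / 2**30:.2f} GiB")
```

A single such feature map is already about 1 GiB, and the decoder keeps several alive at once (residual connections, the previous layer's output), which lines up with the ~2 GiB allocation attempts in the log. That is also why the tiled-decode fallback exists: each tile's activations are a small fraction of this.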
