Robust dynamic memory #175

@raphlinus

Description

This issue outlines our plans for robust dynamic memory in piet-gpu. Right now, we essentially hardcode "sufficiently large" buffers for the intermediate results such as path segments, per-tile command lists, and so on. The actual size needed is highly dependent on the scene, and difficult to predict without doing significant processing. This is also an area where support in modern GPUs is sadly lacking and might be expected to evolve. Thus, a satisfactory design involves some tradeoffs.

The goals are to (a) reliably render correct results, except in extremely resource-constrained environments relative to the scene (i.e., when the scene is adversarial in its resource requirements), (b) use modest amounts of memory when more is not required, and (c) not impact performance too much. These goals are in tension.

The general strategy is similar to what's already partially supported in the code, but currently lacks a full CPU-side implementation. A small number of atomic counters (currently 1, but this will increase, as described below) provide bump allocation, and all memory writes are conditional on the offset being less than the allocated buffer size. Currently we have logic to early-out when memory is exceeded, but we may replace that with logic to proceed only when allocation succeeded at the input stage, so that the atomic counter accurately reflects how much memory the successive stage would need in order to succeed. (Fine implementation detail: if the number of stages is no more than 32, atomicOr a bit corresponding to the failed stage, and for early-out do a relaxed atomic load testing the bits for input dependencies. If greater than 32, atomicMax the stage number, which is assigned in decreasing dependency order (so if A depends on B, A < B), and again do a relaxed atomic load confirming this value is less than the minimum of the input dependencies.)
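The ≤32-stage variant of that failure-reporting scheme can be sketched on the CPU with standard Rust atomics; the stage names and dependency masks here are illustrative, not the actual piet-gpu pipeline, and the shader code would use atomicOr/relaxed loads rather than these std calls.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical stage indices; the real pipeline has its own numbering.
const STAGE_BINNING: u32 = 0;
const STAGE_COARSE: u32 = 1;

// One shared failure word; bit i set means stage i ran out of memory.
static FAILED: AtomicU32 = AtomicU32::new(0);

/// Record that `stage` exceeded its allocation (shader side: atomicOr).
fn report_failure(stage: u32) {
    FAILED.fetch_or(1 << stage, Ordering::Relaxed);
}

/// Early-out check: proceed only if none of the stages in `dep_mask`
/// (this stage's input dependencies) have reported failure.
fn inputs_ok(dep_mask: u32) -> bool {
    FAILED.load(Ordering::Relaxed) & dep_mask == 0
}
```

The relaxed ordering suffices because the flag is only a monotonic "poison" bit; the fence back to the CPU provides the real synchronization point.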

Currently there is one command buffer submission all the way from the input scene to the presentation of the rendered frame buffer. In this proposal, we split that in half and fence back to the CPU after the first submission. The first submission is everything up to fine rasterization. On fence back, the CPU checks the value of the atomic counter, and if it's less than or equal to the buffer size, submits a command buffer for fine rasterization.

If it's greater than the buffer size, it reallocates the buffer, rebinds the descriptor sets, and tries again. One reasonable choice for the new size of the buffer is the value of the atomic counter, perhaps rounded up a bit. A heuristic could refine this; for example, if the error is at an earlier pipeline stage, a multiplier could be applied on the reasonable assumption that later stages will also require additional memory.
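The fence-back-and-retry loop can be sketched as below. This is a simulation, not piet-gpu API: `run_coarse` stands in for "submit the coarse pipeline, fence back, read the atomic counter", and the 1/8 headroom factor is an arbitrary illustrative choice for "rounded up a bit".

```rust
/// Simulated coarse pipeline: returns the memory the scene needs. The real
/// atomic counter keeps counting even after writes are suppressed, so the
/// returned value is meaningful even when it exceeds the buffer size.
fn run_coarse(_buf_size: u64, scene_requirement: u64) -> u64 {
    scene_requirement
}

/// Retry loop from the proposal: fence back, read the counter, grow the
/// buffer if the work didn't fit. Returns the buffer size that succeeded,
/// after which fine rasterization can be submitted.
fn size_for_coarse(mut buf_size: u64, scene_requirement: u64) -> u64 {
    loop {
        let needed = run_coarse(buf_size, scene_requirement);
        if needed <= buf_size {
            return buf_size; // fits: safe to submit fine rasterization
        }
        // New size: the counter value, rounded up a bit as headroom.
        // (In the real code this also rebinds the descriptor sets.)
        buf_size = needed + needed / 8;
    }
}
```

Because the counter reports the actual requirement, the loop converges in one reallocation per failed run rather than the lg(n) doublings discussed under "Alternatives considered".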

Related to this work, the blend stack memory is moved to a separate allocation and binding, for a definitive solution to #83. In particular, this allows the main memory buffer to be bound readonly in fine rasterization, which is confirmed to make a significant performance difference on Pixel 4. Note that the amount of blend memory required is completely known at the end of coarse rasterization; there are no dynamic choices in fine rasterization.

For a potential WebGPU port, the number of allocation buffers should increase. In particular, a separate buffer is required for atomics (linked list building and coarse winding number in coarse path rasterization), due to divergence of WGSL from standard shader practice in the types of atomic methods.

The blend stack memory is special, in that if it overflows, the coarse pipeline need not be rerun; it suffices to make sure the buffer is large enough. Note that this is an additional motivation to split that into an additional buffer, as interleaving blend stack and per-tile command list allocations would require a rerun. Also, it is possible to make binning infallible: the worst case is the number of draw objects times the number of bins (at most 256 bins in a 4k x 4k output buffer).
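The infallible-binning bound is simple arithmetic; as a sketch, assuming bins of 256 x 256 pixels (an illustrative figure consistent with 256 bins in a 4k x 4k framebuffer, not a value taken from the source):

```rust
/// Number of bins covering a framebuffer, rounding partial bins up.
/// `bin_px` (the bin edge in pixels) is an assumed parameter.
fn n_bins(width: u32, height: u32, bin_px: u32) -> u32 {
    let bx = (width + bin_px - 1) / bin_px;  // bins per row
    let by = (height + bin_px - 1) / bin_px; // bins per column
    bx * by
}

/// Worst-case binning allocation: every draw object lands in every bin.
/// Sizing the buffer to this bound makes the binning stage infallible.
fn binning_worst_case(n_draw_objects: u32, bins: u32) -> u64 {
    n_draw_objects as u64 * bins as u64
}
```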

Spatial subdivision

In memory-constrained environments the reallocation may fail, either because the GPU is out of memory or because we wish to constrain memory consumed by piet-gpu to allow for other GPU tasks. Our general strategy to succeed in these cases is spatial subdivision. The theory is that a smaller viewport will require less memory for intermediate results, as well as bound resources such as images. That assumption may not hold for adversarial input, but in general should be fairly sound.

To support this case (a lower priority than the above), there is an additional outer loop for spatial subdivision. On the first run, the viewport is the entire frame buffer. On failure of the coarse pipeline (bumping into the memory limit), the viewport is split in half (recursively), and the pipeline is run on each half. Once the last coarse pipeline succeeds, the submission of the last fine rasterization command buffer can signal the present semaphore.
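The outer loop above can be sketched as a recursive split. Everything here is illustrative: the `Viewport` type, the `fits` predicate (standing in for "the coarse pipeline ran within the memory cap"), and the choice to split the longer dimension.

```rust
#[derive(Clone, Copy, Debug)]
struct Viewport { x: u32, y: u32, w: u32, h: u32 }

/// Render `vp`, recursively halving it whenever the coarse pipeline
/// (modeled by `fits`) bumps into the memory limit. The successful
/// sub-viewports are collected in `out` in rendering order.
fn render_with_subdivision(vp: Viewport, fits: &dyn Fn(Viewport) -> bool, out: &mut Vec<Viewport>) {
    if fits(vp) || vp.w.min(vp.h) <= 1 {
        out.push(vp); // coarse succeeded (or the viewport can't shrink)
        return;
    }
    // Split the longer dimension in half and run the pipeline on each part.
    let (a, b) = if vp.w >= vp.h {
        let half = vp.w / 2;
        (Viewport { w: half, ..vp },
         Viewport { x: vp.x + half, w: vp.w - half, ..vp })
    } else {
        let half = vp.h / 2;
        (Viewport { h: half, ..vp },
         Viewport { y: vp.y + half, h: vp.h - half, ..vp })
    };
    render_with_subdivision(a, fits, out);
    render_with_subdivision(b, fits, out);
}
```

In the real pipeline, only the last sub-viewport's fine rasterization submission would signal the present semaphore.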

Also note that staging of resources can fail (filling the image or glyph atlas beyond memory or texture size limits) even before running the coarse pipeline, and that would also trigger spatial subdivision.

Alternatives considered

  • Fully analyze memory requirements CPU-side before submission. This requires duplicating a substantial amount of the coarse pipeline, which would be time-intensive. In addition, conservative memory estimates (based on bounding boxes rather than actual tile coverage, as well as nesting depth) may be wildly larger than actual, a particular problem for blend memory. Further, if the analysis is not conservative, the entire pipeline may fail.

  • Run an analysis pass GPU-side to estimate memory and fence back before running the "real" coarse then fine pipelines. This has most of the disadvantages of other approaches, but potentially fewer retries as scene complexity steps up. Fundamentally it is wasted work on the happy path.

  • Double memory buffer on failure rather than reading back the atomic counter. This may result in lg(n) retries (for large steps in scene complexity), while the proposed approach is proportional to the number of failed pipeline stages and is expected to be small in practice. On the other hand, failure reporting is simpler.

Simplification of Alloc struct

Currently the code has a MEM_DEBUG define which optionally makes Alloc effectively a slice rather than an offset. This is potentially useful for debugging, but is not enforced by any actual mechanism, and, more importantly, there is no support in the Rust runtime (it was developed for the Gio port). Newer work sometimes bypasses this mechanism, so the code is not consistent. I propose simplifying this so we just use offsets, which will also get rid of write_mem and read_mem. We have to be rigorous in always checking allocation failure, but that checking happens at a different place in the code. Medium term, I think the best strategy is to have tests that rerun compute stages with varying allocation sizes, so we exercise the allocation failure cases as fully as possible. One other idea that might be worthwhile is allocating (for tests) a little extra "guard" memory past the actual allocation, then checking that none of those values have been overwritten.
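The guard-memory idea can be sketched as follows; the word count and sentinel value are arbitrary illustrative choices, and in practice the buffer would be read back from the GPU before the check.

```rust
/// Test-only guard region: allocate extra words past the real allocation
/// and fill them with a sentinel that no shader should ever write.
const GUARD_WORDS: usize = 64;
const SENTINEL: u32 = 0xdead_beef;

fn alloc_with_guard(len: usize) -> Vec<u32> {
    let mut buf = vec![0u32; len + GUARD_WORDS];
    for w in &mut buf[len..] {
        *w = SENTINEL;
    }
    buf
}

/// After running a stage, verify no write strayed past the allocation.
fn guard_intact(buf: &[u32], len: usize) -> bool {
    buf[len..].iter().all(|&w| w == SENTINEL)
}
```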

Offset or index

Currently byte offsets are used throughout, resulting in a lot of >> 2 shifts to convert them into indices into uint[]. The byte offset is convenient for Rust and when we were contemplating writing HLSL directly (ByteAddressBuffer indices), but in GLSL it would be more natural, and possibly skip an ALU op, to use u32 indices directly.

I don't think I want to change that at this point, but it's worth considering.
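For concreteness, the two addressing conventions look like this (a minimal sketch; the function names are hypothetical, and the GLSL equivalent would index a uint[] buffer the same way):

```rust
/// Byte-offset convention: every access pays a >> 2 to reach the word index.
fn read_by_byte_offset(mem: &[u32], byte_offset: usize) -> u32 {
    mem[byte_offset >> 2]
}

/// Index convention: the offset is already a u32 index, so no shift needed.
fn read_by_index(mem: &[u32], index: usize) -> u32 {
    mem[index]
}
```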
