Slow fine rasterization (k4) on Adreno 640 #83

@raphlinus

Description

I'm getting piet-gpu running on Android (see #82 for snapshot), and running into fine rasterization being considerably slower than expected. The other stages in the pipeline seem fine. I've done some investigation but the fundamental cause remains a mystery.

Info so far: the Adreno 640 (Pixel 4) has a default subgroup size of 64 (though it can also be set to 128 using the Vulkan subgroup size control extension, VK_EXT_subgroup_size_control). That should be fine, as it means that the memory read operations from the per-tile command list are amortized over a significant number of pixels even if CHUNK (the number of pixels written per invocation of kernel4.comp) is 1 or 2. If CHUNK is 4, then the workgroup and subgroup sizes are the same; any larger value results in a partially filled subgroup. There's more detail about the Adreno 6xx on the freedreno wiki.
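To make the CHUNK/subgroup relationship concrete, here's a small sketch of the arithmetic, assuming kernel4 covers a 16x16 pixel tile (256 pixels) and each invocation writes CHUNK pixels. The tile size and subgroup size of 64 are the values discussed here; the function itself is illustrative, not piet-gpu code.

```rust
// Workgroup/subgroup fill arithmetic for fine rasterization.
// Assumption: one workgroup covers a 16x16 tile, one invocation per CHUNK pixels.
const TILE_PIXELS: u32 = 16 * 16;
const SUBGROUP_SIZE: u32 = 64; // Adreno 640 default

/// Returns (invocations per workgroup, full subgroups,
/// invocations left over in a partially filled subgroup).
fn subgroup_fill(chunk: u32) -> (u32, u32, u32) {
    let invocations = TILE_PIXELS / chunk;
    (
        invocations,
        invocations / SUBGROUP_SIZE,
        invocations % SUBGROUP_SIZE,
    )
}

fn main() {
    assert_eq!(subgroup_fill(1), (256, 4, 0)); // 4 full subgroups
    assert_eq!(subgroup_fill(4), (64, 1, 0));  // workgroup == subgroup
    assert_eq!(subgroup_fill(8), (32, 0, 32)); // half-filled subgroup
}
```

This is why CHUNK = 4 is the crossover point: above it, part of the subgroup's lanes go unused.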

There are some experiments that move the needle. Perhaps the most interesting is that commenting out the body of Cmd_BeginClip and Cmd_EndClip at least doubles the speed. This is true even when the clip commands are completely absent from the workload. My working hypothesis is that register pressure from accommodating the clip stack and other state is reducing the occupancy.
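The register-pressure hypothesis can be illustrated with back-of-the-envelope occupancy arithmetic. The register-file size and per-thread register counts below are made-up placeholders, not measured Adreno 640 numbers; the point is only that extra live state (such as a clip stack) can cut the number of resident waves, and with it the GPU's ability to hide memory latency.

```rust
// Illustrative occupancy model: resident waves = register file size
// divided by registers consumed per wave. All constants are hypothetical.
const REG_FILE: u32 = 64 * 1024; // placeholder register-file size, in registers
const WAVE_WIDTH: u32 = 64;      // subgroup size from this issue

/// Hypothetical: how many waves fit given per-thread register usage.
fn resident_waves(regs_per_thread: u32) -> u32 {
    REG_FILE / (regs_per_thread * WAVE_WIDTH)
}

fn main() {
    // Doubling per-thread register use halves the number of resident waves,
    // even if the extra registers (e.g. clip stack state) are never touched.
    assert_eq!(resident_waves(32), 32);
    assert_eq!(resident_waves(64), 16);
}
```

If this model is roughly right, it would explain why merely deleting the clip code bodies helps even on workloads with no clips: the compiler no longer has to reserve registers for that state.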

Another interesting set of experiments involves adding a per-pixel ALU load. The time taken increases significantly with CHUNK. I'm not sure how to interpret that yet. Tweaking synthetic workloads like this may well be the best way to move forward, though I'd love to be able to see the asm from the shader compilation. I'm looking into the possibility of running this workload on the same (or similar) hardware with a free driver such as freedreno, so that I might gain more insight that way.

I've been using Android GPU Inspector (make sure to use at least 1.1), but so far it only gives me fairly crude metrics: things like % ALU capacity and write bandwidth scale with how much work gets done, while other metrics like % shaders busy and GPU % utilization are high in all cases.

There are other things I've been able to rule out: a failure of loop unrolling by the shader compiler, and a failure to account for ptcl reads being dynamically uniform (I manually had one invocation read and broadcast the results, which yielded no change).

I do have some ideas how to make things faster (including moving as much of the math as possible to f16), but the first order of business is understanding why it's slow, especially when we don't seem to be seeing similar problems on desktop GPUs.
