@shibe2 shibe2 commented Oct 18, 2023

When broadcasting, each 2D plane of src0 is matched with multiple 2D planes of src1, so a plane of src0 needs to be copied and/or de-quantized only once for all the GEMM operations that use it. A more natural way to handle this is to create an outer loop over src0 and do the copying and de-quantization there.

Previously, de-quantization was performed before each GEMM. Now it is moved to an outer loop.
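To illustrate the loop reordering, here is a minimal sketch that only counts de-quantization calls; it is not the code from this PR. The function names, the parameters `ne02` / `ne12` (number of 2D planes in src0 / src1), and the mapping `i02 = i12 / r2` are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical setup: src0 has ne02 planes, src1 has ne12 planes, and
 * ne12 is a multiple of ne02. r2 = ne12 / ne02 is the broadcast ratio. */

/* Before: one loop over src1 planes, de-quantizing src0 ahead of each GEMM. */
static int count_dequant_per_gemm(int ne02, int ne12) {
    assert(ne02 > 0 && ne12 % ne02 == 0);
    int r2 = ne12 / ne02;
    int dequant = 0;
    for (int i12 = 0; i12 < ne12; i12++) {
        int i02 = i12 / r2;   /* src0 plane matched with this src1 plane */
        (void)i02;
        dequant++;            /* de-quantize src0 plane i02 again */
        /* GEMM of src0 plane i02 with src1 plane i12 would run here */
    }
    return dequant;
}

/* After: outer loop over src0 planes, de-quantizing once per plane. */
static int count_dequant_outer_src0(int ne02, int ne12) {
    assert(ne02 > 0 && ne12 % ne02 == 0);
    int r2 = ne12 / ne02;
    int dequant = 0;
    for (int i02 = 0; i02 < ne02; i02++) {
        dequant++;            /* de-quantize src0 plane i02 once */
        for (int i12 = i02 * r2; i12 < (i02 + 1) * r2; i12++) {
            /* GEMM of src0 plane i02 with src1 plane i12 would run here */
        }
    }
    return dequant;
}
```

With 8 planes in src0 and 32 in src1 (a GQA-like ratio of 4), the first variant de-quantizes 32 times and the second 8 times, while both perform the same 32 GEMMs.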

There is still duplication in the case of broadcasting over dimension 3. Handling that properly would require a more substantial change, namely storing 3D instead of 2D slices of src0 in VRAM. I don't know of a case where broadcasting over dimension 3 is currently used, so I leave it for the future. Nevertheless, the code produces correct results even in this case.
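The remaining duplication can be seen in a small counting sketch. It assumes a loop nest in which only one 2D slice of src0 is held at a time, so each plane is reloaded for every src1 index along dimension 3. The loop order, the function name, and the `ne0x` / `ne1x` parameters are hypothetical and may differ from the actual implementation.

```c
#include <assert.h>

/* Hypothetical shapes: src0 is ne02 x ne03 planes, src1 is ne12 x ne13
 * planes, with ne12 a multiple of ne02 and ne13 a multiple of ne03.
 * Returns the number of de-quantizations; *gemm_out receives the number
 * of GEMM operations. */
static int count_dequant_2d_cache(int ne02, int ne03, int ne12, int ne13,
                                  int *gemm_out) {
    assert(ne02 > 0 && ne03 > 0);
    assert(ne12 % ne02 == 0 && ne13 % ne03 == 0);
    int r2 = ne12 / ne02;   /* broadcast ratio over dimension 2 */
    int r3 = ne13 / ne03;   /* broadcast ratio over dimension 3 */
    int dequant = 0;
    int gemm = 0;
    for (int i03 = 0; i03 < ne03; i03++) {
        for (int i13 = i03 * r3; i13 < (i03 + 1) * r3; i13++) {
            for (int i02 = 0; i02 < ne02; i02++) {
                dequant++;  /* plane (i02, i03) is reloaded for every i13 */
                for (int i12 = i02 * r2; i12 < (i02 + 1) * r2; i12++) {
                    gemm++; /* src0 plane (i02, i03) x src1 plane (i12, i13) */
                }
            }
        }
    }
    *gemm_out = gemm;
    return dequant;
}
```

In this sketch the de-quantization count is `ne02 * ne13` rather than the ideal `ne02 * ne03`; the two coincide whenever there is no broadcasting over dimension 3.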

In the case of matrix-vector multiplication, de-quantization is still done repeatedly because the de-quantized data is not stored in RAM.

Tested in isolation and with models that use GQA.

Reduce repeated dequantization of the same data.
@shibe2 shibe2 merged commit 465219b into ggml-org:master Oct 20, 2023