Performance optimization: Skip compilation of LTOs for adjoint matmuls #644

In `tile_matmul_lto_dispatch_func()`, we should be able to save some compilation time by skipping the LTO builds for the adjoint operations when `enable_backward` for the module is `False`, since the adjoint functions are not needed in that case.

https://github.com/NVIDIA/warp/blob/3635b9af0f398e4dd9dea5ee8d8d1448a73c174c/warp/builtins.py#L6547-L6576

	# adjA += adjC * B^T - Transpose ~= flipped layout
	(fun_backward_A, lto_backward_A) = warp.build.build_lto_dot(
	    M,
	    K,
	    N,
	    out.type.dtype,
	    b.type.dtype,
	    a.type.dtype,
	    out.type.layout,
	    tile_flip_layout(b.type.layout),
	    a.type.layout,
	    arch,
	    num_threads,
	    builder,
	)
	# adjB += A^T * adjC - Transpose ~= flipped layout
	(fun_backward_B, lto_backward_B) = warp.build.build_lto_dot(
	    K,
	    N,
	    M,
	    a.type.dtype,
	    out.type.dtype,
	    b.type.dtype,
	    tile_flip_layout(a.type.layout),
	    out.type.layout,
	    b.type.layout,
	    arch,
	    num_threads,
	    builder,
	)
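
The proposed guard could look roughly like the sketch below. It does not call Warp: `build_lto_dot_stub` and `dispatch_sketch` are hypothetical stand-ins for `warp.build.build_lto_dot` and the dispatch function, and `enable_backward` is passed as a plain boolean because how the real function reads the module option is not shown in the excerpt. The argument order (output rows, output columns, inner dimension) is inferred from the two adjoint calls above, and the forward call is an assumption.

```python
# Records which LTO builds were requested, so the effect of the guard is visible.
build_calls = []


def build_lto_dot_stub(rows, cols, inner, label):
    """Hypothetical stand-in for warp.build.build_lto_dot (an expensive build)."""
    build_calls.append(label)
    return (f"fun_{label}", f"lto_{label}")


def dispatch_sketch(M, N, K, enable_backward):
    """Always build the forward LTO; build the adjoint LTOs only when needed."""
    # Forward: C (M x N) = A (M x K) * B (K x N). Assumed call, not in the excerpt.
    fun_forward, lto_forward = build_lto_dot_stub(M, N, K, "forward")
    ltos = [lto_forward]
    fun_backward_A = None
    fun_backward_B = None

    if enable_backward:
        # adjA += adjC * B^T: output is M x K, inner dimension N
        fun_backward_A, lto_backward_A = build_lto_dot_stub(M, K, N, "backward_A")
        # adjB += A^T * adjC: output is K x N, inner dimension M
        fun_backward_B, lto_backward_B = build_lto_dot_stub(K, N, M, "backward_B")
        ltos += [lto_backward_A, lto_backward_B]

    return (fun_forward, fun_backward_A, fun_backward_B), ltos


if __name__ == "__main__":
    dispatch_sketch(16, 16, 8, enable_backward=False)
    print(build_calls)
```

With the guard in place, a module with backward disabled requests one build instead of three. Any code that consumes the adjoint functions would need to tolerate the `None` placeholders, which is a design choice of this sketch rather than something the issue specifies.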
