Nicoshev
Contributor

Hello, I hope you are doing well

I wanted to propose an additional code optimization for the decompression function:

The variable cpy is no longer used in the code paths that copy literals within the fast decompression loop.
This reduces CPU register pressure and shrinks certain code branches.
On average, I measured slightly over 1% improvement in decompression speed when decompressing the Silesia corpus on an x64 CPU.
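
As a rough illustration of the idea (not the actual diff), the sketch below contrasts a literal copy that keeps the end pointer in a separate cpy variable with one that computes the end pointer inline and advances op directly. The helper copy_until and the two copy_literals_* functions are hypothetical names chosen for this example; the real loop uses the library's own wild-copy routines and bounds checks, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char BYTE;

/* Hypothetical stand-in for the library's copy helper:
 * copies bytes from src to dst until dst reaches dst_end. */
static void copy_until(BYTE *dst, const BYTE *src, const BYTE *dst_end)
{
    assert(dst_end >= dst);
    memcpy(dst, src, (size_t)(dst_end - dst));
}

/* Before: the end of the literal run is kept in a separate variable, cpy,
 * which stays live across the copy. */
static void copy_literals_with_cpy(BYTE **op, const BYTE **ip, size_t length)
{
    BYTE *cpy = *op + length;
    copy_until(*op, *ip, cpy);
    *ip += length;
    *op = cpy;
}

/* After: the end pointer is computed inline and op is advanced by length,
 * so no extra variable has to stay live in this code path. */
static void copy_literals_without_cpy(BYTE **op, const BYTE **ip, size_t length)
{
    copy_until(*op, *ip, *op + length);
    *ip += length;
    *op += length;
}

/* Returns 1 when both variants write the same bytes and leave the
 * input and output pointers at the same positions. */
static int literal_copy_variants_agree(void)
{
    const BYTE src[] = "literal run";
    BYTE out_a[16] = {0};
    BYTE out_b[16] = {0};
    const BYTE *ip_a = src;
    const BYTE *ip_b = src;
    BYTE *op_a = out_a;
    BYTE *op_b = out_b;
    const size_t length = 7;

    copy_literals_with_cpy(&op_a, &ip_a, length);
    copy_literals_without_cpy(&op_b, &ip_b, length);

    return memcmp(out_a, out_b, sizeof out_a) == 0
        && (op_a - out_a) == (op_b - out_b)
        && (ip_a - src) == (ip_b - src)
        && (size_t)(op_b - out_b) == length;
}
```

Both variants are behaviorally equivalent; whether the second form actually frees a register or shortens a branch depends on the compiler and target.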

This optimization works very well with the one proposed in the PR "Optimize LZ4_memcpy_using_offset".
When both optimizations are applied together, the usage of the variable cpy is completely removed from the fast decompression loop.

Regards,
Nicolas

@Cyan4973
Member

I've been able to observe very small (<1%) gains on some platforms (M1 + clang, 9700k + gcc-11).
On x64 "skylake-era" CPUs, performance fluctuations are a mess, sometimes increasing (clang) or decreasing (gcc-9) by a lot, depending on compiler version. This effect seems unrelated to this PR, and is more a symptom of instruction alignment, an uncontrollable side effect to which this CPU generation is particularly sensitive.

In the end, I mostly like that this PR tends to make the code more readable, and it's a good enough reason.

@Cyan4973 Cyan4973 merged commit 76106cb into lz4:dev Aug 14, 2023
@Nicoshev Nicoshev deleted the reduce_cpy_variable_usage branch January 14, 2024 18:51