Conversation

JojiiOfficial
Contributor

No description provided.

@JojiiOfficial JojiiOfficial changed the title from "Reduce usage of h-sum in dot_avx calculation" to "Improve h-sum in dot_avx calculation" Aug 8, 2025
let hsum = _mm_hadd_ps(lr_sum, lr_sum);
let p1 = _mm_extract_ps(hsum, 0);
let p2 = _mm_extract_ps(hsum, 1);
f32::from_bits(p1 as u32) + f32::from_bits(p2 as u32)
Contributor Author

The new implementation uses one fewer instruction: https://godbolt.org/z/5no3W3s9K
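For readers without the diff at hand, here is a minimal sketch of what a four-way horizontal sum can look like. It is not the code merged in this PR: the names `four_way_hsum_sketch` and `hsum_checked` are hypothetical, and the reduction order (add the four registers vertically, fold the halves, then two `_mm_hadd_ps` steps) is just one reasonable choice that may differ from the version measured on godbolt.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sums all 32 f32 lanes held in four 256-bit registers.
/// Hypothetical sketch, not the implementation from the PR.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
pub unsafe fn four_way_hsum_sketch(a: __m256, b: __m256, c: __m256, d: __m256) -> f32 {
    // Add the four registers lane by lane first, so that only a single
    // horizontal reduction is needed at the end.
    let sum = _mm256_add_ps(_mm256_add_ps(a, b), _mm256_add_ps(c, d));
    // Fold the upper 128-bit half onto the lower half.
    let lo = _mm256_castps256_ps128(sum);
    let hi = _mm256_extractf128_ps::<1>(sum);
    let lr_sum = _mm_add_ps(lo, hi);
    // Two horizontal adds leave the total of the four lanes in lane 0.
    let h = _mm_hadd_ps(lr_sum, lr_sum);
    let h = _mm_hadd_ps(h, h);
    _mm_cvtss_f32(h)
}

/// Safe wrapper (hypothetical): uses AVX when the CPU supports it,
/// otherwise falls back to a scalar sum.
pub fn hsum_checked(values: &[f32; 32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified at runtime, and each
            // unaligned load reads 8 floats that lie inside `values`.
            unsafe {
                let p = values.as_ptr();
                return four_way_hsum_sketch(
                    _mm256_loadu_ps(p),
                    _mm256_loadu_ps(p.add(8)),
                    _mm256_loadu_ps(p.add(16)),
                    _mm256_loadu_ps(p.add(24)),
                );
            }
        }
    }
    values.iter().sum()
}
```

Note that the SIMD path and the scalar fallback add the values in a different order, so for general floating-point inputs the two results may differ in the last bits.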

/// Calculates the hsum (horizontal sum) of four 32 byte registers.
#[target_feature(enable = "avx")]
#[allow(clippy::missing_safety_doc)]
pub unsafe fn four_way_hsum(a: __m256, b: __m256, c: __m256, d: __m256) -> f32 {
Member

Does adding #[inline] here make any difference? (I expect not)
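As a point of reference for this question, the sketch below shows how `#[inline]` can be combined with `#[target_feature]`. It does not settle whether the hint changes the generated code for `four_way_hsum`; that would need to be checked in the compiler output. The names `hsum128_sketch`, `hsum256_sketch` and `sum8` are hypothetical. It assumes the usual behaviour that `#[inline]` is only a hint, and that a `#[target_feature]` function can be inlined only into callers that enable at least the same features.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical helper: horizontal sum of one 128-bit register.
/// `#[inline]` is a hint only; the compiler may inline without it
/// or decline to inline with it.
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "sse3")]
unsafe fn hsum128_sketch(v: __m128) -> f32 {
    let h = _mm_hadd_ps(v, v);
    let h = _mm_hadd_ps(h, h);
    _mm_cvtss_f32(h)
}

/// Hypothetical caller. AVX implies SSE3, so the helper above is a
/// candidate for inlining into this function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn hsum256_sketch(v: __m256) -> f32 {
    let lo = _mm256_castps256_ps128(v);
    let hi = _mm256_extractf128_ps::<1>(v);
    // SAFETY: this function requires AVX, which implies SSE3.
    unsafe { hsum128_sketch(_mm_add_ps(lo, hi)) }
}

/// Safe entry point (hypothetical) with a scalar fallback.
pub fn sum8(values: &[f32; 8]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified at runtime, and the
            // unaligned load reads exactly the 8 floats in `values`.
            return unsafe { hsum256_sketch(_mm256_loadu_ps(values.as_ptr())) };
        }
    }
    values.iter().sum()
}
```

Comparing the assembly of a caller with and without the attribute, for example on godbolt, is the most direct way to see whether it matters in a given build configuration.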

@timvisee timvisee merged commit 152bc5e into dev Aug 13, 2025
16 checks passed
@timvisee timvisee deleted the improve_hsum_calculation branch August 13, 2025 10:21
timvisee pushed a commit that referenced this pull request Aug 14, 2025
* Reduce amount of h-sum calculations in dot_avx

* Improve HSUM calculation and apply 4way-hsum to other places
@timvisee timvisee mentioned this pull request Aug 14, 2025