Skip to content

Conversation

mbhall88
Copy link
Owner

The random_sort function was only introducing randomness at the comparison level rather than properly shuffling groups of records with the same position. This meant that records with identical positions would maintain their original relative order instead of being randomly distributed within their group.

Changes:

  • Replace random_sort/random_compare with shuffle_records_by_position
  • Group records by position before shuffling within each group
  • Add comprehensive tests including edge cases and regression tests
  • Fix clippy format string warnings in alignment.rs, fastx.rs, and subsampler.rs
  • Add demonstration test that reproduces the original bug

The new implementation ensures that records with the same genomic position are properly randomized while maintaining correct positional ordering between different positions, fixing the reproducible bias in subsampling results.

Fixes #76

The random_sort function was only introducing randomness at the comparison
level rather than properly shuffling groups of records with the same position.
This meant that records with identical positions would maintain their original
relative order instead of being randomly distributed within their group.

Changes:
- Replace random_sort/random_compare with shuffle_records_by_position
- Group records by position before shuffling within each group
- Add comprehensive tests including edge cases and regression tests
- Fix clippy format string warnings in alignment.rs, fastx.rs, and subsampler.rs
- Add demonstration test that reproduces the original bug

The new implementation ensures that records with the same genomic position
are properly randomized while maintaining correct positional ordering between
different positions, fixing the reproducible bias in subsampling results.

Fixes #76
@mbhall88 mbhall88 requested a review from Copilot July 25, 2025 05:31
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug in the alignment subcommand where records with the same genomic position were not being properly randomized during subsampling. The random_sort function only introduced randomness at the comparison level rather than properly shuffling groups of records with identical positions, leading to reproducible bias in subsampling results.

  • Replace flawed random_sort implementation with proper group-based shuffling
  • Add comprehensive test coverage including regression tests that demonstrate the original bug
  • Fix clippy format string warnings across multiple files

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/alignment.rs Core fix: replaces random_sort with shuffle_records_by_position and adds extensive tests
src/subsampler.rs Updates println! format string to use newer syntax
src/reads.rs Fixes format string in error message
src/fastx.rs Updates Error::new to use Error::other method
src/cli.rs Updates multiple format strings to use newer syntax
Comments suppressed due to low confidence (4)

src/alignment.rs:691

  • [nitpick] Test contains println! statements that will clutter test output. Consider using debug assertions or removing these print statements for cleaner test execution.
        println!("Original order: {original_names:?}");

src/alignment.rs:798

  • [nitpick] Test contains println! statements that will clutter test output. Consider using debug assertions or removing these print statements for cleaner test execution.
        println!(

src/alignment.rs:878

  • [nitpick] Test contains println! statements that will clutter test output. Consider using debug assertions or removing these print statements for cleaner test execution.
        println!("Old approach: {same_order_count} out of 20 iterations kept the same order");

src/alignment.rs:892

  • [nitpick] Test contains println! statements that will clutter test output. Consider using debug assertions or removing these print statements for cleaner test execution.
        println!("New approach: {new_same_order_count} out of 20 iterations kept the same order");

@mbhall88 mbhall88 merged commit d0264ae into main Jul 25, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subsampling an alignment file results in "user-provided comparison function does not correctly implement a total order" error
1 participant