Skip to content

Conversation

natir
Copy link
Contributor

@natir natir commented Aug 17, 2021

Hello,

According to #26 I create this pull request to replace HashSet indices storage by a Vec of boolean.

I made a simple benchmark by keeping 250x of 500x of nanopore simulated reads of E. coli k12, with a fixed seed (30 samples).

min [s] max [s] mean [s] relative
set 5.12 11.08 8.507 1
vec 3.82 11.48 7.674 0.90

I Indicate in issue #26 HashSet operation take 10 % of rasusa run time (I use same parameter), replace HashSet by Vec<bool> save this time.
Output are identical.

Some points need discussion:

  1. I can remove benchmark if you want.
  2. methode indices return a Vec<bool> and the number of read selected (as usize), because we can't get the number of reads selected by get the length of Vec
  3. I change filter_reads_into signature to add nb_reads_keep parameter to keep previous behavior. But with this pull request we iterate over all input reads, so I think we can remove the final filter_reads_into test. What is your opinion about this ?

I try to change test less as possible.

@natir
Copy link
Contributor Author

natir commented Aug 17, 2021

I would like to point out that the gain in computation time is certainly very dependent on the number of reads selected.

@natir
Copy link
Contributor Author

natir commented Aug 18, 2021

Check failed because bitvec minimum supported rust version is 1.47.0.

So I assume this answer the first point, I must remove benchmark or at least the part with bitvec.

@mbhall88
Copy link
Owner

Thank you very much @natir

Here is the benchmark from the readme using this version and the previous version against seqtk

$ hyperfine --warmup 10 --runs 50 --export-markdown results-paired.md \
     "$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD" "$INDICES_CMD"
Benchmark #1: seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;
  Time (mean ± σ):     693.8 ms ±  17.4 ms    [User: 398.0 ms, System: 293.1 ms]
  Range (min … max):   678.8 ms … 780.5 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help t
o use the '--warmup' or '--prepare' options.

Benchmark #2: seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;
  Time (mean ± σ):     739.5 ms ±  28.4 ms    [User: 502.9 ms, System: 235.1 ms]
  Range (min … max):   695.6 ms … 811.8 ms    50 runs

Benchmark #3: ./rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
  Time (mean ± σ):     655.4 ms ±  84.9 ms    [User: 191.5 ms, System: 300.3 ms]
  Range (min … max):   473.6 ms … 824.7 ms    50 runs

Benchmark #4: ./indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
  Time (mean ± σ):     574.7 ms ±  88.1 ms    [User: 128.2 ms, System: 284.6 ms]
  Range (min … max):   432.3 ms … 784.4 ms    50 runs

Summary
  './indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq' ran
    1.14 ± 0.23 times faster than './rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq'
    1.21 ± 0.19 times faster than 'seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;'
    1.29 ± 0.20 times faster than 'seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;'

So using a Vec<bool> is indeed a little bit fast than using a HashSet

@mbhall88 mbhall88 merged commit 9b5ed7d into mbhall88:master Aug 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants