Speedup by replace `HashSet` by a `Vec<bool>` #28

natir · 2021-08-17T14:31:00Z

Hello,

According to #26 I create this pull request to replace HashSet indices storage by a Vec of boolean.

I made a simple benchmark by keeping 250x of 500x of nanopore simulated reads of E. coli k12, with a fixed seed (30 samples).

	min [s]	max [s]	mean [s]	relative
set	5.12	11.08	8.507	1
vec	3.82	11.48	7.674	0.90

I Indicate in issue #26 HashSet operation take 10 % of rasusa run time (I use same parameter), replace HashSet by Vec<bool> save this time.
Output are identical.

Some points need discussion:

I can remove benchmark if you want.
methode indices return a Vec<bool> and the number of read selected (as usize), because we can't get the number of reads selected by get the length of Vec
I change filter_reads_into signature to add nb_reads_keep parameter to keep previous behavior. But with this pull request we iterate over all input reads, so I think we can remove the final filter_reads_into test. What is your opinion about this ?

I try to change test less as possible.

natir · 2021-08-17T14:38:19Z

I would like to point out that the gain in computation time is certainly very dependent on the number of reads selected.

natir · 2021-08-18T00:11:50Z

Check failed because bitvec minimum supported rust version is 1.47.0.

So I assume this answer the first point, I must remove benchmark or at least the part with bitvec.

mbhall88 · 2021-08-18T07:40:33Z

Thank you very much @natir

Here is the benchmark from the readme using this version and the previous version against seqtk

$ hyperfine --warmup 10 --runs 50 --export-markdown results-paired.md \
     "$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD" "$INDICES_CMD"
Benchmark #1: seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;
  Time (mean ± σ):     693.8 ms ±  17.4 ms    [User: 398.0 ms, System: 293.1 ms]
  Range (min … max):   678.8 ms … 780.5 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help t
o use the '--warmup' or '--prepare' options.

Benchmark #2: seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;
  Time (mean ± σ):     739.5 ms ±  28.4 ms    [User: 502.9 ms, System: 235.1 ms]
  Range (min … max):   695.6 ms … 811.8 ms    50 runs

Benchmark #3: ./rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
  Time (mean ± σ):     655.4 ms ±  84.9 ms    [User: 191.5 ms, System: 300.3 ms]
  Range (min … max):   473.6 ms … 824.7 ms    50 runs

Benchmark #4: ./indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
  Time (mean ± σ):     574.7 ms ±  88.1 ms    [User: 128.2 ms, System: 284.6 ms]
  Range (min … max):   432.3 ms … 784.4 ms    50 runs

Summary
  './indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq' ran
    1.14 ± 0.23 times faster than './rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq'
    1.21 ± 0.19 times faster than 'seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;'
    1.29 ± 0.20 times faster than 'seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;'

So using a Vec<bool> is indeed a little bit fast than using a HashSet

Pierre Marijon and others added 2 commits August 14, 2021 18:25

Add benchmark to compare indices storage solution

e171b71

Replace HashSet by a Vec<bool> to store index

5f70e7d

mbhall88 added 5 commits August 18, 2021 14:50

short-circuit when reach desired nb reads

90a1e62

remove benches

2fc83b4

rename some variables

6409ba1

update changelog

15f8777

update benchmark

cdac1dc

mbhall88 merged commit 9b5ed7d into mbhall88:master Aug 18, 2021

natir mentioned this pull request Oct 18, 2021

Use FnvHasher on IndexMap zaeleus/noodles#49

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speedup by replace `HashSet` by a `Vec<bool>` #28

Speedup by replace `HashSet` by a `Vec<bool>` #28

Uh oh!

natir commented Aug 17, 2021

Uh oh!

natir commented Aug 17, 2021

Uh oh!

natir commented Aug 18, 2021

Uh oh!

mbhall88 commented Aug 18, 2021

Uh oh!

Uh oh!

Speedup by replace HashSet by a Vec<bool> #28

Speedup by replace HashSet by a Vec<bool> #28

Uh oh!

Conversation

natir commented Aug 17, 2021

Uh oh!

natir commented Aug 17, 2021

Uh oh!

natir commented Aug 18, 2021

Uh oh!

mbhall88 commented Aug 18, 2021

Uh oh!

Uh oh!

Speedup by replace `HashSet` by a `Vec<bool>` #28

Speedup by replace `HashSet` by a `Vec<bool>` #28