Bad performance when scanning large datasets on HDDs #835

@sergeyvolk

I have noticed that Czkawka's performance can be suboptimal when scanning for duplicates on HDDs. As far as I can see from the source code, you are using rayon's parallel iterators to read and scan several files simultaneously (for example, .into_par_iter()). This works great for SSDs but causes terrible performance on HDDs: when several files located in different parts of the disk are read at the same time, a lot of time is wasted on seeking (moving the mechanical read/write heads back and forth). In my case, an external 4TB HDD that is typically capable of ~100-120MB/sec sequential read speed when reading a single file was getting only ~10-20MB/sec because of this excessive seeking. Czkawka was trying to read 4 files from it simultaneously (the disk queue size observed in Windows Performance Monitor was around 4).

To get optimal performance on HDDs, it would be nice to have an option to limit the number of threads used for scanning. Ideally it should be limited to 1 thread per physical HDD, but I understand that this is non-trivial to do without major refactoring. As a quick workaround, perhaps we could provide at least an option to limit the number of threads to 1 globally? Even that would be much better than trying to read 4 files from an HDD at once (and it only gets worse with a more powerful CPU that has more cores). I can see in the Rayon FAQ that it's possible to limit the number of threads by setting the RAYON_NUM_THREADS environment variable, but I'm using the Windows GUI and don't know how to set that. Can we add a GUI option to set the number of threads or to disable parallelism entirely (i.e. not use par_iter() at all)?
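To make the request concrete, here is a minimal sketch of what a "maximum threads" setting could look like. It is not Czkawka's code and it deliberately uses only the standard library instead of rayon. The names effective_threads and process_with_limit are hypothetical, and the convention that 0 means "use all cores" is an assumption. With a limit of 1 the items are processed strictly one after another, which is the behaviour I would want for a single HDD.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical user setting: 0 means "use every available core",
/// any other value is taken as the maximum number of worker threads.
pub fn effective_threads(requested: usize) -> usize {
    if requested == 0 {
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    } else {
        requested
    }
}

/// Applies `work` to every item using at most `max_threads` workers.
/// Results are returned in the same order as the input.
pub fn process_with_limit<T, R, F>(items: &[T], max_threads: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Never start more workers than there are items (and always at least one).
    let workers = effective_threads(max_threads).min(items.len().max(1));

    if workers == 1 {
        // Strictly sequential: one item (e.g. one file) at a time, no extra threads.
        return items.iter().map(|item| work(item)).collect();
    }

    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);
    let results: Mutex<Vec<Option<R>>> =
        Mutex::new((0..items.len()).map(|_| None).collect());

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items.len() {
                    break;
                }
                let result = work(&items[i]);
                results.lock().unwrap()[i] = Some(result);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|r| r.expect("every index is processed exactly once"))
        .collect()
}
```

In a real scan, work would be whatever reads and hashes one file, and the GUI would pass the user's chosen limit as max_threads. Since Czkawka already uses rayon, the same effect could probably be achieved there without a custom loop: rayon documents ThreadPoolBuilder::num_threads together with build_global for configuring the global pool, which is the programmatic counterpart of RAYON_NUM_THREADS.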
