Skip to content

Conversation

joehasson
Copy link
Contributor

Hi! This closes #262 where directory inode sizes are mistakenly being excluded from total disk usage calculations. This PR brings the behaviour of dua in line with that of du in both aggregate (f44e19d) and interactive (ef1833a) modes.

For example, given a directory tree structured as followed

/test_dir/
  ├── file1.txt (1KB)
  ├── empty_dir/
  └── populated_dir/
      └── file2.txt (2KB)

dua currently outputs
image
With these changes:
image

Some notes on the changes made:

  • In aggregate mode it suffices simply to remove the !m.is_dir() condition when traversing, as directory metadata can be handled just the same as non-directory file metadata
  • In interactive mode, the changes are slightly more complex due to how entry size is recursively computed. We have existing logic for computing the size of a directory based solely on its contents (not counting the directory entry itself). In order to account for this entry size in the final total, I have added an own_size field to the EntryData struct, which stores the size just of the directory entry itself. Then, when updating our tree of EntryData nodes (in the set_entry_info_or_panic function), we simply add own_size to the total size of the directory's contents, which is computed just as before.
  • No special cases are needed to account for the --apparent-size flag. Following the lead of du, all directories are reported as consuming up a full block on disk regardless of whether reporting apparent size or disk usage.
  • Changing how entry counts are displayed: in interactive mode we are now displaying directory nodes in the entry_count as well as non-directory files. This makes it clearer that directory entry sizes are included in the calculations of the totals reported.

Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for making this happen, it's much appreciated. I also love the PR description!

Do you think you could add a journey test that observes that directories now have a size?

Also, I wonder what adding another u128 to the structure does to its size. I am seeing the case where there are millions of files. Probably it doesn't matter as things like name will be responsible for most of the memory usage, but it would be good to have something like assert_eq!(std::mem::size_of::<EntryData>(), X, "this shouldn't change unnoticed") somewhere in the test-suite.

This is a dialogue and not a request, hence I am truly interested in what you think, and it would certainly be fine to merge as is otherwise.

@Byron
Copy link
Owner

Byron commented May 6, 2025

Edit: I just realized that CI is already failing because of it:

the size of this (96) should not exceed 80 as it affects overall memory consumption

Maybe we can give it a try and see what would happen if this was a boolean instead?

@joehasson
Copy link
Contributor Author

joehasson commented May 6, 2025

Thanks for taking a look. Yes, on reflection I agree it makes sense to be cautious about adding that extra field to the EntryData struct. I've had a bit of a think about the right way to proceed here.

Alternative 1 - Still try to add directory inode size in set_entry_info_or_panic, but without storing the own_size field on EntryData

I see what you are suggesting regarding the boolean flag: if we know that an entry is a directory, we can add the directory inode's size to the size field whenever we update our tree of EntryData nodes, without having to also store the extra own_size field. Actually, we don't even need to check a boolean flag, since we only ever call set_entry_info_or_panic for directories anyway. The problem is that its not straightforward to compute the directory inode's size when we need it.

For example we might try using a constant as in the following

pub fn set_entry_info_or_panic(...) {
    ...
    // Doesn't work as directory inodes can vary in size
    node.size = size + DIR_INODE_SIZE_CONSTANT;
   ...
}

but the directory inode size may vary depending on the number of children it has. I also thought about grabbing the inode size from the file metadata here e.g. like this

pub fn set_entry_info_or_panic(...) {
   ...
   let node_own_size = std::fs::metadata(&node.name)
        .map(|m| m.len())
        .expect("This panics because node.name is the immediate directory name not a path")
    as u128;
   
    node.size = size + node_own_size;
    ...
}

but we need a proper path here, not just the directory's name. Presumably we don't want to start storing that on the EntryData structure for the same reasons we don't want to start storing own_size. Also, computing the directory entry size like this seems redundant, since we already process the directory metadata once the TraversalEvent for that directory arrives.

Alternative 2 - Track directory inode sizes during recursive traversal of the filesystem

The original PR changes and Alternative 1 both leave untouched the algorithm which recursively computes the size of the directory's contents. This approach would add three fields to the BackgroundTraversal struct to augment the existing recursive traversal:

  • parent_node_size_per_depth_level, is a stack of directory inode sizes for ancestor nodes along the current branch
  • parent_node_size, analogous to parent_node_idx, is the size of the parent node of the entry we are walking
  • previous_node_size, analogous to previous_node_idx, used to updated parent_node_size when we descend a level in the recursive traversal

We then are able to pass the parent_node_size to set_entry_info_or_panic alongside the current_directory_at_depth, to correctly calculate the size of the tree entry.

Using this approach, the extra cost we pay for storing the directory sizes is linear in the height of the file tree, rather than paying a cost for every single file in the file-system. This seems more reasonable to me.

I have created another branch using this approach here. Happy to take your thoughts on any of this - if you agree that alternative 2 looks good then I will finalise this version, update the PR and go about adding that journey test you mentioned. Cheers!

@Byron
Copy link
Owner

Byron commented May 7, 2025

That's great, if you prefer option 2 I will happily go with that. There might be an option 3, which is to real-world test the impact of the added 16 bytes. As option 2 will add complexity, I think we should benchmark the memory usage first, and then see what by what percentage the consumption rises.

Also, please feel free to push these changes here right away, preferably in their own commit, then it's more straightforward to keep this endeavour in one place.
Thanks again.

@Byron Byron force-pushed the feat/include-directory-inodes-in-size-calculations branch from 2fc1953 to be5a4f6 Compare May 9, 2025 11:59
@Byron
Copy link
Owner

Byron commented May 9, 2025

Sorry, I may have just pushed my changes onto yours with the adjustments of numbers, but I am done now.
It's much easier to see what fails now.

@Byron
Copy link
Owner

Byron commented May 9, 2025

Thank you - once the numbers for Windows are adjusted, could you please squash the commits down into one?
Then there should be nothing in the way of merging it.
Thanks again.

@joehasson
Copy link
Contributor Author

Cool thank you. To me these updated tests capture the fact that directories now have size, so no new journey test is needed. Do you agree?
As for the memory usage issue, I did some memory profiling and found that the peak heap utilisation for the original implementation was ~15% higher than the latest release of dua, compared to only a 2-3% increase for the improved implementation, so I think the slight added complexity is worth it. This was running against my own $HOME directory of around 560k files - I can share the script I used to do this profiling if you would like to take a look.
I see that the tests are now passing on linux but still failing on windows - I'll take a look as to why.

@joehasson joehasson force-pushed the feat/include-directory-inodes-in-size-calculations branch 2 times, most recently from 7004bd9 to 2966e82 Compare May 9, 2025 13:21
@joehasson joehasson force-pushed the feat/include-directory-inodes-in-size-calculations branch from 2966e82 to a93b28e Compare May 9, 2025 13:24
@joehasson
Copy link
Contributor Author

Fixed and squashed - hopefully good to go now. Cheers.

Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work, thank you!

@Byron Byron merged commit b5b411b into Byron:main May 10, 2025
2 checks passed
@Byron
Copy link
Owner

Byron commented May 10, 2025

And here is the new release: https://github.com/Byron/dua-cli/releases/tag/v2.30.1 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Directories themselves do not contribute to size like they do in du
2 participants